author | Yuhao Yang <yuhao.yang@intel.com> | 2017-04-03 11:42:33 +0200 |
---|---|---|
committer | Nick Pentreath <nickp@za.ibm.com> | 2017-04-03 11:42:33 +0200 |
commit | 4d28e8430d11323f08657ca8f3251ca787c45501 | |
tree | 6bc6a719c21f2a6c973419bfbf7776284df3a523 /docs/ml-features.md | |
parent | fb5869f2cf94217b3e254e2d0820507dc83a25cc | |
[SPARK-19969][ML] Imputer doc and example
## What changes were proposed in this pull request?
Add docs and examples for spark.ml.feature.Imputer. Currently, Scala and Java examples are included. The Python example will be added after https://github.com/apache/spark/pull/17316
## How was this patch tested?
local doc generation and example execution
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes #17324 from hhbyyh/imputerdoc.
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r-- | docs/ml-features.md | 66 |
1 files changed, 66 insertions, 0 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index dad1c6db18..e19fba249f 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1284,6 +1284,72 @@ for more details on the API.
 </div>
+
+## Imputer
+
+The `Imputer` transformer completes missing values in a dataset, either using the mean or the
+median of the columns in which the missing values are located. The input columns should be of
+`DoubleType` or `FloatType`. Currently `Imputer` does not support categorical features and possibly
+creates incorrect values for columns containing categorical features.
+
+**Note** all `null` values in the input columns are treated as missing, and so are also imputed.
+
+**Examples**
+
+Suppose that we have a DataFrame with the columns `a` and `b`:
+
+~~~
+      a     |      b
+------------|-----------
+     1.0    | Double.NaN
+     2.0    | Double.NaN
+ Double.NaN |     3.0
+     4.0    |     4.0
+     5.0    |     5.0
+~~~
+
+In this example, Imputer will replace all occurrences of `Double.NaN` (the default for the missing value)
+with the mean (the default imputation strategy) computed from the other values in the corresponding columns.
+In this example, the surrogate values for columns `a` and `b` are 3.0 and 4.0 respectively. After
+transformation, the missing values in the output columns will be replaced by the surrogate value for
+the relevant column.
+
+~~~
+      a     |      b     | out_a | out_b
+------------|------------|-------|-------
+     1.0    | Double.NaN |  1.0  |  4.0
+     2.0    | Double.NaN |  2.0  |  4.0
+ Double.NaN |     3.0    |  3.0  |  3.0
+     4.0    |     4.0    |  4.0  |  4.0
+     5.0    |     5.0    |  5.0  |  5.0
+~~~
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [Imputer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Imputer)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/ImputerExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+Refer to the [Imputer Java docs](api/java/org/apache/spark/ml/feature/Imputer.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaImputerExample.java %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+Refer to the [Imputer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Imputer)
+for more details on the API.
+
+{% include_example python/ml/imputer_example.py %}
+</div>
+</div>
+
 # Feature Selectors
 ## VectorSlicer
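The surrogate values quoted in the Imputer example (3.0 for `a`, 4.0 for `b`) can be checked with a small plain-Python sketch. This is not the Spark API and not the code from `ImputerExample.scala`. The helper names `is_missing` and `impute` are hypothetical, and the median branch uses an exact median, which may not match how Spark computes it.

```python
import math
from statistics import mean, median


def is_missing(value, missing_value=float("nan")):
    """Return True for None, or for a value matching missing_value (NaN by default)."""
    if value is None:
        return True
    if isinstance(missing_value, float) and math.isnan(missing_value):
        # NaN never compares equal to itself, so it needs an explicit check.
        return isinstance(value, float) and math.isnan(value)
    return value == missing_value


def impute(column, strategy="mean", missing_value=float("nan")):
    """Hypothetical helper: return (surrogate, imputed copy of column)."""
    present = [v for v in column if not is_missing(v, missing_value)]
    # The surrogate is computed only from the values that are present.
    surrogate = mean(present) if strategy == "mean" else median(present)
    imputed = [surrogate if is_missing(v, missing_value) else v for v in column]
    return surrogate, imputed


nan = float("nan")
a = [1.0, 2.0, nan, 4.0, 5.0]
b = [nan, nan, 3.0, 4.0, 5.0]

surrogate_a, out_a = impute(a)
surrogate_b, out_b = impute(b)

print(surrogate_a, out_a)  # 3.0 [1.0, 2.0, 3.0, 4.0, 5.0]
print(surrogate_b, out_b)  # 4.0 [4.0, 4.0, 3.0, 4.0, 5.0]
```

The results agree with the `out_a` and `out_b` columns in the documentation table: each column's mean is taken over its non-missing entries only, so `a` averages four values and `b` averages three.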