author: Yuhao Yang <yuhao.yang@intel.com> 2017-04-03 11:42:33 +0200
committer: Nick Pentreath <nickp@za.ibm.com> 2017-04-03 11:42:33 +0200
commit: 4d28e8430d11323f08657ca8f3251ca787c45501
tree: 6bc6a719c21f2a6c973419bfbf7776284df3a523 /docs
parent: fb5869f2cf94217b3e254e2d0820507dc83a25cc
[SPARK-19969][ML] Imputer doc and example
## What changes were proposed in this pull request?

Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. The Python example will be added after https://github.com/apache/spark/pull/17316

## How was this patch tested?

Local doc generation and example execution.

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #17324 from hhbyyh/imputerdoc.
Diffstat (limited to 'docs')
-rw-r--r-- docs/ml-features.md | 66
1 file changed, 66 insertions, 0 deletions
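
The imputation behaviour documented in the patch below can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, not the Spark API: the function name `impute` and its `strategy` parameter are hypothetical, and the median here is computed exactly, which is an assumption of the sketch rather than a statement about how Spark computes it. Both `NaN` and `None` (standing in for `null`) are treated as missing, as the patch describes.

```python
import math
from statistics import mean, median


def is_missing(value):
    # None stands in for null; NaN is the default missing-value marker.
    return value is None or math.isnan(value)


def impute(rows, strategy="mean"):
    """Return (surrogates, imputed_rows) for a list of equal-length tuples.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    aggregate = mean if strategy == "mean" else median
    columns = list(zip(*rows))
    # One surrogate per column, computed from the non-missing values only.
    surrogates = [
        aggregate([v for v in column if not is_missing(v)])
        for column in columns
    ]
    # Replace each missing cell with the surrogate of its own column.
    imputed_rows = [
        tuple(s if is_missing(v) else v for v, s in zip(row, surrogates))
        for row in rows
    ]
    return surrogates, imputed_rows


nan = math.nan
# Columns a and b from the example table in the patch.
rows = [(1.0, nan), (2.0, nan), (nan, 3.0), (4.0, 4.0), (5.0, 5.0)]
surrogates, imputed = impute(rows)
print(surrogates)  # [3.0, 4.0]
print(imputed)
```

With the default `"mean"` strategy the surrogates are 3.0 for `a` and 4.0 for `b`, and the imputed rows match the `out_a` and `out_b` columns shown in the patch.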
diff --git a/docs/ml-features.md b/docs/ml-features.md
index dad1c6db18..e19fba249f 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1284,6 +1284,72 @@ for more details on the API.
</div>
+
+## Imputer
+
+The `Imputer` transformer completes missing values in a dataset, using either the mean or the
+median of the column in which each missing value is located. The input columns should be of
+`DoubleType` or `FloatType`. Currently `Imputer` does not support categorical features and may
+create incorrect values for columns containing categorical features.
+
+**Note:** All `null` values in the input columns are treated as missing, and so are also imputed.
+
+**Examples**
+
+Suppose that we have a DataFrame with the columns `a` and `b`:
+
+~~~
+ a | b
+------------|-----------
+ 1.0 | Double.NaN
+ 2.0 | Double.NaN
+ Double.NaN | 3.0
+ 4.0 | 4.0
+ 5.0 | 5.0
+~~~
+
+In this example, `Imputer` will replace all occurrences of `Double.NaN` (the default for the missing value)
+with the mean (the default imputation strategy) computed from the other values in the corresponding columns.
+The surrogate values for columns `a` and `b` are 3.0 and 4.0 respectively. After
+transformation, the missing values in the output columns will be replaced by the surrogate value for
+the relevant column.
+
+~~~
+ a | b | out_a | out_b
+------------|------------|-------|-------
+ 1.0 | Double.NaN | 1.0 | 4.0
+ 2.0 | Double.NaN | 2.0 | 4.0
+ Double.NaN | 3.0 | 3.0 | 3.0
+ 4.0 | 4.0 | 4.0 | 4.0
+ 5.0 | 5.0 | 5.0 | 5.0
+~~~
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [Imputer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Imputer)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/ImputerExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+Refer to the [Imputer Java docs](api/java/org/apache/spark/ml/feature/Imputer.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaImputerExample.java %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+Refer to the [Imputer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Imputer)
+for more details on the API.
+
+{% include_example python/ml/imputer_example.py %}
+</div>
+</div>
+
# Feature Selectors
## VectorSlicer