path: root/docs/ml-features.md
author    VinceShieh <vincent.xie@intel.com>    2016-10-27 11:52:15 -0700
committer Joseph K. Bradley <joseph@databricks.com>    2016-10-27 11:52:15 -0700
commit    0b076d4cb6afde2946124e6411ed6a6ce7b8b1a7 (patch)
tree      6a7af2f31ce677e529b631d726077b58e78490da /docs/ml-features.md
parent    104232580528c097a284d753adb5795f6de8b0a5 (diff)
[SPARK-17219][ML] enhanced NaN value handling in Bucketizer
## What changes were proposed in this pull request?

This PR is an enhancement of the PR with commit ID 57dc326bd00cf0a49da971e9c573c48ae28acaa2.

NaN is a special value that is commonly treated as invalid, but there are cases where NaN values are meaningful and need special handling. This PR gives the user three options for dealing with NaN values: reserve an extra bucket for them, remove them, or report an error, by setting handleNaN to "keep", "skip", or "error" (default), respectively.

Before:

```scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
```

After:

```scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleNaN("keep")
```

## How was this patch tested?

Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite.

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15428 from VinceShieh/spark-17219_followup.
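As additional context, here is a minimal runnable sketch of the "keep" behavior described above. It assumes the option is exposed as `handleInvalid` (the parameter name used in the documentation change below) rather than `handleNaN`; the sample data, splits, and object name are illustrative only.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

object BucketizerNaNSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BucketizerNaNSketch").getOrCreate()
    import spark.implicits._

    // Three regular buckets: (-inf, -0.5), [-0.5, 0.5), [0.5, +inf)
    val splits = Array(Double.NegativeInfinity, -0.5, 0.5, Double.PositiveInfinity)
    val df = Seq(-0.9, -0.1, 0.2, 0.8, Double.NaN).toDF("feature")

    val bucketizer = new Bucketizer()
      .setInputCol("feature")
      .setOutputCol("result")
      .setSplits(splits)
      .setHandleInvalid("keep") // alternatives: "skip" (drop NaN rows), "error" (default)

    // With "keep", the NaN row is assigned to an extra bucket (index 3 here),
    // one past the last regular bucket.
    bucketizer.transform(df).show()

    spark.stop()
  }
}
```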
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r--  docs/ml-features.md | 15
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/docs/ml-features.md b/docs/ml-features.md
index a7f710fa52..64c6a16023 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1103,11 +1103,16 @@ for more details on the API.
`QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
categorical features. The number of bins is set by the `numBuckets` parameter. It is possible
-that the number of buckets used will be less than this value, for example, if there are too few
-distinct values of the input to create enough distinct quantiles. Note also that NaN values are
-handled specially and placed into their own bucket. For example, if 4 buckets are used, then
-non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
-The bin ranges are chosen using an approximate algorithm (see the documentation for
+that the number of buckets used will be smaller than this value, for example, if there are too few
+distinct values of the input to create enough distinct quantiles.
+
+NaN values: Note also that QuantileDiscretizer
+will raise an error when it finds NaN values in the dataset, but the user can also choose to either
+keep or remove NaN values within the dataset by setting `handleInvalid`. If the user chooses to keep
+NaN values, they will be handled specially and placed into their own bucket. For example, if 4 buckets
+are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+
+Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
[approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) for a
detailed description). The precision of the approximation can be controlled with the
`relativeError` parameter. When set to zero, exact quantiles are calculated
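
Following the patch, a hedged sketch of calling `approxQuantile` directly to see the effect of `relativeError`; the data, probabilities, and object name are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object ApproxQuantileSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ApproxQuantileSketch").getOrCreate()
    import spark.implicits._

    val df = (1 to 10000).map(_.toDouble).toDF("value")

    // relativeError = 0.01 allows a small approximation error;
    // 0.0 requests exact quantiles at a higher computational cost.
    val approx = df.stat.approxQuantile("value", Array(0.25, 0.5, 0.75), 0.01)
    val exact  = df.stat.approxQuantile("value", Array(0.25, 0.5, 0.75), 0.0)

    println(s"approximate: ${approx.mkString(", ")}")
    println(s"exact:       ${exact.mkString(", ")}")

    spark.stop()
  }
}
```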