[SPARK-17219][ML] enhanced NaN value handling in Bucketizer

## What changes were proposed in this pull request?

This PR is an enhancement of the PR with commit ID 57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special type of value that is commonly treated as invalid, but there are cases where NaN values are also meaningful and therefore need special handling.
This PR gives users three options for dealing with NaN values: reserve an extra bucket for them, remove them, or report an error, by setting handleNaN to "keep", "skip", or "error" (default), respectively.

'''Before:
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleNaN("keep")

## How was this patch tested?

Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite.

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15428 from VinceShieh/spark-17219_followup.
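
Illustrative note: the three options described in the commit message can be sketched in plain Python, without Spark. The function `bucketize` and its `handle_nan` argument are hypothetical names used only for this illustration and are not part of the Spark API; the sketch assumes sorted splits, left-closed buckets with the last bucket also including the upper bound, and that "keep" assigns NaN the index right after the regular buckets.

```python
import bisect
import math


def bucketize(values, splits, handle_nan="error"):
    """Map each value to a bucket index using sorted split points.

    Hypothetical helper for illustration only (not Spark code).
    handle_nan: "keep" puts NaN in an extra bucket, "skip" drops NaN,
    "error" raises on NaN.
    """
    if handle_nan not in ("keep", "skip", "error"):
        raise ValueError("handle_nan must be 'keep', 'skip', or 'error'")
    num_buckets = len(splits) - 1
    result = []
    for v in values:
        if math.isnan(v):
            if handle_nan == "keep":
                # Extra bucket placed after the regular buckets 0..num_buckets-1.
                result.append(num_buckets)
            elif handle_nan == "error":
                raise ValueError("NaN value found; set handle_nan to 'keep' or 'skip'")
            # "skip": drop the value and move on.
            continue
        if v < splits[0] or v > splits[-1]:
            raise ValueError("value %r is outside the split range" % v)
        if v == splits[-1]:
            # The upper bound belongs to the last regular bucket.
            result.append(num_buckets - 1)
        else:
            # Index of the left-closed bucket [splits[i], splits[i+1]) containing v.
            result.append(bisect.bisect_right(splits, v) - 1)
    return result


splits = [0.0, 1.0, 2.0, 3.0, 4.0]  # four regular buckets, indices 0-3
data = [0.5, float("nan"), 3.5, 4.0]

kept = bucketize(data, splits, handle_nan="keep")
skipped = bucketize(data, splits, handle_nan="skip")
```

With four regular buckets, "keep" assigns the NaN entry to index 4, while "skip" returns one fewer element than the input; the default "error" raises instead of returning.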
author: VinceShieh <vincent.xie@intel.com> 2016-10-27 11:52:15 -0700
committer: Joseph K. Bradley <joseph@databricks.com> 2016-10-27 11:52:15 -0700
commit: 0b076d4cb6afde2946124e6411ed6a6ce7b8b1a7 (patch)
tree: 6a7af2f31ce677e529b631d726077b58e78490da /python/pyspark/ml/feature.py
parent: 104232580528c097a284d753adb5795f6de8b0a5 (diff)
download: spark-0b076d4cb6afde2946124e6411ed6a6ce7b8b1a7.tar.gz
spark-0b076d4cb6afde2946124e6411ed6a6ce7b8b1a7.tar.bz2
spark-0b076d4cb6afde2946124e6411ed6a6ce7b8b1a7.zip
1 files changed, 0 insertions, 5 deletions
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 7683360664..94afe82a36 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -1155,11 +1155,6 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
 
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
-    It is possible that the number of buckets used will be less than this value, for example, if
-    there are too few distinct values of the input to create enough distinct quantiles. Note also
-    that NaN values are handled specially and placed into their own bucket. For example, if 4
-    buckets are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in
-    a special bucket(4).
     The bin ranges are chosen using an approximate algorithm (see the documentation for
     :py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed description).
     The precision of the approximation can be controlled with the