path: root/python/pyspark/ml/feature.py
author VinceShieh <vincent.xie@intel.com> 2016-09-21 10:20:57 +0100
committer Sean Owen <sowen@cloudera.com> 2016-09-21 10:20:57 +0100
commit 57dc326bd00cf0a49da971e9c573c48ae28acaa2
tree 3e57f59b33e42beddd1df72dbd85266e7b09ef7f /python/pyspark/ml/feature.py
parent b366f18496e1ce8bd20fe58a0245ef7d91819a03
[SPARK-17219][ML] Add NaN value handling in Bucketizer
## What changes were proposed in this pull request?

This PR fixes an issue that occurs when Bucketizer is called on a dataset containing NaN values. NaN values can be meaningful to users, so in these cases Bucketizer should reserve one extra bucket for them instead of throwing an exception.

Before:
```
Bucketizer.transform on a NaN value threw an exception.
```
After:
```
NaN values are grouped into an extra bucket.
```

## How was this patch tested?

New test cases added in `BucketizerSuite`.

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #14858 from VinceShieh/spark-17219.
Diffstat (limited to 'python/pyspark/ml/feature.py')
-rwxr-xr-x python/pyspark/ml/feature.py | 5
1 files changed, 5 insertions, 0 deletions
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 2881380152..c45434f1a5 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -1155,6 +1155,11 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
`QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
+ It is possible that the number of buckets used will be less than this value, for example, if
+ there are too few distinct values of the input to create enough distinct quantiles. Note also
+ that NaN values are handled specially and placed into their own bucket. For example, if 4
+ buckets are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in
+ a special bucket(4).
The bin ranges are chosen using an approximate algorithm (see the documentation for
:py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed description).
The precision of the approximation can be controlled with the
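
The docstring text added above describes two behaviours: fewer buckets than `numBuckets` when the input has too few distinct values, and a dedicated extra bucket for NaN. The following is a minimal pure-Python sketch of that rule, not the Spark implementation. The helper names `distinct_quantile_splits` and `bucket_index` are hypothetical, the quantiles are computed exactly rather than approximately, and buckets are returned as ints rather than the doubles Spark emits.

```python
import math
from bisect import bisect_right


def distinct_quantile_splits(values, num_buckets):
    """Hypothetical helper: choose split points at evenly spaced ranks.

    NaNs are ignored when choosing splits, and duplicate split points are
    dropped, so the result may describe fewer than num_buckets buckets.
    """
    data = sorted(v for v in values if not math.isnan(v))
    interior = []
    if data:
        for i in range(1, num_buckets):
            candidate = data[(i * len(data)) // num_buckets]
            # Keep only strictly increasing split points.
            if not interior or candidate != interior[-1]:
                interior.append(candidate)
    return [-math.inf] + interior + [math.inf]


def bucket_index(value, splits):
    """Hypothetical helper: map a value to its bucket index.

    Regular buckets are numbered 0 .. len(splits) - 2. NaN goes to one
    extra bucket numbered len(splits) - 1.
    """
    num_regular = len(splits) - 1
    if math.isnan(value):
        return num_regular
    # Bucket i covers [splits[i], splits[i + 1]); clamp so the upper
    # bound itself falls into the last regular bucket.
    return min(bisect_right(splits, value) - 1, num_regular - 1)


values = [0.1, 0.4, 1.2, 1.5, float("nan"), 2.7, 3.3]
splits = distinct_quantile_splits(values, 4)
buckets = [bucket_index(v, splits) for v in values]
# With 4 buckets, non-NaN values land in 0-3 and the NaN lands in 4.

few_distinct = [1.0, 1.0, 1.0, 2.0, float("nan")]
few_splits = distinct_quantile_splits(few_distinct, 4)
# Only 3 regular buckets result here, so the NaN bucket is number 3.
```

In this sketch the NaN bucket index always equals the number of regular buckets actually produced, which matches the "buckets(0-3) plus special bucket(4)" example in the docstring when all 4 buckets are used.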