diff options
author | VinceShieh <vincent.xie@intel.com> | 2017-02-15 10:12:07 -0800 |
---|---|---|
committer | Holden Karau <holden@us.ibm.com> | 2017-02-15 10:12:07 -0800 |
commit | 6eca21ba881120f1ac7854621380ef8a92972384 (patch) | |
tree | 6185cae2b129369cea41149141af37592cc16a4c /python/pyspark/ml/feature.py | |
parent | acf71c63cdde8dced8d108260cdd35e1cc992248 (diff) | |
download | spark-6eca21ba881120f1ac7854621380ef8a92972384.tar.gz spark-6eca21ba881120f1ac7854621380ef8a92972384.tar.bz2 spark-6eca21ba881120f1ac7854621380ef8a92972384.zip |
[SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in pyspark
## What changes were proposed in this pull request?
This PR documents the changes to QuantileDiscretizer in pyspark made in PR:
https://github.com/apache/spark/pull/15428
## How was this patch tested?
No test needed
Signed-off-by: VinceShieh <vincent.xie@intel.com>
Author: VinceShieh <vincent.xie@intel.com>
Closes #16922 from VinceShieh/spark-19590.
Diffstat (limited to 'python/pyspark/ml/feature.py')
-rwxr-xr-x | python/pyspark/ml/feature.py | 12 |
1 file changed, 11 insertions(+), 1 deletion(-)
```diff
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index ac90c899d9..1ab42919ea 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -1178,7 +1178,17 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
-    The bin ranges are chosen using an approximate algorithm (see the documentation for
+    It is possible that the number of buckets used will be less than this value, for example, if
+    there are too few distinct values of the input to create enough distinct quantiles.
+
+    NaN handling: Note also that
+    QuantileDiscretizer will raise an error when it finds NaN values in the dataset, but the user
+    can also choose to either keep or remove NaN values within the dataset by setting
+    :py:attr:`handleInvalid` parameter. If the user chooses to keep NaN values, they will be
+    handled specially and placed into their own bucket, for example, if 4 buckets are used, then
+    non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+
+    Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
     :py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed description).
     The precision of the approximation can be controlled with the
     :py:attr:`relativeError` parameter.
```
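The NaN semantics documented in this docstring change can be illustrated without a Spark cluster. The sketch below is a minimal pure-Python model of the described behavior, not Spark's actual implementation; the `bucketize` helper and its `splits` argument are hypothetical, chosen to mirror the three `handleInvalid` modes (`error`, `skip`, `keep`) and the extra NaN bucket.

```python
import bisect
import math

def bucketize(values, splits, handle_invalid="error"):
    """Toy model of the NaN handling described in the QuantileDiscretizer
    docstring above (hypothetical helper, not Spark's implementation).

    splits: sorted bucket boundaries, e.g. [-inf, 0.0, 10.0, inf] -> 3 buckets.
    handle_invalid: 'error' raises on NaN, 'skip' drops NaN rows,
                    'keep' places NaNs into an extra bucket numBuckets.
    """
    num_buckets = len(splits) - 1
    out = []
    for v in values:
        if math.isnan(v):
            if handle_invalid == "error":
                # default mode: NaN values are rejected
                raise ValueError("NaN value found in input")
            elif handle_invalid == "skip":
                continue  # drop the row
            else:  # 'keep'
                out.append(num_buckets)  # special extra bucket for NaNs
                continue
        # bisect_right finds the first split greater than v; subtracting 1
        # gives the bucket index, clamped into [0, num_buckets - 1]
        idx = bisect.bisect_right(splits, v) - 1
        out.append(min(max(idx, 0), num_buckets - 1))
    return out
```

For example, with 3 non-NaN buckets and `handle_invalid="keep"`, NaN values land in bucket 3, matching the "4 buckets used, NaNs counted in a special bucket[4]" pattern from the docstring (with zero-based indices).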