-rwxr-xr-x python/pyspark/ml/feature.py | 12
1 files changed, 11 insertions, 1 deletions
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index ac90c899d9..1ab42919ea 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -1178,7 +1178,17 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
 categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
- The bin ranges are chosen using an approximate algorithm (see the documentation for
+ It is possible that the number of buckets used will be less than this value, for example, if
+ there are too few distinct values of the input to create enough distinct quantiles.
+
+ NaN handling: Note also that
+ QuantileDiscretizer will raise an error when it finds NaN values in the dataset, but the user
+ can also choose to either keep or remove NaN values within the dataset by setting
+ :py:attr:`handleInvalid` parameter. If the user chooses to keep NaN values, they will be
+ handled specially and placed into their own bucket, for example, if 4 buckets are used, then
+ non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+
+ Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
 :py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed description).
 The precision of the approximation can be controlled with the
 :py:attr:`relativeError` parameter.
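
The bucket semantics described in the new docstring text can be illustrated with a small pure-Python sketch. This is not Spark's implementation: it uses exact quantiles on a sorted list instead of the approximate algorithm, and the helper names `quantile_splits` and `discretize`, as well as the `handle_invalid` argument with its values `"error"`, `"skip"`, and `"keep"`, are hypothetical stand-ins chosen to mirror the `handleInvalid` behaviour the docstring describes.

```python
import math
from bisect import bisect_right


def quantile_splits(values, num_buckets):
    """Return sorted, de-duplicated interior split points (exact quantiles).

    Hypothetical helper: Spark uses an approximate quantile algorithm instead.
    """
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        return []
    candidates = [
        clean[min(len(clean) - 1, (i * len(clean)) // num_buckets)]
        for i in range(1, num_buckets)
    ]
    # Identical quantiles collapse into one split, so the number of
    # buckets can end up smaller than num_buckets.
    return sorted(set(candidates))


def discretize(values, num_buckets, handle_invalid="error"):
    """Map each value to a bucket index; NaNs follow handle_invalid."""
    splits = quantile_splits(values, num_buckets)
    nan_bucket = len(splits) + 1  # one past the last regular bucket
    result = []
    for v in values:
        if math.isnan(v):
            if handle_invalid == "error":
                raise ValueError("NaN found; use 'skip' or 'keep' instead")
            if handle_invalid == "skip":
                continue  # drop the value entirely
            result.append(nan_bucket)  # 'keep': NaNs get their own bucket
        else:
            # A value equal to a split falls into the upper bucket.
            result.append(bisect_right(splits, v))
    return result


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, float("nan")]
kept = discretize(data, 4, handle_invalid="keep")
skipped = discretize(data, 4, handle_invalid="skip")

# Only two distinct values, so four buckets cannot be formed.
few_distinct_splits = quantile_splits([1.0, 1.0, 1.0, 2.0, 2.0, 2.0], 4)
```

With four buckets requested, the non-NaN values in `data` land in buckets 0 to 3 and the NaN lands in bucket 4 under `"keep"`, matching the example in the docstring. The second input has only two distinct values, so the duplicate quantiles collapse and fewer than four buckets result.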