Diffstat (limited to 'docs/ml-features.md')
-rw-r--r-- docs/ml-features.md 6
1 files changed, 5 insertions, 1 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 746593fb9e..a39b31c8f7 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1102,7 +1102,11 @@ for more details on the API.
## QuantileDiscretizer
`QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
-categorical features. The number of bins is set by the `numBuckets` parameter.
+categorical features. The number of bins is set by the `numBuckets` parameter. It is possible
+that the number of buckets used will be less than this value, for example, if there are too few
+distinct values of the input to create enough distinct quantiles. Note also that NaN values are
+handled specially and placed into their own bucket. For example, if 4 buckets are used, then
+non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
The bin ranges are chosen using an approximate algorithm (see the documentation for
[approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) for a
detailed description). The precision of the approximation can be controlled with the