Diffstat (limited to 'docs/ml-features.md')
-rw-r--r-- | docs/ml-features.md | 6 |
1 files changed, 5 insertions, 1 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 746593fb9e..a39b31c8f7 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1102,7 +1102,11 @@ for more details on the API.
 ## QuantileDiscretizer
 
 `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
-categorical features. The number of bins is set by the `numBuckets` parameter.
+categorical features. The number of bins is set by the `numBuckets` parameter. It is possible
+that the number of buckets used will be less than this value, for example, if there are too few
+distinct values of the input to create enough distinct quantiles. Note also that NaN values are
+handled specially and placed into their own bucket. For example, if 4 buckets are used, then
+non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
 The bin ranges are chosen using an approximate algorithm (see the documentation for
 [approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) for a
 detailed description). The precision of the approximation can be controlled with the
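
The behaviour described by the added lines can be illustrated with a small pure-Python sketch. This is not Spark's implementation: it uses exact nearest-rank quantiles instead of the approximate algorithm, and the helper names `quantile_splits` and `assign_bucket` are hypothetical, introduced only for illustration. It shows why duplicate quantiles reduce the bucket count and how NaN lands in one extra bucket after the regular ones.

```python
import math
from bisect import bisect_right


def quantile_splits(values, num_buckets):
    """Return the distinct interior split points for up to num_buckets buckets."""
    # NaN values never take part in choosing the splits.
    clean = sorted(v for v in values if not math.isnan(v))
    if not clean:
        return []
    splits = []
    for i in range(1, num_buckets):
        # Exact nearest-rank quantile (an assumption; Spark approximates this).
        idx = min(len(clean) - 1, (i * len(clean)) // num_buckets)
        splits.append(clean[idx])
    # Duplicate quantiles collapse, so fewer buckets than requested may remain.
    return sorted(set(splits))


def assign_bucket(value, splits):
    """Map a value to a bucket index; NaN goes to one extra bucket at the end."""
    if math.isnan(value):
        return len(splits) + 1
    return bisect_right(splits, value)


splits = quantile_splits([1, 2, 3, 4, 5, 6, 7, 8], 4)
regular = [assign_bucket(v, splits) for v in (1, 4, 8)]
nan_bucket = assign_bucket(float("nan"), splits)

few_distinct = quantile_splits([1, 1, 1, 1, 2, 2, 2, 2], 4)
```

With eight distinct inputs and four requested buckets, regular values fall into indices 0 through 3 and NaN into index 4, matching the example in the added text. With only two distinct inputs, the three requested quantiles collapse to two split points, so fewer than four buckets are produced.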