path: root/docs/ml-features.md
author    VinceShieh <vincent.xie@intel.com>    2016-10-27 11:52:15 -0700
committer Joseph K. Bradley <joseph@databricks.com>    2016-10-27 11:52:15 -0700
commit    0b076d4cb6afde2946124e6411ed6a6ce7b8b1a7 (patch)
tree      6a7af2f31ce677e529b631d726077b58e78490da /docs/ml-features.md
parent    104232580528c097a284d753adb5795f6de8b0a5 (diff)
[SPARK-17219][ML] enhanced NaN value handling in Bucketizer
## What changes were proposed in this pull request?

This PR is an enhancement of the PR with commit ID 57dc326bd00cf0a49da971e9c573c48ae28acaa2.

NaN is a special value that is commonly treated as invalid, but there are cases where NaN values are meaningful and need special handling. This PR gives the user three options for dealing with NaN values: reserve an extra bucket for them, remove them, or report an error, by setting handleNaN to "keep", "skip", or "error" (default), respectively.

Before:

```scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
```

After:

```scala
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleNaN("keep")
```

## How was this patch tested?

Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite.

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15428 from VinceShieh/spark-17219_followup.
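As additional context, here is a minimal runnable sketch of the "keep" behavior described above. It assumes the option is exposed as `handleInvalid` (the parameter name used in the documentation change below) rather than `handleNaN`; the sample data, splits, and object name are illustrative only.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

object BucketizerNaNSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BucketizerNaNSketch").getOrCreate()
    import spark.implicits._

    // Three regular buckets: (-inf, -0.5), [-0.5, 0.5), [0.5, +inf)
    val splits = Array(Double.NegativeInfinity, -0.5, 0.5, Double.PositiveInfinity)
    val df = Seq(-0.9, -0.1, 0.2, 0.8, Double.NaN).toDF("feature")

    val bucketizer = new Bucketizer()
      .setInputCol("feature")
      .setOutputCol("result")
      .setSplits(splits)
      .setHandleInvalid("keep") // alternatives: "skip" (drop NaN rows), "error" (default)

    // With "keep", the NaN row is assigned to an extra bucket (index 3 here),
    // one past the last regular bucket.
    bucketizer.transform(df).show()

    spark.stop()
  }
}
```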
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r--  docs/ml-features.md | 15
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/docs/ml-features.md b/docs/ml-features.md
index a7f710fa52..64c6a16023 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1103,11 +1103,16 @@ for more details on the API.
`QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
categorical features. The number of bins is set by the `numBuckets` parameter. It is possible
-that the number of buckets used will be less than this value, for example, if there are too few
-distinct values of the input to create enough distinct quantiles. Note also that NaN values are
-handled specially and placed into their own bucket. For example, if 4 buckets are used, then
-non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
-The bin ranges are chosen using an approximate algorithm (see the documentation for
+that the number of buckets used will be smaller than this value, for example, if there are too few
+distinct values of the input to create enough distinct quantiles.
+
+NaN values: Note also that QuantileDiscretizer
+will raise an error when it finds NaN values in the dataset, but the user can also choose to either
+keep or remove NaN values within the dataset by setting `handleInvalid`. If the user chooses to keep
+NaN values, they will be handled specially and placed into their own bucket. For example, if 4 buckets
+are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
+
+Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
[approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) for a
detailed description). The precision of the approximation can be controlled with the
`relativeError` parameter. When set to zero, exact quantiles are calculated
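
Following the patch, a hedged sketch of calling `approxQuantile` directly to see the effect of `relativeError`; the data, probabilities, and object name are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object ApproxQuantileSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ApproxQuantileSketch").getOrCreate()
    import spark.implicits._

    val df = (1 to 10000).map(_.toDouble).toDF("value")

    // relativeError = 0.01 allows a small approximation error;
    // 0.0 requests exact quantiles at a higher computational cost.
    val approx = df.stat.approxQuantile("value", Array(0.25, 0.5, 0.75), 0.01)
    val exact  = df.stat.approxQuantile("value", Array(0.25, 0.5, 0.75), 0.0)

    println(s"approximate: ${approx.mkString(", ")}")
    println(s"exact:       ${exact.mkString(", ")}")

    spark.stop()
  }
}
```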