author | VinceShieh <vincent.xie@intel.com> | 2016-10-27 11:52:15 -0700 |
---|---|---|
committer | Joseph K. Bradley <joseph@databricks.com> | 2016-10-27 11:52:15 -0700 |
commit | 0b076d4cb6afde2946124e6411ed6a6ce7b8b1a7 (patch) | |
tree | 6a7af2f31ce677e529b631d726077b58e78490da /python/pyspark/ml/feature.py | |
parent | 104232580528c097a284d753adb5795f6de8b0a5 (diff) | |
[SPARK-17219][ML] enhanced NaN value handling in Bucketizer
## What changes were proposed in this pull request?
This PR enhances the earlier PR with commit ID 57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special value that is usually treated as invalid. However, there are cases where NaN values carry useful information and therefore need special handling. This PR gives users three options for dealing with NaN values: reserve an extra bucket for them, remove them, or report an error, selected by setting handleNaN to "keep", "skip", or "error" (the default), respectively.
'''Before:
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleNaN("keep")
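The three modes described above can be sketched in plain Python, independent of Spark. This is only an illustration of the intended semantics, not the code in this patch: the helper `bucketize` and its `handle_nan` argument are hypothetical names, and the sketch assumes that buckets are half-open intervals `[lo, hi)` with the last bucket also including its upper bound, and that "keep" assigns NaN to one extra bucket after the regular ones.

```python
import math
from bisect import bisect_right


def bucketize(values, splits, handle_nan="error"):
    """Map each value to a bucket index (hypothetical helper, not Spark's code)."""
    if handle_nan not in ("keep", "skip", "error"):
        raise ValueError("handle_nan must be 'keep', 'skip', or 'error'")
    num_buckets = len(splits) - 1
    result = []
    for v in values:
        if math.isnan(v):
            if handle_nan == "keep":
                # NaNs go to one extra bucket after the regular ones.
                result.append(num_buckets)
            elif handle_nan == "skip":
                # Drop the value entirely.
                continue
            else:
                raise ValueError("NaN found; use handle_nan='keep' or 'skip'")
        elif v < splits[0] or v > splits[-1]:
            raise ValueError(f"value {v} is outside the splits range")
        else:
            # Assumed convention: [lo, hi), except the last bucket,
            # which also includes its upper bound.
            idx = min(bisect_right(splits, v) - 1, num_buckets - 1)
            result.append(idx)
    return result


splits = [float("-inf"), 0.0, 10.0, float("inf")]
data = [-5.0, 3.0, float("nan"), 42.0]

kept = bucketize(data, splits, "keep")     # NaN lands in the extra bucket 3
skipped = bucketize(data, splits, "skip")  # NaN is dropped from the output
```

With three regular buckets (indices 0 to 2), "keep" places the NaN in bucket 3, "skip" returns one fewer result, and the default "error" raises as soon as a NaN is seen.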
## How was this patch tested?
Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite
Signed-off-by: VinceShieh <vincent.xie@intel.com>
Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #15428 from VinceShieh/spark-17219_followup.
Diffstat (limited to 'python/pyspark/ml/feature.py')
-rwxr-xr-x | python/pyspark/ml/feature.py | 5 |
1 files changed, 0 insertions, 5 deletions
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 7683360664..94afe82a36 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -1155,11 +1155,6 @@ class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadab
     `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
     categorical features. The number of bins can be set using the :py:attr:`numBuckets` parameter.
-    It is possible that the number of buckets used will be less than this value, for example, if
-    there are too few distinct values of the input to create enough distinct quantiles. Note also
-    that NaN values are handled specially and placed into their own bucket. For example, if 4
-    buckets are used, then non-NaN data will be put into buckets(0-3), but NaNs will be counted in
-    a special bucket(4).
     The bin ranges are chosen using an approximate algorithm (see the documentation for
     :py:meth:`~.DataFrameStatFunctions.approxQuantile` for a detailed description).
     The precision of the approximation can be controlled with the