[SPARK-17219][ML] Add NaN value handling in Bucketizer - spark

author	VinceShieh <vincent.xie@intel.com>	2016-09-21 10:20:57 +0100
committer	Sean Owen <sowen@cloudera.com>	2016-09-21 10:20:57 +0100
commit	57dc326bd00cf0a49da971e9c573c48ae28acaa2 (patch)
tree	3e57f59b33e42beddd1df72dbd85266e7b09ef7f /python/pyspark/sql/streaming.py
parent	b366f18496e1ce8bd20fe58a0245ef7d91819a03 (diff)
[SPARK-17219][ML] Add NaN value handling in Bucketizer

## What changes were proposed in this pull request?

This PR fixes an issue where `Bucketizer` fails on a dataset containing NaN values. Such values might still be useful to users, so in these cases `Bucketizer` should reserve one extra bucket for NaN values instead of throwing an exception.

Before:
```
Bucketizer.transform on a NaN value threw an exception.
```
After:
```
NaN values will be grouped in an extra bucket.
```

## How was this patch tested?

New test cases added in `BucketizerSuite`.

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #14858 from VinceShieh/spark-17219.
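The "extra bucket" behavior described above can be illustrated with a minimal plain-Python sketch. This is not Spark's implementation: `bucketize` is a hypothetical helper, and both the index chosen for the NaN bucket (one past the last regular bucket) and the treatment of the upper bound as inclusive are assumptions made for illustration.

```python
import bisect
import math


def bucketize(value, splits):
    """Hypothetical helper: map a value to a bucket index.

    `splits` is a sorted list of n + 1 boundaries defining n regular
    buckets [splits[i], splits[i + 1]). NaN is sent to an extra bucket
    with index n (an assumption for this sketch) instead of raising.
    """
    n_buckets = len(splits) - 1
    if math.isnan(value):
        return n_buckets  # the extra bucket reserved for NaN
    if value == splits[-1]:
        return n_buckets - 1  # treat the upper bound as inclusive
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"value {value} is outside the split range")
    # bisect_right counts the boundaries <= value; subtract 1 for the index
    return bisect.bisect_right(splits, value) - 1


splits = [float("-inf"), 0.0, 10.0, float("inf")]
values = [-5.0, 3.0, float("nan"), 42.0]
buckets = [bucketize(v, splits) for v in values]
print(buckets)
```

With three regular buckets (indices 0 to 2), the NaN input lands in bucket 3 in this sketch, while out-of-range values still raise an error.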

Diffstat (limited to 'python/pyspark/sql/streaming.py')

0 files changed, 0 insertions, 0 deletions