author | VinceShieh <vincent.xie@intel.com> | 2016-09-21 10:20:57 +0100 |
---|---|---|
committer | Sean Owen <sowen@cloudera.com> | 2016-09-21 10:20:57 +0100 |
commit | 57dc326bd00cf0a49da971e9c573c48ae28acaa2 (patch) | |
tree | 3e57f59b33e42beddd1df72dbd85266e7b09ef7f /sql/core/src | |
parent | b366f18496e1ce8bd20fe58a0245ef7d91819a03 (diff) | |
[SPARK-17219][ML] Add NaN value handling in Bucketizer
## What changes were proposed in this pull request?
This PR fixes an issue that occurs when Bucketizer is applied to a dataset containing NaN values.
Such values may still be meaningful to users, so in these cases Bucketizer should
reserve one extra bucket for NaN values instead of throwing an exception.
Before:
```
Bucketizer.transform threw an exception on NaN values.
```
After:
```
NaN values will be grouped in an extra bucket.
```
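The behaviour described above can be sketched in plain Scala, without any Spark dependency. This is only an illustration under stated assumptions: `NaNBucketSketch` and `bucketOf` are hypothetical names, not part of Spark's API, and the sketch assumes the extra NaN bucket takes the index right after the last regular bucket.

```scala
object NaNBucketSketch {
  // Hypothetical helper, not Spark's API: maps a value to a bucket index.
  // `splits` must be strictly increasing. Regular buckets are
  // [splits(i), splits(i + 1)), except the last one, which also includes
  // its upper bound.
  def bucketOf(splits: Array[Double], x: Double): Int = {
    val regularBuckets = splits.length - 1
    if (x.isNaN) {
      // Assumption: NaN goes into one extra bucket after the regular ones.
      regularBuckets
    } else if (x == splits.last) {
      // The upper bound belongs to the last regular bucket.
      regularBuckets - 1
    } else {
      // Index of the first split strictly greater than x.
      val idx = splits.indexWhere(_ > x)
      require(idx > 0, s"value $x is outside the range covered by the splits")
      idx - 1
    }
  }
}
```

With splits `Array(0.0, 1.0, 2.0)` there are two regular buckets (indices 0 and 1), and under the assumption above a NaN input maps to index 2 instead of raising an error.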
## How was this patch tested?
New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh <vincent.xie@intel.com>
Author: VinceShieh <vincent.xie@intel.com>
Closes #14858 from VinceShieh/spark-17219.
Diffstat (limited to 'sql/core/src')
-rw-r--r-- | sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala | 4 |
1 files changed, 3 insertions, 1 deletions
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala
index 1855eab96e..d69be36917 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala
@@ -52,6 +52,7 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
    * The algorithm was first present in [[http://dx.doi.org/10.1145/375663.375670 Space-efficient
    * Online Computation of Quantile Summaries]] by Greenwald and Khanna.
    *
+   * Note that NaN values will be removed from the numerical column before calculation
    * @param col the name of the numerical column
    * @param probabilities a list of quantile probabilities
    *   Each number must belong to [0, 1].
@@ -67,7 +68,8 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       col: String,
       probabilities: Array[Double],
       relativeError: Double): Array[Double] = {
-    StatFunctions.multipleApproxQuantiles(df, Seq(col), probabilities, relativeError).head.toArray
+    StatFunctions.multipleApproxQuantiles(df.select(col).na.drop(),
+      Seq(col), probabilities, relativeError).head.toArray
   }
 
   /**
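
The effect of the `na.drop()` call in the hunk above can be illustrated with a small plain-Scala sketch. It is not the Greenwald-Khanna algorithm used by Spark: `QuantileNaNSketch` and `quantileIgnoringNaN` are hypothetical names, and the quantile is computed exactly by sorting an in-memory sequence. The only point shown is that NaN values are filtered out before any quantile is computed.

```scala
object QuantileNaNSketch {
  // Hypothetical helper for illustration only: exact quantile by sorting,
  // computed after NaN values have been removed from the input.
  def quantileIgnoringNaN(values: Seq[Double], p: Double): Option[Double] = {
    require(p >= 0.0 && p <= 1.0, "probability must be in [0, 1]")
    // Drop NaN first, mirroring the intent of df.select(col).na.drop().
    val clean = values.filterNot(_.isNaN).sortWith(_ < _)
    if (clean.isEmpty) {
      None // nothing left once NaN values are removed
    } else {
      // Nearest-rank definition: smallest value whose rank is >= p * n.
      val rank = math.ceil(p * clean.length).toInt
      Some(clean(math.max(rank - 1, 0)))
    }
  }
}
```

Note that in Spark itself `na.drop()` removes rows containing null as well as NaN, so the sketch covers only the NaN half of that behaviour.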