author    Yanbo Liang <ybliang8@gmail.com>    2016-08-19 03:23:16 -0700
committer Yanbo Liang <ybliang8@gmail.com>    2016-08-19 03:23:16 -0700
commit    864be9359ae2f8409e6dbc38a7a18593f9cc5692 (patch)
tree      0995685df71db57dd6360fb3830fe7364b6fb42c /mllib/src/main
parent    5377fc62360d5e9b5c94078e41d10a96e0e8a535 (diff)
[SPARK-17141][ML] MinMaxScaler should retain NaN values.
## What changes were proposed in this pull request?

In the existing code, ```MinMaxScaler``` handles ```NaN``` values inconsistently:

* If a column is constant, that is ```max == min```, the ```MinMaxScalerModel``` transformation outputs ```0.5``` for every row, even when the original value is ```NaN```.
* Otherwise, the value remains ```NaN``` after transformation.

We should unify the behavior by retaining ```NaN``` values under all conditions, since we do not know how to transform a ```NaN``` value. In Python's scikit-learn, an exception is thrown when the dataset contains ```NaN```.

## How was this patch tested?

Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14716 from yanboliang/spark-17141.
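For illustration, here is a minimal, hypothetical driver sketching the behavior this patch changes (the app name, column names, and data are made up, and it assumes the fitted summary skips ```NaN``` when tracking per-column min and max, as the description above implies):

```scala
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MinMaxScalerNaNDemo").getOrCreate()
import spark.implicits._

// The second component is constant (5.0) except for one NaN, so for that
// column max == min and the pre-patch code took the 0.5 branch.
val df = Seq(
  Vectors.dense(1.0, 5.0),
  Vectors.dense(2.0, 5.0),
  Vectors.dense(3.0, Double.NaN)
).map(Tuple1.apply).toDF("features")

val model = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaled")
  .fit(df)

model.transform(df).select("scaled").show(truncate = false)
// Before this patch: the NaN entry came out rescaled as if it were 0.5.
// After this patch: the NaN entry stays NaN.
```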
Diffstat (limited to 'mllib/src/main')
-rw-r--r-- mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
index 9f3d2ca6db..28cbe1cb01 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
@@ -186,8 +186,10 @@ class MinMaxScalerModel private[ml] (
val size = values.length
var i = 0
while (i < size) {
- val raw = if (originalRange(i) != 0) (values(i) - minArray(i)) / originalRange(i) else 0.5
- values(i) = raw * scale + $(min)
+ if (!values(i).isNaN) {
+ val raw = if (originalRange(i) != 0) (values(i) - minArray(i)) / originalRange(i) else 0.5
+ values(i) = raw * scale + $(min)
+ }
i += 1
}
Vectors.dense(values)
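For readers without the surrounding diff context, the patched inner loop can be read as the self-contained sketch below; ```rescale```, ```targetMin```, and ```targetMax``` are hypothetical stand-ins for the model's ```$(min)``` and ```$(max)``` params and its fitted statistics, not names from the actual source:

```scala
// Sketch of MinMaxScalerModel's patched per-element logic.
// minArray(i) is the fitted column minimum; originalRange(i) = fitted max - min.
def rescale(
    values: Array[Double],
    minArray: Array[Double],
    originalRange: Array[Double],
    targetMin: Double,
    targetMax: Double): Array[Double] = {
  val scale = targetMax - targetMin
  var i = 0
  while (i < values.length) {
    // NaN entries are now left untouched instead of being mapped via the
    // 0.5 fallback that fires when a column is constant (range == 0).
    if (!values(i).isNaN) {
      val raw = if (originalRange(i) != 0) (values(i) - minArray(i)) / originalRange(i) else 0.5
      values(i) = raw * scale + targetMin
    }
    i += 1
  }
  values
}
```

The guard is per element, so a vector mixing real values and ```NaN``` still has its real components rescaled into ```[targetMin, targetMax]``` while the ```NaN``` components pass through unchanged.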