author | Yanbo Liang <ybliang8@gmail.com> | 2016-08-19 03:23:16 -0700 |
---|---|---|
committer | Yanbo Liang <ybliang8@gmail.com> | 2016-08-19 03:23:16 -0700 |
commit | 864be9359ae2f8409e6dbc38a7a18593f9cc5692 (patch) | |
tree | 0995685df71db57dd6360fb3830fe7364b6fb42c /tools | |
parent | 5377fc62360d5e9b5c94078e41d10a96e0e8a535 (diff) | |
[SPARK-17141][ML] MinMaxScaler should remain NaN value.
## What changes were proposed in this pull request?
In the existing code, ```MinMaxScaler``` handles ```NaN``` values inconsistently:
* If a column has a constant value, that is, ```max == min```, the ```MinMaxScalerModel``` transformation outputs ```0.5``` for all rows, even when the original value is ```NaN```.
* Otherwise, a ```NaN``` value remains ```NaN``` after transformation.
I think we should unify the behavior by keeping ```NaN``` values as ```NaN``` in every case, since there is no meaningful way to rescale a ```NaN``` value. For comparison, sklearn in Python throws an exception when the dataset contains ```NaN```.
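The proposed behavior can be illustrated with a minimal sketch in plain Python. This is not the Spark implementation: `rescale` is a hypothetical helper, and the parameter names `e_min`, `e_max`, `lo`, and `hi` are assumptions made for illustration. The sketch also assumes that the ```0.5``` output for a constant column is the midpoint of the default target range `[0, 1]`.

```python
import math


def rescale(value, e_min, e_max, lo=0.0, hi=1.0):
    """Hypothetical helper: min-max rescale one value, keeping NaN as NaN.

    e_min and e_max are the column minimum and maximum; lo and hi are the
    bounds of the target range (assumed here to default to [0, 1]).
    """
    if math.isnan(value):
        # NaN is passed through unchanged in every case.
        return value
    span = e_max - e_min
    if span == 0.0:
        # Constant column (max == min): use the midpoint of the target range.
        return 0.5 * (lo + hi)
    return (value - e_min) / span * (hi - lo) + lo


# A constant column that contains one NaN entry.
constant_column = [2.0, float("nan"), 2.0]
scaled = [rescale(v, 2.0, 2.0) for v in constant_column]
```

With this ordering of checks, the `NaN` entry stays `NaN` while the other entries of the constant column map to `0.5`, so both branches described above treat `NaN` the same way.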
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14716 from yanboliang/spark-17141.