author | Yuhao Yang <hhbyyh@gmail.com> | 2017-03-16 12:49:59 +0200 |
---|---|---|
committer | Nick Pentreath <nickp@za.ibm.com> | 2017-03-16 12:49:59 +0200 |
commit | d647aae278ef31a07fc64715eb07e48294d94bb8 (patch) | |
tree | 13570e50f38a430469158ff5305a67edf2d301d1 /sql/core/src | |
parent | 1472cac4bb31c1886f82830778d34c4dd9030d7a (diff) | |
download | spark-d647aae278ef31a07fc64715eb07e48294d94bb8.tar.gz spark-d647aae278ef31a07fc64715eb07e48294d94bb8.tar.bz2 spark-d647aae278ef31a07fc64715eb07e48294d94bb8.zip |
[SPARK-13568][ML] Create feature transformer to impute missing values
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-13568
It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn.
Initially, options for imputation could include mean, median and most frequent, but we could add various other approaches. Where possible, existing DataFrame code can be used (e.g. for approximate quantiles).
Currently this PR supports imputation for Double and Vector (null and NaN in Vector).
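To make the three strategies named above concrete, here is a minimal sketch in plain Python. It is not the code from this PR and does not use the Spark API. The function names (`is_missing`, `fit_surrogate`, `impute`) and the strategy string `"most_frequent"` are hypothetical, chosen only for illustration. It assumes a single column of floats in which both `None` (null) and NaN count as missing, and it computes the replacement value from the observed entries only.

```python
import math
from collections import Counter
from statistics import mean, median
from typing import List, Optional


def is_missing(value: Optional[float]) -> bool:
    """Treat both None (null) and NaN as missing (an assumption of this sketch)."""
    return value is None or math.isnan(value)


def fit_surrogate(column: List[Optional[float]], strategy: str = "mean") -> float:
    """Compute the replacement value from the non-missing entries only."""
    observed = [v for v in column if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    if strategy == "mean":
        return mean(observed)
    if strategy == "median":
        return median(observed)
    if strategy == "most_frequent":
        # Ties are resolved in favor of the value seen first.
        return Counter(observed).most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")


def impute(column: List[Optional[float]], strategy: str = "mean") -> List[float]:
    """Return a copy of the column with missing entries replaced by the surrogate."""
    surrogate = fit_surrogate(column, strategy)
    return [surrogate if is_missing(v) else v for v in column]


# Hypothetical sample column with one null and one NaN.
column = [1.0, None, 1.0, float("nan"), 4.0, 10.0]
imputed_mean = impute(column, "mean")
imputed_median = impute(column, "median")
imputed_mode = impute(column, "most_frequent")
```

Splitting the work into a fit step (compute the surrogate) and a transform step (fill the gaps) mirrors the usual estimator and model pattern, so a surrogate learned on training data could be reused on new data. On a distributed DataFrame the median would more likely come from an approximate quantile computation than from an exact sort.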
## How was this patch tested?
New unit tests and manual testing.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao <yuhao.yang@intel.com>
Closes #11601 from hhbyyh/imputer.
Diffstat (limited to 'sql/core/src')
0 files changed, 0 insertions, 0 deletions