[SPARK-13568][ML] Create feature transformer to impute missing values - spark

author	Yuhao Yang <hhbyyh@gmail.com>	2017-03-16 12:49:59 +0200
committer	Nick Pentreath <nickp@za.ibm.com>	2017-03-16 12:49:59 +0200
commit	d647aae278ef31a07fc64715eb07e48294d94bb8 (patch)
tree	13570e50f38a430469158ff5305a67edf2d301d1 /sql/core/src
parent	1472cac4bb31c1886f82830778d34c4dd9030d7a (diff)

[SPARK-13568][ML] Create feature transformer to impute missing values

## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-13568

It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn.

Initially, options for imputation could include mean, median and most frequent, but we could add various other approaches, where possible existing DataFrame code can be used (e.g. for approximate quantiles etc).

Currently this PR supports imputation for Double and Vector (null and NaN in Vector).

## How was this patch tested?

new unit tests and manual test

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao <yuhao.yang@intel.com>

Closes #11601 from hhbyyh/imputer.
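
For readers unfamiliar with imputation, the idea described above can be sketched in plain Python. This is a minimal illustration, not the code added by this commit: the function names `is_missing`, `fit_surrogate` and `transform` are hypothetical, the fit/transform split is an assumed design, and the median here is exact rather than derived from approximate quantiles.

```python
import math
import statistics
from typing import List, Optional


def is_missing(x: Optional[float]) -> bool:
    """Treat both None (null) and NaN as missing values."""
    return x is None or math.isnan(x)


def fit_surrogate(column: List[Optional[float]], strategy: str = "mean") -> float:
    """Compute the replacement value from the observed (non-missing) entries.

    Hypothetical helper: only "mean" and "median" are sketched here.
    """
    observed = [x for x in column if not is_missing(x)]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    if strategy == "mean":
        return statistics.mean(observed)
    if strategy == "median":
        # Exact median; a distributed implementation might approximate it.
        return statistics.median(observed)
    raise ValueError(f"unsupported strategy: {strategy}")


def transform(column: List[Optional[float]], surrogate: float) -> List[float]:
    """Return a copy of the column with missing entries replaced."""
    return [surrogate if is_missing(x) else x for x in column]


if __name__ == "__main__":
    col = [1.0, None, 3.0, float("nan"), 8.0]
    mean_value = fit_surrogate(col, "mean")
    print(transform(col, mean_value))
```

Computing the surrogate separately from applying it means the value learned on one data set can be reused to fill gaps in another, which is the usual reason for a fit-then-transform split.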

Diffstat (limited to 'sql/core/src')

0 files changed, 0 insertions, 0 deletions

