author | Sandy Ryza <sandy@cloudera.com> | 2015-05-05 12:34:02 -0700 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2015-05-05 12:34:02 -0700 |
commit | 47728db7cfac995d9417cdf0e16d07391aabd581 (patch) | |
tree | 4479d80b2c281512c29ea0f32d21168cca493e58 /python/pyspark | |
parent | ee374e89cd1f08730fed9d50b742627d5b19d241 (diff) | |
[SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer
This patch adds a one-hot encoder for categorical features. I plan to add documentation and another test after getting feedback on the approach.
A couple of choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which matches the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be obtained easily from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.
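To make the two choices above concrete, here is a minimal sketch in plain Python. It is not the code from this patch and does not use the Spark API: the helper names `one_hot_columns` and `one_hot_encode` are hypothetical, and it assumes that `includeFirst = false` means the first category in the supplied list is the one left out.

```python
from typing import Dict, List, Sequence


def one_hot_columns(col_name: str, categories: Sequence[str],
                    include_first: bool = True) -> List[str]:
    """Build output column names of the form colName_categoryName.

    Assumption: with include_first=False the first category is dropped,
    leaving numCategories - 1 columns.
    """
    kept = list(categories) if include_first else list(categories)[1:]
    return [f"{col_name}_{category}" for category in kept]


def one_hot_encode(values: Sequence[str], col_name: str,
                   categories: Sequence[str],
                   include_first: bool = True) -> List[Dict[str, float]]:
    """Encode each value as a row mapping output column name to 0.0 or 1.0."""
    columns = one_hot_columns(col_name, categories, include_first)
    rows = []
    for value in values:
        if value not in categories:
            raise ValueError(f"unknown category: {value!r}")
        target = f"{col_name}_{value}"
        # A value whose column was dropped becomes an all-zero row.
        rows.append({c: 1.0 if c == target else 0.0 for c in columns})
    return rows


if __name__ == "__main__":
    cats = ["red", "green", "blue"]
    print(one_hot_columns("color", cats))
    print(one_hot_encode(["red", "blue"], "color", cats, include_first=False))
```

Under this assumption, dropping the first category loses no information, because a row of all zeros identifies it uniquely.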
Author: Sandy Ryza <sandy@cloudera.com>
Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:
f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer
Diffstat (limited to 'python/pyspark')
0 files changed, 0 insertions, 0 deletions