[SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer - spark


author	Sandy Ryza <sandy@cloudera.com>	2015-05-05 12:34:02 -0700
committer	Xiangrui Meng <meng@databricks.com>	2015-05-05 12:34:02 -0700
commit	47728db7cfac995d9417cdf0e16d07391aabd581 (patch)
tree	4479d80b2c281512c29ea0f32d21168cca493e58 /python/pyspark
parent	ee374e89cd1f08730fed9d50b742627d5b19d241 (diff)

[SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer

This patch adds a one-hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach. A couple of choices made here:

* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be easily obtained from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:

f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer
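
For illustration only, the two choices described in the commit message (the `includeFirst` switch and the colName_categoryName output naming) can be sketched in plain Python. This is not the code from the patch and does not use any Spark API. The function name `one_hot_encode`, its parameters, and the sample data are hypothetical. The sketch also assumes that with `includeFirst` set to false it is the first category's column that gets omitted, which the commit message implies by the option name but does not state outright.

```python
def one_hot_encode(indices, categories, col_name, include_first=True):
    """Hypothetical sketch of one-hot encoding over category indices.

    indices       -- category index per row (e.g. as an indexer might produce)
    categories    -- category names, where position i names index i
    col_name      -- input column name, used as the output column prefix
    include_first -- if False, omit the first category's column (assumption),
                     giving len(categories) - 1 output columns
    """
    start = 0 if include_first else 1
    kept = list(range(start, len(categories)))

    # Output column names take the form colName_categoryName.
    columns = [f"{col_name}_{categories[i]}" for i in kept]

    # Each row has a 1.0 in the column matching its category, 0.0 elsewhere.
    # With include_first=False, rows of the first category become all zeros.
    rows = [[1.0 if idx == i else 0.0 for i in kept] for idx in indices]
    return columns, rows


# Made-up sample data: three categories, three rows.
categories = ["a", "b", "c"]
indices = [0, 2, 1]

columns_all, rows_all = one_hot_encode(indices, categories, "color")
columns_drop, rows_drop = one_hot_encode(
    indices, categories, "color", include_first=False
)
```

With three categories, `columns_all` has three entries and `columns_drop` has two. Omitting one column keeps the encoding unambiguous, because a row of all zeros then stands for the omitted category.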

Diffstat (limited to 'python/pyspark')

0 files changed, 0 insertions, 0 deletions

