author Sandy Ryza <sandy@cloudera.com> 2015-05-05 12:34:02 -0700
committer Xiangrui Meng <meng@databricks.com> 2015-05-05 12:34:02 -0700
commit 47728db7cfac995d9417cdf0e16d07391aabd581
tree 4479d80b2c281512c29ea0f32d21168cca493e58 /core
parent ee374e89cd1f08730fed9d50b742627d5b19d241
[SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer
This patch adds a one hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach.

A couple choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be easily gotten from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:

f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer
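
For readers unfamiliar with the two choices described in the message, here is a minimal plain-Python sketch of the encoding semantics. It is not the code from this patch and does not use the Spark API: the function names `one_hot_columns` and `one_hot_encode` are hypothetical, and the sketch assumes that `includeFirst = false` drops the column for the first category (index 0), so that category is encoded as all zeros.

```python
def one_hot_columns(col_name, categories, include_first=True):
    """Build output column names of the form colName_categoryName.

    Assumption: with include_first=False the first category gets no column.
    """
    kept = categories if include_first else categories[1:]
    return [f"{col_name}_{category}" for category in kept]


def one_hot_encode(index, num_categories, include_first=True):
    """Encode a category index as a list of 0.0/1.0 values.

    Produces num_categories values when include_first is True,
    and num_categories - 1 values otherwise.
    """
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} out of range for {num_categories} categories")
    offset = 0 if include_first else 1
    encoded = [0.0] * (num_categories - offset)
    position = index - offset
    if position >= 0:  # index 0 stays all zeros when the first column is dropped
        encoded[position] = 1.0
    return encoded


if __name__ == "__main__":
    categories = ["red", "green", "blue"]  # e.g. labels in indexer order
    print(one_hot_columns("color", categories))
    print(one_hot_encode(1, len(categories)))
    print(one_hot_encode(1, len(categories), include_first=False))
```

Dropping one column avoids a redundant feature, since the dropped category is implied when all remaining columns are zero; keeping all columns matches the default stated above.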
Diffstat (limited to 'core')
0 files changed, 0 insertions, 0 deletions