diff options
author | Joseph K. Bradley <joseph@databricks.com> | 2015-04-12 22:38:27 -0700 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2015-04-12 22:38:27 -0700 |
commit | d3792f54974e16cbe8f10b3091d248e0bdd48986 (patch) | |
tree | 89d679f7a9f76599841f169239021f190968654b /bagel | |
parent | fc17661475443d9f0a8d28e3439feeb7a7bca67b (diff) | |
download | spark-d3792f54974e16cbe8f10b3091d248e0bdd48986.tar.gz spark-d3792f54974e16cbe8f10b3091d248e0bdd48986.tar.bz2 spark-d3792f54974e16cbe8f10b3091d248e0bdd48986.zip |
[SPARK-4081] [mllib] VectorIndexer
**Ready for review!**
Since the original PR, I moved the code to the spark.ml API and renamed this to VectorIndexer.
This introduces a VectorIndexer class which does the following:
* VectorIndexer.fit(): collect statistics about how many values each feature in a dataset (RDD[Vector]) can take (limited by maxCategories)
* Feature which exceed maxCategories are declared continuous, and the Model will treat them as such.
* VectorIndexerModel.transform(): Convert categorical feature values to corresponding 0-based indices
Design notes:
* This maintains sparsity in vectors by ensuring that categorical feature value 0.0 gets index 0.
* This does not yet support transforming data with new (unknown) categorical feature values. That can be added later.
* This is necessary for DecisionTree and tree ensembles.
Reviewers: Please check my use of metadata and my unit tests for it; I'm not sure if I covered everything in the tests.
Other notes:
* This also adds a public toMetadata method to AttributeGroup (for simpler construction of metadata).
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #3000 from jkbradley/indexer and squashes the following commits:
5956d91 [Joseph K. Bradley] minor cleanups
f5c57a8 [Joseph K. Bradley] added Java test suite
643b444 [Joseph K. Bradley] removed FeatureTests
02236c3 [Joseph K. Bradley] Updated VectorIndexer, ready for PR
286d221 [Joseph K. Bradley] Reworked DatasetIndexer for spark.ml API, and renamed it to VectorIndexer
12e6cf2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into indexer
6d8f3f1 [Joseph K. Bradley] Added partly done DatasetIndexer to spark.ml
6a2f553 [Joseph K. Bradley] Updated TODO for allowUnknownCategories
3f041f8 [Joseph K. Bradley] Final cleanups for DatasetIndexer
038b9e3 [Joseph K. Bradley] DatasetIndexer now maintains sparsity in SparseVector
3a4a0bd [Joseph K. Bradley] Added another test for DatasetIndexer
2006923 [Joseph K. Bradley] DatasetIndexer now passes tests
f409987 [Joseph K. Bradley] partly done with DatasetIndexerSuite
5e7c874 [Joseph K. Bradley] working on DatasetIndexer
Diffstat (limited to 'bagel')
0 files changed, 0 insertions, 0 deletions