author | Yuhao Yang <hhbyyh@gmail.com> | 2015-07-09 10:26:38 -0700 |
---|---|---|
committer | Joseph K. Bradley <joseph@databricks.com> | 2015-07-09 10:26:38 -0700 |
commit | 0cd84c86cac68600a74d84e50ad40c0c8b84822a (patch) | |
tree | 5c74ebeb5fa6999d14a51ac51a60783f6fb25fca /streaming | |
parent | c59e268d17cf10e46dbdbe760e2a7580a6364692 (diff) | |
download | spark-0cd84c86cac68600a74d84e50ad40c0c8b84822a.tar.gz spark-0cd84c86cac68600a74d84e50ad40c0c8b84822a.tar.bz2 spark-0cd84c86cac68600a74d84e50ad40c0c8b84822a.zip |
[SPARK-8703] [ML] Add CountVectorizer as an ML transformer to convert documents to word count vectors
jira: https://issues.apache.org/jira/browse/SPARK-8703
Converts a text document to a sparse vector of token counts.
I can also add an estimator that extracts the vocabulary from a corpus, if that's appropriate.
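For illustration only, here is a minimal plain-Python sketch of the conversion described above: a tokenized document is mapped to a sparse vector of token counts over a fixed vocabulary. It is not the Spark API and not the code in this patch. The function name `count_vectorize`, the `(size, indices, values)` return shape, and the choice to drop out-of-vocabulary tokens are all assumptions made for the example.

```python
from collections import Counter


def count_vectorize(tokens, vocabulary):
    """Hypothetical helper: sparse token counts as (size, indices, values).

    Assumption: tokens that are not in the vocabulary are ignored.
    """
    # Map each vocabulary term to its position in the output vector.
    index_of = {term: i for i, term in enumerate(vocabulary)}
    # Count occurrences per vocabulary index, skipping unknown tokens.
    counts = Counter(index_of[t] for t in tokens if t in index_of)
    # Keep only non-zero entries, ordered by index.
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return len(vocabulary), indices, values


vocab = ["a", "b", "c"]
doc = ["a", "b", "b", "d"]
print(count_vectorize(doc, vocab))  # (3, [0, 1], [1.0, 2.0])
```

In this sketch the vocabulary is supplied by the caller, which matches the transformer-only scope of the description; extracting the vocabulary from a corpus would be the job of the estimator mentioned above.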
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #7084 from hhbyyh/countVectorization and squashes the following commits:
5f3f655 [Yuhao Yang] text change
24728e4 [Yuhao Yang] style improvement
576728a [Yuhao Yang] rename to model and some fix
1deca28 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization
99b0c14 [Yuhao Yang] undo extension from HashingTF
12c2dc8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization
7ee1c31 [Yuhao Yang] extends HashingTF
809fb59 [Yuhao Yang] minor fix for ut
7c61fb3 [Yuhao Yang] add countVectorizer
Diffstat (limited to 'streaming')
0 files changed, 0 insertions, 0 deletions