field | value | date |
---|---|---|
author | Yuhao Yang <hhbyyh@gmail.com> | 2016-04-20 11:45:08 +0100 |
committer | Sean Owen <sowen@cloudera.com> | 2016-04-20 11:45:08 +0100 |
commit | ed9d80385486cd39a84a689ef467795262af919a | |
tree | 954f979137825630508203428c0f6e6869373138 /examples/src/main/java | |
parent | 17db4bfeaa0074298db622db38a5b0459518c4a9 | |
[SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF
## What changes were proposed in this pull request?
Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
## How was this patch tested?
Unit tests and doc generation.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #12454 from hhbyyh/tfdoc.
Diffstat (limited to 'examples/src/main/java')
-rw-r--r-- | examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java | 2 |
1 files changed, 2 insertions, 0 deletions
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java
index 37a3d0d84d..107c835f2e 100644
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java
+++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java
@@ -63,6 +63,8 @@ public class JavaTfIdfExample {
       .setOutputCol("rawFeatures")
       .setNumFeatures(numFeatures);
     Dataset<Row> featurizedData = hashingTF.transform(wordsData);
+    // alternatively, CountVectorizer can also be used to get term frequency vectors
+
     IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
     IDFModel idfModel = idf.fit(featurizedData);
     Dataset<Row> rescaledData = idfModel.transform(featurizedData);
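
For readers unfamiliar with the two options named in the new comment, the following is a minimal plain-Java sketch of the idea, not Spark code. It contrasts hashing terms into a fixed number of buckets (the HashingTF approach) with indexing terms through an explicit vocabulary (the CountVectorizer approach), then applies an IDF weight of the form log((m + 1) / (df + 1)). The class name `TermFrequencySketch` and all method names are hypothetical. Using `String.hashCode` as the hash function and a first-seen vocabulary order are simplifying assumptions and are not meant to mirror Spark's internals.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; none of these names come from Spark.
class TermFrequencySketch {

    // Hashing approach: each term lands in bucket hash(term) mod numFeatures.
    // No vocabulary is stored, but distinct terms may collide in one bucket.
    static double[] hashedCounts(List<String> words, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String w : words) {
            tf[Math.floorMod(w.hashCode(), numFeatures)] += 1.0;
        }
        return tf;
    }

    // Vocabulary approach, step 1: assign each distinct term its own index
    // (assumption: first-seen order, chosen only to keep the sketch short).
    static Map<String, Integer> buildVocabulary(List<List<String>> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String w : doc) {
                vocab.putIfAbsent(w, vocab.size());
            }
        }
        return vocab;
    }

    // Vocabulary approach, step 2: count terms by their vocabulary index.
    // Terms missing from the vocabulary are ignored.
    static double[] vocabCounts(List<String> words, Map<String, Integer> vocab) {
        double[] tf = new double[vocab.size()];
        for (String w : words) {
            Integer index = vocab.get(w);
            if (index != null) {
                tf[index] += 1.0;
            }
        }
        return tf;
    }

    // IDF weight per feature: log((m + 1) / (df + 1)), where m is the number
    // of documents and df is the number of documents with a nonzero count.
    static double[] idf(List<double[]> tfVectors) {
        int m = tfVectors.size();
        int n = tfVectors.get(0).length;
        double[] weights = new double[n];
        for (int j = 0; j < n; j++) {
            int df = 0;
            for (double[] v : tfVectors) {
                if (v[j] > 0) {
                    df++;
                }
            }
            weights[j] = Math.log((m + 1.0) / (df + 1.0));
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("spark", "ml", "spark"),
            Arrays.asList("ml", "idf"));
        Map<String, Integer> vocab = buildVocabulary(docs);
        double[] tf0 = vocabCounts(docs.get(0), vocab);
        double[] tf1 = vocabCounts(docs.get(1), vocab);
        System.out.println("vocabulary: " + vocab);
        System.out.println("tf(doc 0):  " + Arrays.toString(tf0));
        System.out.println("idf:        " + Arrays.toString(idf(Arrays.asList(tf0, tf1))));
    }
}
```

Either kind of term-frequency vector can feed the IDF step unchanged, which is why the example only needs a comment at the point where `featurizedData` is produced. The trade-off in the sketch is that the vocabulary variant needs an extra pass over the corpus but keeps a reversible term-to-index mapping, while the hashed variant needs no fitting step and can merge unrelated terms.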