commit ed9d80385486cd39a84a689ef467795262af919a
parent 17db4bfeaa0074298db622db38a5b0459518c4a9
author: Yuhao Yang <hhbyyh@gmail.com> 2016-04-20 11:45:08 +0100
committer: Sean Owen <sowen@cloudera.com> 2016-04-20 11:45:08 +0100
[SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF
## What changes were proposed in this pull request?
Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
## How was this patch tested?
Unit tests and doc generation.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #12454 from hhbyyh/tfdoc.
Diffstat (limited to 'docs')
docs/ml-features.md | 15 +-
1 file changed, 12 insertions(+), 3 deletions(-)
```diff
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 876d21f495..11d5acbb10 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -22,10 +22,19 @@ This section covers algorithms for working with features, roughly divided into t
 [Term Frequency-Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
 is a common text pre-processing step. In Spark ML, TF-IDF is separate into two parts:
 TF (+hashing) and IDF.
 
-**TF**: `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+**TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors.
 
-**IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`. The `IDFModel` takes feature vectors (generally created from `HashingTF`) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
+`HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
+fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
+The algorithm combines Term Frequency (TF) counts with the
+[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+
+`CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer
+](ml-features.html#countvectorizer) for more details.
+
+**IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`. The
+`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and scales each column.
+Intuitively, it down-weights columns which appear frequently in a corpus.
 
 Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
```
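The three pieces the updated guide describes — hashing-trick term frequencies (`HashingTF`), vocabulary-based term counts (`CountVectorizer`), and IDF down-weighting — can be sketched in plain Python. This is an illustrative toy, not Spark's implementation: the function names are hypothetical, and the hash indices come from Python's built-in `hash`, not the hash function Spark uses.

```python
import math
from collections import Counter

def hashing_tf(terms, num_features=16):
    # Hashing trick: each term is mapped to an index by hashing, so the
    # vector length is fixed regardless of vocabulary size (cf. HashingTF).
    vec = [0] * num_features
    for term in terms:
        vec[hash(term) % num_features] += 1
    return vec

def count_vectorize(docs):
    # Vocabulary-based alternative (cf. CountVectorizer): build an explicit
    # term-to-index mapping, then count occurrences per document.
    vocab = sorted({t for doc in docs for t in doc})
    counts = [[Counter(doc)[t] for t in vocab] for doc in docs]
    return vocab, counts

def idf_weights(tf_vectors):
    # One IDF weight per column: log((n + 1) / (df + 1)), a smoothed form
    # where df is the number of documents containing the term. Columns that
    # appear in many documents get weights near zero.
    n = len(tf_vectors)
    num_cols = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(num_cols)]
    return [math.log((n + 1) / (d + 1)) for d in df]

docs = [["spark", "mllib", "tf", "idf"], ["spark", "docs"]]
vocab, tf = count_vectorize(docs)
idf = idf_weights(tf)
tfidf = [[c * w for c, w in zip(row, idf)] for row in tf]
```

Either `hashing_tf` or `count_vectorize` produces the TF vectors that the IDF step then rescales column by column, which mirrors why the guide now presents `HashingTF` and `CountVectorizer` as interchangeable front ends to `IDF`.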