author     Yuhao Yang <hhbyyh@gmail.com>    2016-05-17 20:44:19 +0200
committer  Nick Pentreath <nickp@za.ibm.com>    2016-05-17 20:44:19 +0200
commit     3308a862ba0983268c9d5acf9e2a7d2b62d3ec27 (patch)
tree       408b6dcc04bd77e835e445aedb062818a4ce98d8 /docs/ml-features.md
parent     8d05a7a98bdbd3ce7c81d273e05a375877ebe68f (diff)
[SPARK-15182][ML] Copy MLlib doc to ML: ml.feature.tf, idf
## What changes were proposed in this pull request?

We should now begin copying algorithm details from the spark.mllib guide to spark.ml as needed, rather than just linking back to the corresponding algorithms in the spark.mllib user guide.

## How was this patch tested?

Manual review of the doc.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #12957 from hhbyyh/tfidfdoc.
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r--  docs/ml-features.md  51
1 file changed, 42 insertions, 9 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index c79bcac461..c44ace91f2 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -18,27 +18,60 @@ This section covers algorithms for working with features, roughly divided into t
# Feature Extractors
-## TF-IDF (HashingTF and IDF)
-
-[Term Frequency-Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common text pre-processing step. In Spark ML, TF-IDF is separate into two parts: TF (+hashing) and IDF.
+## TF-IDF
+
+[Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
+is a feature vectorization method widely used in text mining to reflect the importance of a term
+to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
+Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`, while
+document frequency `$DF(t, D)$` is the number of documents that contain term `$t$`. If we use only
+term frequency to measure importance, it is very easy to over-emphasize terms that appear very
+often but carry little information about the document, e.g., "a", "the", and "of". A term that appears
+very often across the corpus carries little specific information about any particular document.
+Inverse document frequency is a numerical measure of how much information a term provides:
+`\[
+IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},
+\]`
+where `$|D|$` is the total number of documents in the corpus. Since a logarithm is used, if a term
+appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid
+dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
+`\[
+TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
+\]`
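For concreteness, a small worked instance of these definitions (the counts are purely illustrative and not part of the guide's example data): suppose a corpus of `$|D| = 4$` documents in which term `$t$` appears in exactly one document, 3 times in document `$d$`. Taking `$\log$` as the natural logarithm,
`\[
IDF(t, D) = \log \frac{4 + 1}{1 + 1} \approx 0.916, \quad TFIDF(t, d, D) = 3 \cdot 0.916 \approx 2.75.
\]`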
+There are several variants on the definition of term frequency and document frequency.
+In MLlib, we separate TF and IDF to make them flexible.
**TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors.
`HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the
-[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+`HashingTF` utilizes the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
+A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies
+are calculated based on the mapped indices. This approach avoids the need to compute a global
+term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
+collisions, where different raw features may become the same term after hashing. To reduce the
+chance of collision, we can increase the target feature dimension, i.e., the number of buckets
+of the hash table. Since a simple modulo of the hash value is used to determine the column index,
+it is advisable to use a power of two as the feature dimension; otherwise the features will
+not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
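A minimal sketch of the dimension trade-off described above, assuming a DataFrame `wordsData` that already has a `words` column of tokenized terms (e.g. produced by `Tokenizer`):

```scala
import org.apache.spark.ml.feature.HashingTF

// Fewer buckets save memory but increase the chance of hash collisions;
// a power of two keeps the modulo mapping even across columns.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 10)  // 1,024 buckets instead of the default 2^18

val featurizedData = hashingTF.transform(wordsData)
```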
`CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer
](ml-features.html#countvectorizer) for more details.
**IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`. The
-`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and scales each column.
-Intuitively, it down-weights columns which appear frequently in a corpus.
+`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and
+scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
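A minimal sketch of the `Estimator`/`Model` hand-off, assuming `featurizedData` holds term-frequency vectors in a `rawFeatures` column:

```scala
import org.apache.spark.ml.feature.{IDF, IDFModel}

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel: IDFModel = idf.fit(featurizedData)       // learns per-column IDF weights from the corpus
val rescaledData = idfModel.transform(featurizedData)  // down-weights frequently occurring columns
```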
-Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
+**Note:** `spark.ml` doesn't provide tools for text segmentation.
+We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
+[scalanlp/chalk](https://github.com/scalanlp/chalk).
+
+**Examples**
-In the following code segment, we start with a set of sentences. We split each sentence into words using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
+In the following code segment, we start with a set of sentences. We split each sentence into words
+using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into
+a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance
+when using text as features. Our feature vectors could then be passed to a learning algorithm.
<div class="codetabs">
<div data-lang="scala" markdown="1">
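A self-contained sketch of this flow is given below; the toy sentences, the column names, and the pre-existing `SparkSession` named `spark` are illustrative assumptions, not the guide's bundled example.

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Toy input: each row is a label and a raw sentence (illustrative data).
val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash each bag of words into a fixed-length term-frequency vector.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 18)  // the default 2^18 = 262,144, shown explicitly

val featurizedData = hashingTF.transform(wordsData)

// Fit IDF on the corpus, then rescale the term-frequency vectors.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show(truncate = false)
```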