-rw-r--r--  docs/ml-features.md               51
-rw-r--r--  docs/mllib-feature-extraction.md   3
2 files changed, 45 insertions, 9 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index c79bcac461..c44ace91f2 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -18,27 +18,60 @@ This section covers algorithms for working with features, roughly divided into t
# Feature Extractors
-## TF-IDF (HashingTF and IDF)
-
-[Term Frequency-Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common text pre-processing step. In Spark ML, TF-IDF is separate into two parts: TF (+hashing) and IDF.
+## TF-IDF
+
+[Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
+is a feature vectorization method widely used in text mining to reflect the importance of a term
+to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
+Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`, while
+document frequency `$DF(t, D)$` is the number of documents that contain term `$t$`. If we only use
+term frequency to measure importance, it is very easy to over-emphasize terms that appear very
+often but carry little information about the document, e.g., "a", "the", and "of". If a term appears
+very often across the corpus, it carries little information about any particular document.
+Inverse document frequency is a numerical measure of how much information a term provides:
+`\[
+IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},
+\]`
+where `$|D|$` is the total number of documents in the corpus. Since the logarithm is used, if a term
+appears in every document, its IDF value becomes 0. Note that a smoothing term is applied to avoid
+dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
+`\[
+TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
+\]`
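+For example, with `$|D| = 100$` documents and the natural logarithm, a term that appears in 10 of
+them has `$IDF = \log \frac{100 + 1}{10 + 1} \approx 2.22$`, while a term that appears in all 100
+documents has `$IDF = \log \frac{101}{101} = 0$`, so its TF-IDF is 0 regardless of its term frequency.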
+There are several variants of the definition of term frequency and document frequency.
+In MLlib, we separate TF and IDF to make them flexible.
**TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors.
`HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the
-[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+`HashingTF` utilizes the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
+A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies
+are calculated based on the mapped indices. This approach avoids the need to compute a global
+term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
+collisions, where different raw features may become the same term after hashing. To reduce the
+chance of collision, we can increase the target feature dimension, i.e., the number of buckets
+of the hash table. Since a simple modulo is used to transform the hash value into a column index,
+it is advisable to use a power of two as the feature dimension, otherwise the features will
+not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
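+As a minimal sketch (the column names here are illustrative, not required):
+
+```scala
+import org.apache.spark.ml.feature.HashingTF
+
+// numFeatures is the number of hash buckets; keeping it a power of two
+// avoids the uneven column mapping described above.
+val hashingTF = new HashingTF()
+  .setInputCol("words")        // input: a column of token sequences
+  .setOutputCol("rawFeatures") // output: fixed-length sparse vectors
+  .setNumFeatures(1 << 18)     // 2^18 = 262,144, the default
+```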
`CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer
](ml-features.html#countvectorizer) for more details.
**IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`. The
-`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and scales each column.
-Intuitively, it down-weights columns which appear frequently in a corpus.
+`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and
+scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
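+A minimal sketch of the Estimator/Model pattern, assuming `featurizedData` already holds term
+frequency vectors in a `rawFeatures` column (e.g., produced by `HashingTF`):
+
+```scala
+import org.apache.spark.ml.feature.IDF
+
+val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
+val idfModel = idf.fit(featurizedData)            // Estimator: one pass to compute DF counts
+val rescaled = idfModel.transform(featurizedData) // Model: rescales each TF vector by IDF
+```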
-Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
+**Note:** `spark.ml` doesn't provide tools for text segmentation.
+We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
+[scalanlp/chalk](https://github.com/scalanlp/chalk).
+
+**Examples**
-In the following code segment, we start with a set of sentences. We split each sentence into words using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
+In the following code segment, we start with a set of sentences. We split each sentence into words
+using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into
+a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance
+when using text as features. Our feature vectors could then be passed to a learning algorithm.
<div class="codetabs">
<div data-lang="scala" markdown="1">
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 7a97285032..4c027c84ec 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -10,6 +10,9 @@ displayTitle: Feature Extraction and Transformation - spark.mllib
## TF-IDF
+**Note:** We recommend using the DataFrame-based API, which is detailed in the [ML user guide on
+TF-IDF](ml-features.html#tf-idf).
+
[Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a feature
vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.