-rw-r--r--  docs/ml-features.md               51
-rw-r--r--  docs/mllib-feature-extraction.md   3
2 files changed, 45 insertions, 9 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index c79bcac461..c44ace91f2 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -18,27 +18,60 @@ This section covers algorithms for working with features, roughly divided into t
# Feature Extractors
-## TF-IDF (HashingTF and IDF)
-
-[Term Frequency-Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common text pre-processing step. In Spark ML, TF-IDF is separate into two parts: TF (+hashing) and IDF.
+## TF-IDF
+
+[Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
+is a feature vectorization method widely used in text mining to reflect the importance of a term
+to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
+Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`, while
+document frequency `$DF(t, D)$` is the number of documents that contain term `$t$`. If we only use
+term frequency to measure importance, it is very easy to over-emphasize terms that appear very
+often but carry little information about the document, e.g., "a", "the", and "of". If a term appears
+very often across the corpus, it carries little information about any particular document.
+Inverse document frequency is a numerical measure of how much information a term provides:
+`\[
+IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},
+\]`
+where `$|D|$` is the total number of documents in the corpus. Since the logarithm is used, if a term
+appears in every document, its IDF value becomes 0. Note that a smoothing term is applied to avoid
+dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
+`\[
+TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
+\]`
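+For example, with `$|D| = 100$` documents and the natural logarithm, a term that appears in 10 of
+them has `$IDF = \log \frac{100 + 1}{10 + 1} \approx 2.22$`, while a term that appears in all 100
+documents has `$IDF = \log \frac{101}{101} = 0$`, so its TF-IDF is 0 regardless of its term frequency.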
+There are several variants of the definition of term frequency and document frequency.
+In MLlib, we separate TF and IDF to make them flexible.
**TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors.
`HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the
-[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+`HashingTF` utilizes the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
+A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies
+are calculated based on the mapped indices. This approach avoids the need to compute a global
+term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
+collisions, where different raw features may become the same term after hashing. To reduce the
+chance of collision, we can increase the target feature dimension, i.e., the number of buckets
+of the hash table. Since a simple modulo is used to transform the hash value into a column index,
+it is advisable to use a power of two as the feature dimension, otherwise the features will
+not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
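+As a minimal sketch (the column names here are illustrative, not required):
+
+```scala
+import org.apache.spark.ml.feature.HashingTF
+
+// numFeatures is the number of hash buckets; keeping it a power of two
+// avoids the uneven column mapping described above.
+val hashingTF = new HashingTF()
+  .setInputCol("words")        // input: a column of token sequences
+  .setOutputCol("rawFeatures") // output: fixed-length sparse vectors
+  .setNumFeatures(1 << 18)     // 2^18 = 262,144, the default
+```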
`CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer
](ml-features.html#countvectorizer) for more details.
**IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`. The
-`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and scales each column.
-Intuitively, it down-weights columns which appear frequently in a corpus.
+`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and
+scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
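+A minimal sketch of the Estimator/Model pattern, assuming `featurizedData` already holds term
+frequency vectors in a `rawFeatures` column (e.g., produced by `HashingTF`):
+
+```scala
+import org.apache.spark.ml.feature.IDF
+
+val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
+val idfModel = idf.fit(featurizedData)            // Estimator: one pass to compute DF counts
+val rescaled = idfModel.transform(featurizedData) // Model: rescales each TF vector by IDF
+```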
-Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
+**Note:** `spark.ml` doesn't provide tools for text segmentation.
+We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
+[scalanlp/chalk](https://github.com/scalanlp/chalk).
+
+**Examples**
-In the following code segment, we start with a set of sentences. We split each sentence into words using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
+In the following code segment, we start with a set of sentences. We split each sentence into words
+using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into
+a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance
+when using text as features. Our feature vectors could then be passed to a learning algorithm.
<div class="codetabs">
<div data-lang="scala" markdown="1">
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 7a97285032..4c027c84ec 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -10,6 +10,9 @@ displayTitle: Feature Extraction and Transformation - spark.mllib
## TF-IDF
+**Note:** We recommend using the DataFrame-based API, which is detailed in the [ML user guide on
+TF-IDF](ml-features.html#tf-idf).
+
[Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a feature
vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.