author: Yuhao Yang <hhbyyh@gmail.com> 2015-05-12 15:12:29 -0700
committer: Joseph K. Bradley <joseph@databricks.com> 2015-05-12 15:12:29 -0700
commit: 1d703660d4d14caea697affdf31170aea44c8903 (patch)
tree: ef42819a3cd0e08bed69dc0d5376ed2d06519f7b /docs/mllib-clustering.md
parent: 1422e79e517ca14a6b0e178f015362d2e0d413c6 (diff)
[SPARK-7496] [MLLIB] Update Programming guide with Online LDA
jira: https://issues.apache.org/jira/browse/SPARK-7496

Update the LDA subsection of the clustering section of the MLlib programming guide to include OnlineLDA.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6046 from hhbyyh/ldaDocument and squashes the following commits:

4b6fbfa [Yuhao Yang] add online paper and some comparison
fd4c983 [Yuhao Yang] update lda document for optimizers
Diffstat (limited to 'docs/mllib-clustering.md')
-rw-r--r--  docs/mllib-clustering.md | 6
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index f5aa15b7d9..f41ca70952 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -377,11 +377,11 @@ LDA can be thought of as a clustering algorithm as follows:
on a statistical model of how text documents are generated.
LDA takes in a collection of documents as vectors of word counts.
-It learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
-on the likelihood function. After fitting on the documents, LDA provides:
+It supports different inference algorithms via the `setOptimizer` function. `EMLDAOptimizer` learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
+on the likelihood function and yields comprehensive results, while `OnlineLDAOptimizer` uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory-friendly. After fitting on the documents, LDA provides:
* Topics: Inferred topics, each of which is a probability distribution over terms (words).
-* Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics.
+* Topic distributions for documents: For each document in the training set, LDA gives a probability distribution over topics. (EM only)
LDA takes the following parameters:
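
For context, here is a minimal sketch of selecting an optimizer through `setOptimizer` in the RDD-based MLlib API. It is illustrative only: the `"em"`/`"online"` string arguments, the toy corpus, and the `topicsMatrix` call are assumptions based on the `LDA` API of this release, not part of the patch above.

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Corpus as an RDD of (document ID, term-count vector) pairs,
// here over a toy 3-word vocabulary. `sc` is the SparkContext.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 3.0, 1.0))
))

// Default EM optimizer: batch expectation-maximization over the corpus.
val emModel = new LDA().setK(2).setOptimizer("em").run(corpus)

// Online variational inference: iterates over mini-batches of documents,
// so it is generally more memory-friendly on large corpora.
val onlineModel = new LDA().setK(2).setOptimizer("online").run(corpus)

// Inferred topics: a vocabSize-by-k matrix of term weights per topic,
// available from either model.
val topics = emModel.topicsMatrix
```

If memory serves, `run` with the EM optimizer returns a `DistributedLDAModel`, which also retains the per-document topic distributions noted as "(EM only)" above, while the online optimizer returns a `LocalLDAModel` that holds only the inferred topics.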