path: root/docs/mllib-clustering.md
author    Joseph K. Bradley <joseph@databricks.com>  2015-09-15 19:43:26 -0700
committer Xiangrui Meng <meng@databricks.com>        2015-09-15 19:43:26 -0700
commit    b921fe4dc0442aa133ab7d55fba24bc798d59aa2 (patch)
tree      5a545ee45ab39f6caad096049818564914635334 /docs/mllib-clustering.md
parent    64c29afcb787d9f176a197c25314295108ba0471 (diff)
download  spark-b921fe4dc0442aa133ab7d55fba24bc798d59aa2.tar.gz
          spark-b921fe4dc0442aa133ab7d55fba24bc798d59aa2.tar.bz2
          spark-b921fe4dc0442aa133ab7d55fba24bc798d59aa2.zip
[SPARK-10595] [ML] [MLLIB] [DOCS] Various ML guide cleanups
Various ML guide cleanups.

* ml-guide.md: Make it easier to access the algorithm-specific guides.
* LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically. E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics.
* mllib-feature-extraction.html#elementwiseproduct: "w" parameter should be "scalingVec"
* Clean up Binarizer user guide a little.
* Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
* spark.ml Word2Vec user guide: clean up grammar/writing
* Chi Sq Feature Selector docs: Improve text in doc.

CC: mengxr feynmanliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8752 from jkbradley/mlguide-fixes-1.5.
Diffstat (limited to 'docs/mllib-clustering.md')
-rw-r--r--  docs/mllib-clustering.md | 4 ++++
1 file changed, 4 insertions(+), 0 deletions(-)
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 3fb35d3c50..c2711cf82d 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -507,6 +507,10 @@ must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
 $> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
 * `maxIterations`: The maximum number of EM iterations.
+*Note*: It is important to do enough iterations. In early iterations, EM often has useless topics,
+but those topics improve dramatically after more iterations. Using at least 20 and possibly
+50-100 iterations is often reasonable, depending on your dataset.
+
 `EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
 the inferred topics but also the full training corpus and topic
 distributions for each document in the training corpus. A
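The note added by this commit, that EM needs enough iterations before topics become meaningful, translates directly into how `maxIterations` is set when training. A minimal sketch in the spark.mllib RDD-based API (the corpus of two toy documents and the choice of `k = 2` are illustrative assumptions, not from the patch):

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA}
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical corpus: an RDD of (document ID, term-count vector) pairs.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 3.0, 1.0))
))

// Per the note above, run EM for 50 iterations rather than the default 20,
// since early-iteration topics are often useless and improve with more EM steps.
val ldaModel = new LDA()
  .setK(2)
  .setMaxIterations(50)
  .setOptimizer(new EMLDAOptimizer)
  .run(corpus)

// With EMLDAOptimizer, the returned model is a DistributedLDAModel, which
// also holds per-document topic distributions for the training corpus.
val distModel = ldaModel.asInstanceOf[DistributedLDAModel]
```

This assumes an existing `SparkContext` named `sc`, as in the surrounding MLlib guide examples.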