aboutsummaryrefslogtreecommitdiff
path: root/docs/ml-guide.md
diff options
context:
space:
mode:
authorJoseph K. Bradley <joseph@databricks.com>2015-09-15 19:43:26 -0700
committerXiangrui Meng <meng@databricks.com>2015-09-15 19:43:26 -0700
commitb921fe4dc0442aa133ab7d55fba24bc798d59aa2 (patch)
tree5a545ee45ab39f6caad096049818564914635334 /docs/ml-guide.md
parent64c29afcb787d9f176a197c25314295108ba0471 (diff)
downloadspark-b921fe4dc0442aa133ab7d55fba24bc798d59aa2.tar.gz
spark-b921fe4dc0442aa133ab7d55fba24bc798d59aa2.tar.bz2
spark-b921fe4dc0442aa133ab7d55fba24bc798d59aa2.zip
[SPARK-10595] [ML] [MLLIB] [DOCS] Various ML guide cleanups
Various ML guide cleanups. * ml-guide.md: Make it easier to access the algorithm-specific guides. * LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically. E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics. * mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be “scalingVec” * Clean up Binarizer user guide a little. * Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place. * spark.ml Word2Vec user guide: clean up grammar/writing * Chi Sq Feature Selector docs: Improve text in doc. CC: mengxr feynmanliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8752 from jkbradley/mlguide-fixes-1.5.
Diffstat (limited to 'docs/ml-guide.md')
-rw-r--r--docs/ml-guide.md31
1 files changed, 20 insertions, 11 deletions
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 78c93a95c7..c5d7f99002 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -32,7 +32,21 @@ See the [algorithm guides](#algorithm-guides) section below for guides on sub-pa
* This will become a table of contents (this text will be scraped).
{:toc}
-# Main concepts
+# Algorithm guides
+
+We provide several algorithm guides specific to the Pipelines API.
+Several of these algorithms, such as certain feature transformers, are not in the `spark.mllib` API.
+Also, some algorithms have additional capabilities in the `spark.ml` API; e.g., random forests
+provide class probabilities, and linear models provide model summaries.
+
+* [Feature extraction, transformation, and selection](ml-features.html)
+* [Decision Trees for classification and regression](ml-decision-tree.html)
+* [Ensembles](ml-ensembles.html)
+* [Linear methods with elastic net regularization](ml-linear-methods.html)
+* [Multilayer perceptron classifier](ml-ann.html)
+
+
+# Main concepts in Pipelines
Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple
algorithms into a single pipeline, or workflow.
@@ -166,6 +180,11 @@ compile-time type checking.
`Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`.
This type checking is done using the `DataFrame` *schema*, a description of the data types of columns in the `DataFrame`.
+*Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances. E.g., the same instance
+`myHashingTF` should not be inserted into the `Pipeline` twice since `Pipeline` stages must have
+unique IDs. However, different instances `myHashingTF1` and `myHashingTF2` (both of type `HashingTF`)
+can be put into the same `Pipeline` since different instances will be created with different IDs.
+
## Parameters
Spark ML `Estimator`s and `Transformer`s use a uniform API for specifying parameters.
@@ -184,16 +203,6 @@ Parameters belong to specific instances of `Estimator`s and `Transformer`s.
For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
-# Algorithm guides
-
-There are now several algorithms in the Pipelines API which are not in the `spark.mllib` API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in the Pipelines.
-
-* [Feature extraction, transformation, and selection](ml-features.html)
-* [Decision Trees for classification and regression](ml-decision-tree.html)
-* [Ensembles](ml-ensembles.html)
-* [Linear methods with elastic net regularization](ml-linear-methods.html)
-* [Multilayer perceptron classifier](ml-ann.html)
-
# Code examples
This section gives code examples illustrating the functionality discussed above.