path: root/docs/ml-tuning.md
author    Joseph K. Bradley <joseph@databricks.com>  2016-07-15 13:38:23 -0700
committer Joseph K. Bradley <joseph@databricks.com>  2016-07-15 13:38:23 -0700
commit    5ffd5d3838da40ad408a6f40071fe6f4dcacf2a1 (patch)
tree      4d2c6476c38f84ef34eef20077f8e491b172681d /docs/ml-tuning.md
parent    71ad945bbbdd154eae852cd7f841e98f7a83e8d4 (diff)
download  spark-5ffd5d3838da40ad408a6f40071fe6f4dcacf2a1.tar.gz
          spark-5ffd5d3838da40ad408a6f40071fe6f4dcacf2a1.tar.bz2
          spark-5ffd5d3838da40ad408a6f40071fe6f4dcacf2a1.zip
[SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide
## What changes were proposed in this pull request?

Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
  * **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
  * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
  * **Reviewers**: I did not change any of the content of the migration guides.

Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
  * **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added

Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page

## How was this patch tested?

Generated docs locally

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14213 from jkbradley/ml-guide-2.0.
Diffstat (limited to 'docs/ml-tuning.md')
-rw-r--r--  docs/ml-tuning.md  121
1 file changed, 121 insertions, 0 deletions
diff --git a/docs/ml-tuning.md b/docs/ml-tuning.md
new file mode 100644
index 0000000000..2ca90c7092
--- /dev/null
+++ b/docs/ml-tuning.md
@@ -0,0 +1,121 @@
+---
+layout: global
+title: "ML Tuning"
+displayTitle: "ML Tuning: model selection and hyperparameter tuning"
+---
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+This section describes how to use MLlib's tooling for tuning ML algorithms and `Pipeline`s.
+Built-in tools such as cross-validation allow users to optimize the hyperparameters of an individual algorithm or of an entire `Pipeline`.
+
+**Table of contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+# Model selection (a.k.a. hyperparameter tuning)
+
+An important task in ML is *model selection*, or using data to find the best model or parameters for a given task. This is also called *tuning*.
+Tuning may be done for individual `Estimator`s such as `LogisticRegression`, or for entire `Pipeline`s which include multiple algorithms, featurization, and other steps. Users can tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately.
+
+MLlib supports model selection using tools such as [`CrossValidator`](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) and [`TrainValidationSplit`](api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit).
+These tools require the following items (see the sketch after this list):
+
+* [`Estimator`](api/scala/index.html#org.apache.spark.ml.Estimator): algorithm or `Pipeline` to tune
+* Set of `ParamMap`s: parameters to choose from, sometimes called a "parameter grid" to search over
+* [`Evaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.Evaluator): metric to measure how well a fitted `Model` does on held-out test data
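+
+As a minimal sketch, the three items might be constructed as follows. The choice of `LogisticRegression`, the parameter values, and the variable names here are illustrative, not prescriptive:
+
+{% highlight scala %}
+import org.apache.spark.ml.classification.LogisticRegression
+import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
+import org.apache.spark.ml.tuning.ParamGridBuilder
+
+// Estimator: the algorithm (or Pipeline) to tune.
+val lr = new LogisticRegression()
+
+// Set of ParamMaps: a small grid over one hyperparameter.
+val paramGrid = new ParamGridBuilder()
+  .addGrid(lr.regParam, Array(0.1, 0.01))
+  .build()
+
+// Evaluator: the metric used to compare fitted Models.
+val evaluator = new BinaryClassificationEvaluator()
+{% endhighlight %}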
+
+At a high level, these model selection tools work as follows (see the conceptual sketch after this list):
+
+* They split the input data into separate training and test datasets.
+* For each (training, test) pair, they iterate through the set of `ParamMap`s:
+ * For each `ParamMap`, they fit the `Estimator` using those parameters, get the fitted `Model`, and evaluate the `Model`'s performance using the `Evaluator`.
+* They select the `Model` produced by the best-performing set of parameters.
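+
+Conceptually, the inner selection loop amounts to something like the following sketch. This is illustrative only, not the actual implementation: `estimator`, `paramGrid`, and `evaluator` are assumed to be defined as above, `trainingData` and `testData` are assumed splits of the input, and a larger metric is assumed to be better:
+
+{% highlight scala %}
+// For each ParamMap, fit on the training split and score on the test split.
+val paramsAndMetrics = paramGrid.map { params =>
+  val model = estimator.fit(trainingData, params)
+  (params, evaluator.evaluate(model.transform(testData)))
+}
+
+// Pick the ParamMap with the best metric.  (In practice,
+// Evaluator.isLargerBetter indicates the direction of the chosen metric.)
+val bestParams = paramsAndMetrics.maxBy(_._2)._1
+{% endhighlight %}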
+
+The `Evaluator` can be a [`RegressionEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator)
+for regression problems, a [`BinaryClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator)
+for binary data, or a [`MulticlassClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator)
+for multiclass problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName`
+method in each of these evaluators.
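+
+For instance, a `BinaryClassificationEvaluator` defaults to area under the ROC curve but can be switched to area under the precision-recall curve:
+
+{% highlight scala %}
+import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
+
+val evaluator = new BinaryClassificationEvaluator()
+  .setMetricName("areaUnderPR")  // default is "areaUnderROC"
+{% endhighlight %}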
+
+To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility.
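+
+For example, a grid over two parameters is the cross product of the supplied values, here `$3 \times 2 = 6$` `ParamMap`s. In this sketch, `hashingTF` and `lr` stand for previously constructed pipeline stages, as in the cross-validation example below:
+
+{% highlight scala %}
+import org.apache.spark.ml.tuning.ParamGridBuilder
+
+val paramGrid = new ParamGridBuilder()
+  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
+  .addGrid(lr.regParam, Array(0.1, 0.01))
+  .build()  // an Array of 6 ParamMaps
+{% endhighlight %}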
+
+# Cross-Validation
+
+`CrossValidator` begins by splitting the dataset into a set of *folds* which are used as separate training and test datasets. E.g., with `$k=3$` folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular `ParamMap`, `CrossValidator` computes the average evaluation metric for the 3 `Model`s produced by fitting the `Estimator` on the 3 different (training, test) dataset pairs.
+
+After identifying the best `ParamMap`, `CrossValidator` finally re-fits the `Estimator` using the best `ParamMap` and the entire dataset.
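+
+A minimal sketch of wiring this together, assuming an `Estimator` such as a `Pipeline` named `pipeline`, plus `paramGrid` and `evaluator` as above, and a training DataFrame `training`:
+
+{% highlight scala %}
+import org.apache.spark.ml.tuning.CrossValidator
+
+val cv = new CrossValidator()
+  .setEstimator(pipeline)
+  .setEstimatorParamMaps(paramGrid)
+  .setEvaluator(evaluator)
+  .setNumFolds(3)  // k = 3 folds
+
+// cvModel wraps the best Model, re-fit on the entire dataset.
+val cvModel = cv.fit(training)
+{% endhighlight %}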
+
+## Example: model selection via cross-validation
+
+The following example demonstrates using `CrossValidator` to select from a grid of parameters.
+
+Note that cross-validation over a grid of parameters is expensive.
+E.g., in the example below, the parameter grid has 3 values for `hashingTF.numFeatures` and 2 values for `lr.regParam`, and `CrossValidator` uses 2 folds. This multiplies out to `$(3 \times 2) \times 2 = 12$` different models being trained.
+In realistic settings, it can be common to try many more parameters and use more folds (`$k=3$` and `$k=10$` are common).
+In other words, using `CrossValidator` can be very expensive.
+However, it is also a well-established method for choosing parameters which is more statistically sound than heuristic hand-tuning.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala %}
+</div>
+
+<div data-lang="java">
+{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaCrossValidationExample.java %}
+</div>
+
+<div data-lang="python">
+
+{% include_example python/ml/cross_validator.py %}
+</div>
+
+</div>
+
+# Train-Validation Split
+
+In addition to `CrossValidator`, Spark also offers `TrainValidationSplit` for hyperparameter tuning.
+`TrainValidationSplit` evaluates each combination of parameters only once, as opposed to k times in
+the case of `CrossValidator`. It is therefore less expensive,
+but it will not produce results as reliable when the training dataset is not sufficiently large.
+
+Unlike `CrossValidator`, `TrainValidationSplit` creates a single (training, validation) dataset pair.
+It splits the dataset into these two parts using the `trainRatio` parameter. For example, with `$trainRatio=0.75$`,
+`TrainValidationSplit` will generate a pair in which 75% of the data is used for training and 25% for validation.
+
+Like `CrossValidator`, `TrainValidationSplit` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.
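+
+A corresponding sketch, again assuming `estimator`, `paramGrid`, `evaluator`, and a training DataFrame `training` are defined as in the earlier sketches:
+
+{% highlight scala %}
+import org.apache.spark.ml.tuning.TrainValidationSplit
+
+val tvs = new TrainValidationSplit()
+  .setEstimator(estimator)
+  .setEstimatorParamMaps(paramGrid)
+  .setEvaluator(evaluator)
+  .setTrainRatio(0.75)  // 75% for training, 25% for validation
+
+// The returned model was re-fit on the entire dataset with the best ParamMap.
+val tvsModel = tvs.fit(training)
+{% endhighlight %}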
+
+## Example: model selection via train-validation split
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaTrainValidationSplitExample.java %}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/train_validation_split.py %}
+</div>
+
+</div>