Diffstat (limited to 'docs')
-rw-r--r--  docs/_data/menu-ml.yaml                  |   6
-rw-r--r--  docs/_includes/nav-left-wrapper-ml.html  |   4
-rwxr-xr-x  docs/_layouts/global.html                |   2
-rw-r--r--  docs/index.md                            |   4
-rw-r--r--  docs/ml-advanced.md                      |   4
-rw-r--r--  docs/ml-ann.md                           |   4
-rw-r--r--  docs/ml-classification-regression.md     |  60
-rw-r--r--  docs/ml-clustering.md                    |   8
-rw-r--r--  docs/ml-collaborative-filtering.md       |   4
-rw-r--r--  docs/ml-decision-tree.md                 |   4
-rw-r--r--  docs/ml-ensembles.md                     |   4
-rw-r--r--  docs/ml-features.md                      |   4
-rw-r--r--  docs/ml-guide.md                         | 461
-rw-r--r--  docs/ml-linear-methods.md                |   4
-rw-r--r--  docs/ml-migration-guides.md              | 159
-rw-r--r--  docs/ml-pipeline.md                      | 245
-rw-r--r--  docs/ml-survival-regression.md           |   4
-rw-r--r--  docs/ml-tuning.md                        | 121
-rw-r--r--  docs/mllib-classification-regression.md  |   4
-rw-r--r--  docs/mllib-clustering.md                 |   4
-rw-r--r--  docs/mllib-collaborative-filtering.md    |   4
-rw-r--r--  docs/mllib-data-types.md                 |   4
-rw-r--r--  docs/mllib-decision-tree.md              |   4
-rw-r--r--  docs/mllib-dimensionality-reduction.md   |   4
-rw-r--r--  docs/mllib-ensembles.md                  |   4
-rw-r--r--  docs/mllib-evaluation-metrics.md         |   4
-rw-r--r--  docs/mllib-feature-extraction.md         |   4
-rw-r--r--  docs/mllib-frequent-pattern-mining.md    |   4
-rw-r--r--  docs/mllib-guide.md                      | 219
-rw-r--r--  docs/mllib-isotonic-regression.md        |   4
-rw-r--r--  docs/mllib-linear-methods.md             |   4
-rw-r--r--  docs/mllib-migration-guides.md           | 158
-rw-r--r--  docs/mllib-naive-bayes.md                |   4
-rw-r--r--  docs/mllib-optimization.md               |   4
-rw-r--r--  docs/mllib-pmml-model-export.md          |   4
-rw-r--r--  docs/mllib-statistics.md                 |   4
-rw-r--r--  docs/programming-guide.md                |   2
-rw-r--r--  docs/streaming-programming-guide.md      |   4
38 files changed, 807 insertions(+), 742 deletions(-)
diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml
index 3fd3ee2823..0c6b9b20a6 100644
--- a/docs/_data/menu-ml.yaml
+++ b/docs/_data/menu-ml.yaml
@@ -1,5 +1,5 @@
-- text: "Overview: estimators, transformers and pipelines"
- url: ml-guide.html
+- text: Pipelines
+ url: ml-pipeline.html
- text: Extracting, transforming and selecting features
url: ml-features.html
- text: Classification and Regression
@@ -8,5 +8,7 @@
url: ml-clustering.html
- text: Collaborative filtering
url: ml-collaborative-filtering.html
+- text: Model selection and tuning
+ url: ml-tuning.html
- text: Advanced topics
url: ml-advanced.html
diff --git a/docs/_includes/nav-left-wrapper-ml.html b/docs/_includes/nav-left-wrapper-ml.html
index e2d7eda027..00ac6cc0db 100644
--- a/docs/_includes/nav-left-wrapper-ml.html
+++ b/docs/_includes/nav-left-wrapper-ml.html
@@ -1,8 +1,8 @@
<div class="left-menu-wrapper">
<div class="left-menu">
- <h3><a href="ml-guide.html">spark.ml package</a></h3>
+ <h3><a href="ml-guide.html">MLlib: Main Guide</a></h3>
{% include nav-left.html nav=include.nav-ml %}
- <h3><a href="mllib-guide.html">spark.mllib package</a></h3>
+ <h3><a href="mllib-guide.html">MLlib: RDD-based API Guide</a></h3>
{% include nav-left.html nav=include.nav-mllib %}
</div>
</div>
\ No newline at end of file
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 2d0c3fd712..d3bf082aa7 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -74,7 +74,7 @@
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="sql-programming-guide.html">DataFrames, Datasets and SQL</a></li>
<li><a href="structured-streaming-programming-guide.html">Structured Streaming</a></li>
- <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
+ <li><a href="ml-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
<li><a href="sparkr.html">SparkR (R on Spark)</a></li>
</ul>
diff --git a/docs/index.md b/docs/index.md
index 7157afc411..0cb8803783 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -8,7 +8,7 @@ description: Apache Spark SPARK_VERSION_SHORT documentation homepage
Apache Spark is a fast and general-purpose cluster computing system.
It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs.
-It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
+It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
# Downloading
@@ -87,7 +87,7 @@ options for deployment:
* Modules built on Spark:
* [Spark Streaming](streaming-programming-guide.html): processing real-time data streams
* [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): support for structured data and relational queries
- * [MLlib](mllib-guide.html): built-in machine learning library
+ * [MLlib](ml-guide.html): built-in machine learning library
* [GraphX](graphx-programming-guide.html): Spark's new API for graph processing
**API Docs:**
diff --git a/docs/ml-advanced.md b/docs/ml-advanced.md
index 1c5f844b08..f5804fdeee 100644
--- a/docs/ml-advanced.md
+++ b/docs/ml-advanced.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Advanced topics - spark.ml
-displayTitle: Advanced topics - spark.ml
+title: Advanced topics
+displayTitle: Advanced topics
---
* Table of contents
diff --git a/docs/ml-ann.md b/docs/ml-ann.md
index c2d9bd200f..7c460c4af6 100644
--- a/docs/ml-ann.md
+++ b/docs/ml-ann.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Multilayer perceptron classifier - spark.ml
-displayTitle: Multilayer perceptron classifier - spark.ml
+title: Multilayer perceptron classifier
+displayTitle: Multilayer perceptron classifier
---
> This section has been moved into the
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md
index 3d6106b532..7c2437eacd 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Classification and regression - spark.ml
-displayTitle: Classification and regression - spark.ml
+title: Classification and regression
+displayTitle: Classification and regression
---
@@ -22,37 +22,14 @@ displayTitle: Classification and regression - spark.ml
\newcommand{\zero}{\mathbf{0}}
\]`
+This page covers algorithms for Classification and Regression. It also includes sections
+discussing specific classes of algorithms, such as linear methods, trees, and ensembles.
+
**Table of Contents**
* This will become a table of contents (this text will be scraped).
{:toc}
-In `spark.ml`, we implement popular linear methods such as logistic
-regression and linear least squares with $L_1$ or $L_2$ regularization.
-Refer to [the linear methods in mllib](mllib-linear-methods.html) for
-details about implementation and tuning. We also include a DataFrame API for [Elastic
-net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
-of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
-and variable selection via the elastic
-net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
-Mathematically, it is defined as a convex combination of the $L_1$ and
-the $L_2$ regularization terms:
-`\[
-\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
-\]`
-By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
-regularization as special cases. For example, if a [linear
-regression](https://en.wikipedia.org/wiki/Linear_regression) model is
-trained with the elastic net parameter $\alpha$ set to $1$, it is
-equivalent to a
-[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
-On the other hand, if $\alpha$ is set to $0$, the trained model reduces
-to a [ridge
-regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
-We implement Pipelines API for both linear regression and logistic
-regression with elastic net regularization.
-
-
# Classification
## Logistic regression
@@ -760,7 +737,34 @@ Refer to the [`IsotonicRegression` Python docs](api/python/pyspark.ml.html#pyspa
</div>
</div>
+# Linear methods
+
+We implement popular linear methods such as logistic
+regression and linear least squares with $L_1$ or $L_2$ regularization.
+Refer to [the linear methods guide for the RDD-based API](mllib-linear-methods.html) for
+details about implementation and tuning; this information is still relevant.
+We also include a DataFrame API for [Elastic
+net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
+of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
+and variable selection via the elastic
+net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
+Mathematically, it is defined as a convex combination of the $L_1$ and
+the $L_2$ regularization terms:
+`\[
+\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
+\]`
+By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
+regularization as special cases. For example, if a [linear
+regression](https://en.wikipedia.org/wiki/Linear_regression) model is
+trained with the elastic net parameter $\alpha$ set to $1$, it is
+equivalent to a
+[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
+On the other hand, if $\alpha$ is set to $0$, the trained model reduces
+to a [ridge
+regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
+We implement the Pipelines API for both linear regression and logistic
+regression with elastic net regularization.
# Decision trees
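The special cases of the elastic net penalty described in the added text above can be checked numerically. The following is a minimal pure-Python sketch (illustrative only; `elastic_net_penalty` is a hypothetical helper, not Spark code) showing that $\alpha = 1$ recovers the lasso ($L_1$) penalty and $\alpha = 0$ the ridge ($L_2$) penalty:

```python
def elastic_net_penalty(w, alpha, lam):
    """Elastic net term: alpha * (lam * ||w||_1) + (1 - alpha) * (lam/2 * ||w||_2^2)."""
    l1 = sum(abs(x) for x in w)          # ||w||_1
    l2_sq = sum(x * x for x in w)        # ||w||_2^2
    return alpha * (lam * l1) + (1 - alpha) * (lam / 2.0) * l2_sq

w = [0.5, -1.0, 2.0]
lam = 0.1
lasso = elastic_net_penalty(w, 1.0, lam)  # alpha = 1: pure L1 (lasso) penalty
ridge = elastic_net_penalty(w, 0.0, lam)  # alpha = 0: pure L2 (ridge) penalty
```

Intermediate values of $\alpha$ give a convex blend of the two penalties, which is exactly what `setElasticNetParam` controls in the DataFrame API.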
diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md
index 8656eb4001..8a0a61cb59 100644
--- a/docs/ml-clustering.md
+++ b/docs/ml-clustering.md
@@ -1,10 +1,12 @@
---
layout: global
-title: Clustering - spark.ml
-displayTitle: Clustering - spark.ml
+title: Clustering
+displayTitle: Clustering
---
-In this section, we introduce the pipeline API for [clustering in mllib](mllib-clustering.html).
+This page describes clustering algorithms in MLlib.
+The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information
+about these algorithms.
**Table of Contents**
diff --git a/docs/ml-collaborative-filtering.md b/docs/ml-collaborative-filtering.md
index 8bd75f3bcf..1d02d6933c 100644
--- a/docs/ml-collaborative-filtering.md
+++ b/docs/ml-collaborative-filtering.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Collaborative Filtering - spark.ml
-displayTitle: Collaborative Filtering - spark.ml
+title: Collaborative Filtering
+displayTitle: Collaborative Filtering
---
* Table of contents
diff --git a/docs/ml-decision-tree.md b/docs/ml-decision-tree.md
index a721d55bc6..5e1eeb95e4 100644
--- a/docs/ml-decision-tree.md
+++ b/docs/ml-decision-tree.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Decision trees - spark.ml
-displayTitle: Decision trees - spark.ml
+title: Decision trees
+displayTitle: Decision trees
---
> This section has been moved into the
diff --git a/docs/ml-ensembles.md b/docs/ml-ensembles.md
index 303773e803..97f1bdc803 100644
--- a/docs/ml-ensembles.md
+++ b/docs/ml-ensembles.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Tree ensemble methods - spark.ml
-displayTitle: Tree ensemble methods - spark.ml
+title: Tree ensemble methods
+displayTitle: Tree ensemble methods
---
> This section has been moved into the
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 88fd291b4b..e7d7ddfe28 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Extracting, transforming and selecting features - spark.ml
-displayTitle: Extracting, transforming and selecting features - spark.ml
+title: Extracting, transforming and selecting features
+displayTitle: Extracting, transforming and selecting features
---
This section covers algorithms for working with features, roughly divided into these groups:
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index dae86d8480..5abec63b7a 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -1,323 +1,214 @@
---
layout: global
-title: "Overview: estimators, transformers and pipelines - spark.ml"
-displayTitle: "Overview: estimators, transformers and pipelines - spark.ml"
+title: "MLlib: Main Guide"
+displayTitle: "Machine Learning Library (MLlib) Guide"
---
+MLlib is Spark's machine learning (ML) library.
+Its goal is to make practical machine learning scalable and easy.
+At a high level, it provides tools such as:
-`\[
-\newcommand{\R}{\mathbb{R}}
-\newcommand{\E}{\mathbb{E}}
-\newcommand{\x}{\mathbf{x}}
-\newcommand{\y}{\mathbf{y}}
-\newcommand{\wv}{\mathbf{w}}
-\newcommand{\av}{\mathbf{\alpha}}
-\newcommand{\bv}{\mathbf{b}}
-\newcommand{\N}{\mathbb{N}}
-\newcommand{\id}{\mathbf{I}}
-\newcommand{\ind}{\mathbf{1}}
-\newcommand{\0}{\mathbf{0}}
-\newcommand{\unit}{\mathbf{e}}
-\newcommand{\one}{\mathbf{1}}
-\newcommand{\zero}{\mathbf{0}}
-\]`
+* ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
+* Featurization: feature extraction, transformation, dimensionality reduction, and selection
+* Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
+* Persistence: saving and loading algorithms, models, and Pipelines
+* Utilities: linear algebra, statistics, data handling, etc.
+# Announcement: DataFrame-based API is primary API
-The `spark.ml` package aims to provide a uniform set of high-level APIs built on top of
-[DataFrames](sql-programming-guide.html#dataframes) that help users create and tune practical
-machine learning pipelines.
-See the [algorithm guides](#algorithm-guides) section below for guides on sub-packages of
-`spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.
+**The MLlib RDD-based API is now in maintenance mode.**
-**Table of contents**
+As of Spark 2.0, the [RDD](programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in the `spark.mllib` package have entered maintenance mode.
+The primary Machine Learning API for Spark is now the [DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.
-* This will become a table of contents (this text will be scraped).
-{:toc}
+*What are the implications?*
+* MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
+* MLlib will not add new features to the RDD-based API.
+* In the Spark 2.x releases, MLlib will add features to the DataFrame-based API to reach feature parity with the RDD-based API.
+* After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
+* The RDD-based API is expected to be removed in Spark 3.0.
-# Main concepts in Pipelines
+*Why is MLlib switching to the DataFrame-based API?*
-Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple
-algorithms into a single pipeline, or workflow.
-This section covers the key concepts introduced by the Spark ML API, where the pipeline concept is
-mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
+* DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
+* The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
+* DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the [Pipelines guide](ml-pipeline.html) for details.
-* **[`DataFrame`](ml-guide.html#dataframe)**: Spark ML uses `DataFrame` from Spark SQL as an ML
- dataset, which can hold a variety of data types.
- E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.
+# Dependencies
-* **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
-E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.
+MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
+[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
+If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
+implementation will be used instead.
-* **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
-E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
+Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
+proxies by default.
+To configure `netlib-java` / Breeze to use system optimised binaries, include
+`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
+project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
+platform's additional installation instructions.
-* **[`Pipeline`](ml-guide.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.
+To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
-* **[`Parameter`](ml-guide.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
+[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
+ watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
-## DataFrame
+# Migration guide
-Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
-Spark ML adopts the `DataFrame` from Spark SQL in order to support a variety of data types.
+MLlib is under active development.
+The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
+and the migration guide below will explain all changes between releases.
-`DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-programming-guide.html#spark-sql-datatype-reference) for a list of supported types.
-In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML [`Vector`](mllib-data-types.html#local-vector) types.
+## From 1.6 to 2.0
-A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`. See the code examples below and the [Spark SQL programming guide](sql-programming-guide.html) for examples.
+### Breaking changes
-Columns in a `DataFrame` are named. The code examples below use names such as "text," "features," and "label."
+There were several breaking changes in Spark 2.0, which are outlined below.
-## Pipeline components
+**Linear algebra classes for DataFrame-based APIs**
-### Transformers
+Spark's linear algebra dependencies were moved to a new project, `mllib-local`
+(see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)).
+As part of this change, the linear algebra classes were copied to a new package, `spark.ml.linalg`.
+The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` classes,
+leading to a few breaking changes, predominantly in various model classes
+(see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for a full list).
-A `Transformer` is an abstraction that includes feature transformers and learned models.
-Technically, a `Transformer` implements a method `transform()`, which converts one `DataFrame` into
-another, generally by appending one or more columns.
-For example:
+**Note:** the RDD-based APIs in `spark.mllib` continue to depend on the previous package `spark.mllib.linalg`.
-* A feature transformer might take a `DataFrame`, read a column (e.g., text), map it into a new
- column (e.g., feature vectors), and output a new `DataFrame` with the mapped column appended.
-* A learning model might take a `DataFrame`, read the column containing feature vectors, predict the
- label for each feature vector, and output a new `DataFrame` with predicted labels appended as a
- column.
+_Converting vectors and matrices_
-### Estimators
+While most pipeline components support backward compatibility for loading,
+some existing `DataFrame`s and pipelines from Spark versions prior to 2.0 that contain vector or matrix
+columns may need to be migrated to the new `spark.ml` vector and matrix types.
+Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to `spark.ml.linalg` types
+(and vice versa) can be found in `spark.mllib.util.MLUtils`.
-An `Estimator` abstracts the concept of a learning algorithm or any algorithm that fits or trains on
-data.
-Technically, an `Estimator` implements a method `fit()`, which accepts a `DataFrame` and produces a
-`Model`, which is a `Transformer`.
-For example, a learning algorithm such as `LogisticRegression` is an `Estimator`, and calling
-`fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a `Transformer`.
-
-### Properties of pipeline components
-
-`Transformer.transform()`s and `Estimator.fit()`s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.
-
-Each instance of a `Transformer` or `Estimator` has a unique ID, which is useful in specifying parameters (discussed below).
-
-## Pipeline
-
-In machine learning, it is common to run a sequence of algorithms to process and learn from data.
-E.g., a simple text document processing workflow might include several stages:
-
-* Split each document's text into words.
-* Convert each document's words into a numerical feature vector.
-* Learn a prediction model using the feature vectors and labels.
-
-Spark ML represents such a workflow as a `Pipeline`, which consists of a sequence of
-`PipelineStage`s (`Transformer`s and `Estimator`s) to be run in a specific order.
-We will use this simple workflow as a running example in this section.
-
-### How it works
-
-A `Pipeline` is specified as a sequence of stages, and each stage is either a `Transformer` or an `Estimator`.
-These stages are run in order, and the input `DataFrame` is transformed as it passes through each stage.
-For `Transformer` stages, the `transform()` method is called on the `DataFrame`.
-For `Estimator` stages, the `fit()` method is called to produce a `Transformer` (which becomes part of the `PipelineModel`, or fitted `Pipeline`), and that `Transformer`'s `transform()` method is called on the `DataFrame`.
-
-We illustrate this for the simple text document workflow. The figure below is for the *training time* usage of a `Pipeline`.
-
-<p style="text-align: center;">
- <img
- src="img/ml-Pipeline.png"
- title="Spark ML Pipeline Example"
- alt="Spark ML Pipeline Example"
- width="80%"
- />
-</p>
-
-Above, the top row represents a `Pipeline` with three stages.
-The first two (`Tokenizer` and `HashingTF`) are `Transformer`s (blue), and the third (`LogisticRegression`) is an `Estimator` (red).
-The bottom row represents data flowing through the pipeline, where cylinders indicate `DataFrame`s.
-The `Pipeline.fit()` method is called on the original `DataFrame`, which has raw text documents and labels.
-The `Tokenizer.transform()` method splits the raw text documents into words, adding a new column with words to the `DataFrame`.
-The `HashingTF.transform()` method converts the words column into feature vectors, adding a new column with those vectors to the `DataFrame`.
-Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls `LogisticRegression.fit()` to produce a `LogisticRegressionModel`.
-If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s `transform()`
-method on the `DataFrame` before passing the `DataFrame` to the next stage.
-
-A `Pipeline` is an `Estimator`.
-Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a
-`Transformer`.
-This `PipelineModel` is used at *test time*; the figure below illustrates this usage.
-
-<p style="text-align: center;">
- <img
- src="img/ml-PipelineModel.png"
- title="Spark ML PipelineModel Example"
- alt="Spark ML PipelineModel Example"
- width="80%"
- />
-</p>
-
-In the figure above, the `PipelineModel` has the same number of stages as the original `Pipeline`, but all `Estimator`s in the original `Pipeline` have become `Transformer`s.
-When the `PipelineModel`'s `transform()` method is called on a test dataset, the data are passed
-through the fitted pipeline in order.
-Each stage's `transform()` method updates the dataset and passes it to the next stage.
-
-`Pipeline`s and `PipelineModel`s help to ensure that training and test data go through identical feature processing steps.
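The `Transformer`/`Estimator`/`Pipeline` contract removed from this page above can be sketched in a few lines of framework-free Python. This is illustrative only, not the Spark API; class names and the list-of-dicts "DataFrame" stand-in are hypothetical:

```python
class Tokenizer:
    def transform(self, rows):
        # Transformer: appends a "words" column to each row.
        return [dict(r, words=r["text"].split()) for r in rows]

class WordCountEstimator:
    def fit(self, rows):
        # Estimator: "training" just records the vocabulary,
        # then returns a fitted Transformer (the model).
        vocab = {w for r in rows for w in r["words"]}
        class Model:
            def transform(self, rows):
                return [dict(r, n_known=sum(w in vocab for w in r["words"]))
                        for r in rows]
        return Model()

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):        # Estimator stage: fit it first
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)     # pass transformed data onward
        return Pipeline(fitted)              # all stages now Transformers: the "PipelineModel"
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

model = Pipeline([Tokenizer(), WordCountEstimator()]).fit([{"text": "spark ml pipeline"}])
out = model.transform([{"text": "spark rocks"}])  # test data flows through identical stages
```

As in the text: `fit()` calls `fit()` on `Estimator` stages and `transform()` on everything, and the resulting `PipelineModel` applies the same feature processing to test data.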
-
-### Details
-
-*DAG `Pipeline`s*: A `Pipeline`'s stages are specified as an ordered array. The examples given here are all for linear `Pipeline`s, i.e., `Pipeline`s in which each stage uses data produced by the previous stage. It is possible to create non-linear `Pipeline`s as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the `Pipeline` forms a DAG, then the stages must be specified in topological order.
-
-*Runtime checking*: Since `Pipeline`s can operate on `DataFrame`s with varied types, they cannot use
-compile-time type checking.
-`Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`.
-This type checking is done using the `DataFrame` *schema*, a description of the data types of columns in the `DataFrame`.
-
-*Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances. E.g., the same instance
-`myHashingTF` should not be inserted into the `Pipeline` twice since `Pipeline` stages must have
-unique IDs. However, different instances `myHashingTF1` and `myHashingTF2` (both of type `HashingTF`)
-can be put into the same `Pipeline` since different instances will be created with different IDs.
-
-## Parameters
-
-Spark ML `Estimator`s and `Transformer`s use a uniform API for specifying parameters.
-
-A `Param` is a named parameter with self-contained documentation.
-A `ParamMap` is a set of (parameter, value) pairs.
-
-There are two main ways to pass parameters to an algorithm:
-
-1. Set parameters for an instance. E.g., if `lr` is an instance of `LogisticRegression`, one could
- call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations.
- This API resembles the API used in `spark.mllib` package.
-2. Pass a `ParamMap` to `fit()` or `transform()`. Any parameters in the `ParamMap` will override parameters previously specified via setter methods.
-
-Parameters belong to specific instances of `Estimator`s and `Transformer`s.
-For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
-This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
-
-## Saving and Loading Pipelines
-
-Often times it is worth it to save a model or a pipeline to disk for later use. In Spark 1.6, a model import/export functionality was added to the Pipeline API. Most basic transformers are supported as well as some of the more basic ML models. Please refer to the algorithm's API documentation to see if saving and loading is supported.
-
-# Code examples
-
-This section gives code examples illustrating the functionality discussed above.
-For more info, please refer to the API documentation
-([Scala](api/scala/index.html#org.apache.spark.ml.package),
-[Java](api/java/org/apache/spark/ml/package-summary.html),
-and [Python](api/python/pyspark.ml.html)).
-Some Spark ML algorithms are wrappers for `spark.mllib` algorithms, and the
-[MLlib programming guide](mllib-guide.html) has details on specific algorithms.
-
-## Example: Estimator, Transformer, and Param
-
-This example covers the concepts of `Estimator`, `Transformer`, and `Param`.
+There are also utility methods available for converting single instances of
+vectors and matrices. Use the `asML` method on an `mllib.linalg.Vector` / `mllib.linalg.Matrix`
+for converting to `ml.linalg` types, and
+`mllib.linalg.Vectors.fromML` / `mllib.linalg.Matrices.fromML`
+for converting to `mllib.linalg` types.
<div class="codetabs">
+<div data-lang="scala" markdown="1">
-<div data-lang="scala">
-{% include_example scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %}
-</div>
+{% highlight scala %}
+import org.apache.spark.mllib.util.MLUtils
-<div data-lang="java">
-{% include_example java/org/apache/spark/examples/ml/JavaEstimatorTransformerParamExample.java %}
-</div>
+// convert DataFrame columns
+val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
+val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
+// convert a single vector or matrix
+val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
+val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
+{% endhighlight %}
-<div data-lang="python">
-{% include_example python/ml/estimator_transformer_param_example.py %}
-</div>
-
-</div>
-
-## Example: Pipeline
-
-This example follows the simple text document `Pipeline` illustrated in the figures above.
-
-<div class="codetabs">
-
-<div data-lang="scala">
-{% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %}
-</div>
-
-<div data-lang="java">
-{% include_example java/org/apache/spark/examples/ml/JavaPipelineExample.java %}
-</div>
-
-<div data-lang="python">
-{% include_example python/ml/pipeline_example.py %}
-</div>
-
-</div>
-
-## Example: model selection via cross-validation
-
-An important task in ML is *model selection*, or using data to find the best model or parameters for a given task. This is also called *tuning*.
-`Pipeline`s facilitate model selection by making it easy to tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately.
-
-Currently, `spark.ml` supports model selection using the [`CrossValidator`](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) class, which takes an `Estimator`, a set of `ParamMap`s, and an [`Evaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.Evaluator).
-`CrossValidator` begins by splitting the dataset into a set of *folds* which are used as separate training and test datasets; e.g., with `$k=3$` folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
-`CrossValidator` iterates through the set of `ParamMap`s. For each `ParamMap`, it trains the given `Estimator` and evaluates it using the given `Evaluator`.
-
-The `Evaluator` can be a [`RegressionEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator)
-for regression problems, a [`BinaryClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator)
-for binary data, or a [`MulticlassClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator)
-for multiclass problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName`
-method in each of these evaluators.
-
-The `ParamMap` which produces the best evaluation metric (averaged over the `$k$` folds) is selected as the best model.
-`CrossValidator` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.
-
-The following example demonstrates using `CrossValidator` to select from a grid of parameters.
-To help construct the parameter grid, we use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility.
-
-Note that cross-validation over a grid of parameters is expensive.
-E.g., in the example below, the parameter grid has 3 values for `hashingTF.numFeatures` and 2 values for `lr.regParam`, and `CrossValidator` uses 2 folds. This multiplies out to `$(3 \times 2) \times 2 = 12$` different models being trained.
-In realistic settings, it can be common to try many more parameters and use more folds (`$k=3$` and `$k=10$` are common).
-In other words, using `CrossValidator` can be very expensive.
-However, it is also a well-established method for choosing parameters which is more statistically sound than heuristic hand-tuning.
-
-<div class="codetabs">
-
-<div data-lang="scala">
-{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala %}
-</div>
-
-<div data-lang="java">
-{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaCrossValidationExample.java %}
-</div>
-
-<div data-lang="python">
-
-{% include_example python/ml/cross_validator.py %}
-</div>
-
-</div>
-
-## Example: model selection via train validation split
-In addition to `CrossValidator` Spark also offers `TrainValidationSplit` for hyper-parameter tuning.
-`TrainValidationSplit` only evaluates each combination of parameters once, as opposed to k times in
- the case of `CrossValidator`. It is therefore less expensive,
- but will not produce as reliable results when the training dataset is not sufficiently large.
-
-`TrainValidationSplit` takes an `Estimator`, a set of `ParamMap`s provided in the `estimatorParamMaps` parameter,
-and an `Evaluator`.
-It begins by splitting the dataset into two parts using the `trainRatio` parameter
-which are used as separate training and test datasets. For example with `$trainRatio=0.75$` (default),
-`TrainValidationSplit` will generate a training and test dataset pair where 75% of the data is used for training and 25% for validation.
-Similar to `CrossValidator`, `TrainValidationSplit` also iterates through the set of `ParamMap`s.
-For each combination of parameters, it trains the given `Estimator` and evaluates it using the given `Evaluator`.
-The `ParamMap` which produces the best evaluation metric is selected as the best option.
-`TrainValidationSplit` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.
-
-<div class="codetabs">
-
-<div data-lang="scala" markdown="1">
-{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala %}
+Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further detail.
</div>
<div data-lang="java" markdown="1">
-{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaTrainValidationSplitExample.java %}
-</div>
-<div data-lang="python">
-{% include_example python/ml/train_validation_split.py %}
-</div>
+{% highlight java %}
+import org.apache.spark.mllib.util.MLUtils;
+import org.apache.spark.sql.Dataset;
+
+// convert DataFrame columns
+Dataset<Row> convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF);
+Dataset<Row> convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF);
+// convert a single vector or matrix
+org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML();
+org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML();
+{% endhighlight %}
+
+Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail.
+</div>
+
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+from pyspark.mllib.util import MLUtils
+
+# convert DataFrame columns
+convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
+convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
+# convert a single vector or matrix
+mlVec = mllibVec.asML()
+mlMat = mllibMat.asML()
+{% endhighlight %}
+
+Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further detail.
+</div>
+</div>
+
+**Deprecated methods removed**
+
+Several deprecated methods were removed in the `spark.mllib` and `spark.ml` packages:
+
+* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator`
+* `weights` in `LinearRegression` and `LogisticRegression` in `spark.ml`
+* `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as `DeveloperApi`)
+* `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these functions are available on `RDD`s directly, and were marked as `DeveloperApi`)
+* `defaultStategy` in `mllib.tree.configuration.Strategy`
+* `build` in `mllib.tree.Node`
+* libsvm loaders for multiclass and load/save labeledData methods in `mllib.util.MLUtils`
+
+A full list of breaking changes can be found at [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810).
+
+### Deprecations and changes of behavior
+
+**Deprecations**
+
+Deprecations in the `spark.mllib` and `spark.ml` packages include:
+
+* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
+ In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
+* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
+ In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
+  the `numTrees` parameter has been deprecated in favor of the `getNumTrees` method.
+* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
+ In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
+  All functionality in overridden methods has been moved to the corresponding `transformSchema` method.
+* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
+ In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
+  We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression` instead.
+* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
+ In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
+* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
+ In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`.
+* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`.
+
+**Changes of behavior**
+
+Changes of behavior in the `spark.mllib` and `spark.ml` packages include:
+
+* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
+  `spark.mllib.classification.LogisticRegressionWithLBFGS` now directly calls `spark.ml.classification.LogisticRegression` for binary classification.
+  This introduces the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
+    * The intercept will not be regularized when training a binary classification model with an L1/L2 `Updater`.
+    * Without regularization, training with or without feature scaling returns the same solution at the same convergence rate.
+* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
+  In order to provide better and more consistent results with `spark.ml.classification.LogisticRegression`,
+  the default value of `convergenceTol` in `spark.mllib.classification.LogisticRegressionWithLBFGS` has been changed from 1E-4 to 1E-6.
+* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
+  Fixed a bug in `PowerIterationClustering`, which will likely change its results.
+* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
+ `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
+* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
+ `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
+* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
+  `HashingTF` now uses `MurmurHash3` as the default hash algorithm in both `spark.ml` and `spark.mllib`.
+* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
+ The `expectedType` argument for PySpark `Param` was removed.
+* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
+ Some default `Param` values, which were mismatched between pipelines in Scala and Python, have been changed.
+* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
+  `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously it used custom sampling logic).
+  The output buckets will differ for the same input data and parameters.
+
+## Previous Spark versions
+
+Earlier migration guides are archived [on this page](ml-migration-guides.html).
-</div>
+---
diff --git a/docs/ml-linear-methods.md b/docs/ml-linear-methods.md
index a8754835ca..eb39173505 100644
--- a/docs/ml-linear-methods.md
+++ b/docs/ml-linear-methods.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Linear methods - spark.ml
-displayTitle: Linear methods - spark.ml
+title: Linear methods
+displayTitle: Linear methods
---
> This section has been moved into the
diff --git a/docs/ml-migration-guides.md b/docs/ml-migration-guides.md
new file mode 100644
index 0000000000..82bf9d7760
--- /dev/null
+++ b/docs/ml-migration-guides.md
@@ -0,0 +1,159 @@
+---
+layout: global
+title: Old Migration Guides - MLlib
+displayTitle: Old Migration Guides - MLlib
+description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
+---
+
+The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).
+
+## From 1.5 to 1.6
+
+There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
+deprecations and changes of behavior.
+
+Deprecations:
+
+* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
+ In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
+* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
+ In `spark.ml.classification.LogisticRegressionModel` and
+ `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
+ the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to
+ algorithms.
+
+Changes of behavior:
+
+* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
+ `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
+ Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
+ `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
+ previous error); for small errors (`< 0.01`), it uses absolute error.
+* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
+ `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
+ tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
+ behavior of the simpler `Tokenizer` transformer.
+
+## From 1.4 to 1.5
+
+In the `spark.mllib` package, there are no breaking API changes but several behavior changes:
+
+* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
+ `RegressionMetrics.explainedVariance` returns the average regression sum of squares.
+* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` are now
+  sorted.
+* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
+  convergence tolerance of `1e-3`, and hence iterations might end earlier than in 1.4.
+
+In the `spark.ml` package, there exists one breaking API change and one behavior change:
+
+* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
+ from `Params.setDefault` due to a
+ [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
+* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
+ added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.
+
+## From 1.3 to 1.4
+
+In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
+
+* Gradient-Boosted Trees
+  * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issue for users who wrote their own losses for GBTs.
+ * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
+* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
+
+In the `spark.ml` package, several major API changes occurred, including:
+
+* `Param` and other APIs for specifying parameters
+* `uid` unique IDs for Pipeline components
+* Reorganization of certain classes
+
+Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list all changes here.
+However, since 1.4 `spark.ml` is no longer an alpha component, we will provide details on any API
+changes for future releases.
+
+## From 1.2 to 1.3
+
+In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
+
+* *(Breaking change)* In [`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the extraneous method `solveLeastSquares` has been removed. The `DeveloperApi` method `analyzeBlocks` was also removed.
+* *(Breaking change)* [`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method. To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`.
+* *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD) remains an Experimental component. In it, there were two changes:
+ * The constructor taking arguments was removed in favor of a builder pattern using the default constructor plus parameter setter methods.
+ * Variable `model` is no longer public.
+* *(Breaking change)* [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) remains an Experimental component. In it and its associated classes, there were several changes:
+ * In `DecisionTree`, the deprecated class method `train` has been removed. (The object/static `train` methods remain.)
+ * In `Strategy`, the `checkpointDir` parameter has been removed. Checkpointing is still supported, but the checkpoint directory must be set before calling tree and tree ensemble training.
+* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was a public API but is now private, declared `private[python]`. This was never meant for external use.
+* In linear regression (including Lasso and ridge regression), the squared loss is now divided by 2.
+ So in order to produce the same result as in 1.2, the regularization parameter needs to be divided by 2 and the step size needs to be multiplied by 2.
+
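The scaling note above can be checked with a little arithmetic. Below is an illustrative sketch in plain Python (not Spark code, and all names are hypothetical): one gradient-descent step for a 1-D regularized squared loss, computed with 1.2-style settings and with the 1.3-style settings (loss halved, `regParam` halved, `stepSize` doubled), landing on the same weight.

```python
def grad_step(w, x, y, loss_scale, reg_param, step_size):
    """One gradient-descent step for: loss_scale * (w*x - y)**2 + reg_param * w**2."""
    grad = loss_scale * 2 * (w * x - y) * x + reg_param * 2 * w
    return w - step_size * grad

w0, x, y = 0.5, 2.0, 1.0
# 1.2 behavior: full squared loss, regParam = 0.1, stepSize = 0.05
w_12 = grad_step(w0, x, y, loss_scale=1.0, reg_param=0.1, step_size=0.05)
# 1.3 behavior: squared loss divided by 2; recover the 1.2 update with
# regParam / 2 and stepSize * 2
w_13 = grad_step(w0, x, y, loss_scale=0.5, reg_param=0.05, step_size=0.10)
assert abs(w_12 - w_13) < 1e-12  # identical update
```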
+In the `spark.ml` package, the main API changes are from Spark SQL. We list the most important changes here:
+
+* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in `spark.ml` which used to use SchemaRDD now use DataFrame.
+* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._`.
+* Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
+
+Other changes were in `LogisticRegression`:
+
+* The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
+* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.
+
+## From 1.1 to 1.2
+
+The only API changes in MLlib v1.2 are in
+[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+which continues to be an experimental API in MLlib 1.2:
+
+1. *(Breaking change)* The Scala API for classification takes a named argument specifying the number
+of classes. In MLlib v1.1, this argument was called `numClasses` in Python and
+`numClassesForClassification` in Scala. In MLlib v1.2, the names are both set to `numClasses`.
+This `numClasses` parameter is specified either via
+[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
+or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
+static `trainClassifier` and `trainRegressor` methods.
+
+2. *(Breaking change)* The API for
+[`Node`](api/scala/index.html#org.apache.spark.mllib.tree.model.Node) has changed.
+This should generally not affect user code, unless the user manually constructs decision trees
+(instead of using the `trainClassifier` or `trainRegressor` methods).
+The tree `Node` now includes more information, including the probability of the predicted label
+(for classification).
+
+3. Printing methods' output has changed. The `toString` (Scala/Java) and `__repr__` (Python) methods used to print the full model; they now print a summary. For the full model, use `toDebugString`.
+
+Examples in the Spark distribution and examples in the
+[Decision Trees Guide](mllib-decision-tree.html#examples) have been updated accordingly.
+
+## From 1.0 to 1.1
+
+The only API changes in MLlib v1.1 are in
+[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+which continues to be an experimental API in MLlib 1.1:
+
+1. *(Breaking change)* The meaning of tree depth has been changed by 1 in order to match
+the implementations of trees in
+[scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
+and in [rpart](http://cran.r-project.org/web/packages/rpart/index.html).
+In MLlib v1.0, a depth-1 tree had 1 leaf node, and a depth-2 tree had 1 root node and 2 leaf nodes.
+In MLlib v1.1, a depth-0 tree has 1 leaf node, and a depth-1 tree has 1 root node and 2 leaf nodes.
+This depth is specified by the `maxDepth` parameter in
+[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
+or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
+static `trainClassifier` and `trainRegressor` methods.
+
+2. *(Non-breaking change)* We recommend using the newly added `trainClassifier` and `trainRegressor`
+methods to build a [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+rather than using the old parameter class `Strategy`. These new training methods explicitly
+separate classification and regression, and they replace specialized parameter types with
+simple `String` types.
+
+Examples of the new, recommended `trainClassifier` and `trainRegressor` are given in the
+[Decision Trees Guide](mllib-decision-tree.html#examples).
+
+## From 0.9 to 1.0
+
+In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
+breaking changes. If your data is sparse, please store it in a sparse format instead of dense to
+take advantage of sparsity in both storage and computation. Details are described below.
+
diff --git a/docs/ml-pipeline.md b/docs/ml-pipeline.md
new file mode 100644
index 0000000000..adb057ba7e
--- /dev/null
+++ b/docs/ml-pipeline.md
@@ -0,0 +1,245 @@
+---
+layout: global
+title: ML Pipelines
+displayTitle: ML Pipelines
+---
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+In this section, we introduce the concept of ***ML Pipelines***.
+ML Pipelines provide a uniform set of high-level APIs built on top of
+[DataFrames](sql-programming-guide.html) that help users create and tune practical
+machine learning pipelines.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+# Main concepts in Pipelines
+
+MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple
+algorithms into a single pipeline, or workflow.
+This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is
+mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
+
+* **[`DataFrame`](ml-guide.html#dataframe)**: This ML API uses `DataFrame` from Spark SQL as an ML
+ dataset, which can hold a variety of data types.
+ E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.
+
+* **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
+E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.
+
+* **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
+E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
+
+* **[`Pipeline`](ml-guide.html#pipeline)**: A `Pipeline` chains multiple `Transformer`s and `Estimator`s together to specify an ML workflow.
+
+* **[`Parameter`](ml-guide.html#parameters)**: All `Transformer`s and `Estimator`s now share a common API for specifying parameters.
+
+## DataFrame
+
+Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data.
+This API adopts the `DataFrame` from Spark SQL in order to support a variety of data types.
+
+`DataFrame` supports many basic and structured types; see the [Spark SQL datatype reference](sql-programming-guide.html#spark-sql-datatype-reference) for a list of supported types.
+In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML [`Vector`](mllib-data-types.html#local-vector) types.
+
+A `DataFrame` can be created either implicitly or explicitly from a regular `RDD`. See the code examples below and the [Spark SQL programming guide](sql-programming-guide.html) for examples.
+
+Columns in a `DataFrame` are named. The code examples below use names such as "text", "features", and "label".
+
+## Pipeline components
+
+### Transformers
+
+A `Transformer` is an abstraction that includes feature transformers and learned models.
+Technically, a `Transformer` implements a method `transform()`, which converts one `DataFrame` into
+another, generally by appending one or more columns.
+For example:
+
+* A feature transformer might take a `DataFrame`, read a column (e.g., text), map it into a new
+ column (e.g., feature vectors), and output a new `DataFrame` with the mapped column appended.
+* A learning model might take a `DataFrame`, read the column containing feature vectors, predict the
+ label for each feature vector, and output a new `DataFrame` with predicted labels appended as a
+ column.
+
+### Estimators
+
+An `Estimator` abstracts the concept of a learning algorithm or any algorithm that fits or trains on
+data.
+Technically, an `Estimator` implements a method `fit()`, which accepts a `DataFrame` and produces a
+`Model`, which is a `Transformer`.
+For example, a learning algorithm such as `LogisticRegression` is an `Estimator`, and calling
+`fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a `Transformer`.
+
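The `fit()`/`transform()` contract can be sketched in a few lines of plain Python. This is illustrative only, not the Spark API; the class names are made up. The point is that an `Estimator`'s `fit()` computes something from the data and returns a fitted `Model`, which is itself a `Transformer`.

```python
class MeanCenterer:                      # plays the role of an Estimator
    def fit(self, rows):
        mean = sum(rows) / len(rows)     # "training" computes a statistic
        return MeanCentererModel(mean)   # fit() produces a Model

class MeanCentererModel:                 # plays the role of a Transformer
    def __init__(self, mean):
        self.mean = mean
    def transform(self, rows):
        return [r - self.mean for r in rows]  # derives new values from input

model = MeanCenterer().fit([1.0, 2.0, 3.0])
print(model.transform([4.0]))  # [2.0]
```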
+### Properties of pipeline components
+
+`Transformer.transform()`s and `Estimator.fit()`s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.
+
+Each instance of a `Transformer` or `Estimator` has a unique ID, which is useful in specifying parameters (discussed below).
+
+## Pipeline
+
+In machine learning, it is common to run a sequence of algorithms to process and learn from data.
+E.g., a simple text document processing workflow might include several stages:
+
+* Split each document's text into words.
+* Convert each document's words into a numerical feature vector.
+* Learn a prediction model using the feature vectors and labels.
+
+MLlib represents such a workflow as a `Pipeline`, which consists of a sequence of
+`PipelineStage`s (`Transformer`s and `Estimator`s) to be run in a specific order.
+We will use this simple workflow as a running example in this section.
+
+### How it works
+
+A `Pipeline` is specified as a sequence of stages, and each stage is either a `Transformer` or an `Estimator`.
+These stages are run in order, and the input `DataFrame` is transformed as it passes through each stage.
+For `Transformer` stages, the `transform()` method is called on the `DataFrame`.
+For `Estimator` stages, the `fit()` method is called to produce a `Transformer` (which becomes part of the `PipelineModel`, or fitted `Pipeline`), and that `Transformer`'s `transform()` method is called on the `DataFrame`.
+
+We illustrate this for the simple text document workflow. The figure below is for the *training time* usage of a `Pipeline`.
+
+<p style="text-align: center;">
+ <img
+ src="img/ml-Pipeline.png"
+ title="ML Pipeline Example"
+ alt="ML Pipeline Example"
+ width="80%"
+ />
+</p>
+
+Above, the top row represents a `Pipeline` with three stages.
+The first two (`Tokenizer` and `HashingTF`) are `Transformer`s (blue), and the third (`LogisticRegression`) is an `Estimator` (red).
+The bottom row represents data flowing through the pipeline, where cylinders indicate `DataFrame`s.
+The `Pipeline.fit()` method is called on the original `DataFrame`, which has raw text documents and labels.
+The `Tokenizer.transform()` method splits the raw text documents into words, adding a new column with words to the `DataFrame`.
+The `HashingTF.transform()` method converts the words column into feature vectors, adding a new column with those vectors to the `DataFrame`.
+Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls `LogisticRegression.fit()` to produce a `LogisticRegressionModel`.
+If the `Pipeline` had more stages, it would call the `LogisticRegressionModel`'s `transform()`
+method on the `DataFrame` before passing the `DataFrame` to the next stage.
+
+A `Pipeline` is an `Estimator`.
+Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, which is a
+`Transformer`.
+This `PipelineModel` is used at *test time*; the figure below illustrates this usage.
+
+<p style="text-align: center;">
+ <img
+ src="img/ml-PipelineModel.png"
+ title="ML PipelineModel Example"
+ alt="ML PipelineModel Example"
+ width="80%"
+ />
+</p>
+
+In the figure above, the `PipelineModel` has the same number of stages as the original `Pipeline`, but all `Estimator`s in the original `Pipeline` have become `Transformer`s.
+When the `PipelineModel`'s `transform()` method is called on a test dataset, the data are passed
+through the fitted pipeline in order.
+Each stage's `transform()` method updates the dataset and passes it to the next stage.
+
+`Pipeline`s and `PipelineModel`s help to ensure that training and test data go through identical feature processing steps.
+
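The fit-then-transform walk described above can be sketched in plain Python (again, not the Spark implementation; the stage classes are hypothetical): `fit()` applies `Transformer` stages directly, fits `Estimator` stages into models, and threads the data through so each stage sees its predecessor's output.

```python
class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):        # Estimator stage
                stage = stage.fit(data)      # produce a Transformer
            fitted.append(stage)
            data = stage.transform(data)     # pass data to the next stage
        return PipelineModel(fitted)

class PipelineModel:                         # the fitted Pipeline
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):               # test-time usage
        for stage in self.stages:
            data = stage.transform(data)
        return data

class Lowercase:                             # a Transformer stage
    def transform(self, docs):
        return [d.lower() for d in docs]

class VocabEstimator:                        # an Estimator stage
    def fit(self, docs):
        return VocabModel(sorted({w for d in docs for w in d.split()}))

class VocabModel:
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, docs):               # term-count feature vectors
        return [[d.split().count(w) for w in self.vocab] for d in docs]

model = Pipeline([Lowercase(), VocabEstimator()]).fit(["Spark ML", "ML Pipelines"])
print(model.transform(["spark pipelines"]))  # [[0, 1, 1]]
```

Because training and test data run through the same fitted stages, both sides get identical feature processing.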
+### Details
+
+*DAG `Pipeline`s*: A `Pipeline`'s stages are specified as an ordered array. The examples given here are all for linear `Pipeline`s, i.e., `Pipeline`s in which each stage uses data produced by the previous stage. It is possible to create non-linear `Pipeline`s as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the `Pipeline` forms a DAG, then the stages must be specified in topological order.
+
+*Runtime checking*: Since `Pipeline`s can operate on `DataFrame`s with varied types, they cannot use
+compile-time type checking.
+`Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`.
+This type checking is done using the `DataFrame` *schema*, a description of the data types of columns in the `DataFrame`.
+
+*Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances. E.g., the same instance
+`myHashingTF` should not be inserted into the `Pipeline` twice since `Pipeline` stages must have
+unique IDs. However, different instances `myHashingTF1` and `myHashingTF2` (both of type `HashingTF`)
+can be put into the same `Pipeline` since different instances will be created with different IDs.
+
+## Parameters
+
+MLlib `Estimator`s and `Transformer`s use a uniform API for specifying parameters.
+
+A `Param` is a named parameter with self-contained documentation.
+A `ParamMap` is a set of (parameter, value) pairs.
+
+There are two main ways to pass parameters to an algorithm:
+
+1. Set parameters for an instance. E.g., if `lr` is an instance of `LogisticRegression`, one could
+ call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations.
+   This API resembles the API used in the `spark.mllib` package.
+2. Pass a `ParamMap` to `fit()` or `transform()`. Any parameters in the `ParamMap` will override parameters previously specified via setter methods.
+
+Parameters belong to specific instances of `Estimator`s and `Transformer`s.
+For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
+This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.
+
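The parameter semantics above can be sketched in plain Python (illustrative only, not the Spark API; the toy class and its methods are made up): a `ParamMap` passed to `fit()` overrides values set via setters, and params are keyed per instance, so two instances can carry different values for the "same" parameter.

```python
class ToyLogisticRegression:
    _next_uid = 0
    def __init__(self):
        self.uid = ToyLogisticRegression._next_uid   # unique instance ID
        ToyLogisticRegression._next_uid += 1
        self.params = {"maxIter": 100}               # default values
    def set_max_iter(self, n):                       # way 1: instance setter
        self.params["maxIter"] = n
        return self
    def fit(self, data, param_map=None):             # way 2: ParamMap
        params = dict(self.params)
        for (uid, name), value in (param_map or {}).items():
            if uid == self.uid:                      # only this instance's params
                params[name] = value                 # ParamMap overrides setters
        return params                                # stand-in for a fitted model

lr1 = ToyLogisticRegression().set_max_iter(10)
lr2 = ToyLogisticRegression()
param_map = {(lr1.uid, "maxIter"): 30, (lr2.uid, "maxIter"): 20}
print(lr1.fit([], param_map)["maxIter"])  # 30 -- ParamMap wins over the setter
print(lr2.fit([], param_map)["maxIter"])  # 20
```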
+## Saving and Loading Pipelines
+
+It is often worthwhile to save a model or a pipeline to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API. Most basic transformers are supported, as are some of the more basic ML models. Please refer to an algorithm's API documentation to see whether saving and loading are supported.
+
+# Code examples
+
+This section gives code examples illustrating the functionality discussed above.
+For more info, please refer to the API documentation
+([Scala](api/scala/index.html#org.apache.spark.ml.package),
+[Java](api/java/org/apache/spark/ml/package-summary.html),
+and [Python](api/python/pyspark.ml.html)).
+
+## Example: Estimator, Transformer, and Param
+
+This example covers the concepts of `Estimator`, `Transformer`, and `Param`.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% include_example scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %}
+</div>
+
+<div data-lang="java">
+{% include_example java/org/apache/spark/examples/ml/JavaEstimatorTransformerParamExample.java %}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/estimator_transformer_param_example.py %}
+</div>
+
+</div>
+
+## Example: Pipeline
+
+This example follows the simple text document `Pipeline` illustrated in the figures above.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %}
+</div>
+
+<div data-lang="java">
+{% include_example java/org/apache/spark/examples/ml/JavaPipelineExample.java %}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/pipeline_example.py %}
+</div>
+
+</div>
+
+## Model selection (hyperparameter tuning)
+
+A big benefit of using ML Pipelines is hyperparameter optimization. See the [ML Tuning Guide](ml-tuning.html) for more information on automatic model selection.
diff --git a/docs/ml-survival-regression.md b/docs/ml-survival-regression.md
index 856ceb2f4e..efa3c21c7c 100644
--- a/docs/ml-survival-regression.md
+++ b/docs/ml-survival-regression.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Survival Regression - spark.ml
-displayTitle: Survival Regression - spark.ml
+title: Survival Regression
+displayTitle: Survival Regression
---
> This section has been moved into the
diff --git a/docs/ml-tuning.md b/docs/ml-tuning.md
new file mode 100644
index 0000000000..2ca90c7092
--- /dev/null
+++ b/docs/ml-tuning.md
@@ -0,0 +1,121 @@
+---
+layout: global
+title: "ML Tuning"
+displayTitle: "ML Tuning: model selection and hyperparameter tuning"
+---
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+This section describes how to use MLlib's tooling for tuning ML algorithms and Pipelines.
+Built-in Cross-Validation and other tooling allow users to optimize hyperparameters in algorithms and Pipelines.
+
+**Table of contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+# Model selection (a.k.a. hyperparameter tuning)
+
+An important task in ML is *model selection*, or using data to find the best model or parameters for a given task. This is also called *tuning*.
+Tuning may be done for individual `Estimator`s such as `LogisticRegression`, or for entire `Pipeline`s, which include multiple algorithms, featurization, and other steps. Users can tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately.
+
+MLlib supports model selection using tools such as [`CrossValidator`](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) and [`TrainValidationSplit`](api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit).
+These tools require the following items:
+
+* [`Estimator`](api/scala/index.html#org.apache.spark.ml.Estimator): algorithm or `Pipeline` to tune
+* Set of `ParamMap`s: parameters to choose from, sometimes called a "parameter grid" to search over
+* [`Evaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.Evaluator): metric to measure how well a fitted `Model` does on held-out test data
+
+At a high level, these model selection tools work as follows:
+
+* They split the input data into separate training and test datasets.
+* For each (training, test) pair, they iterate through the set of `ParamMap`s:
+ * For each `ParamMap`, they fit the `Estimator` using those parameters, get the fitted `Model`, and evaluate the `Model`'s performance using the `Evaluator`.
+* They select the `Model` produced by the best-performing set of parameters.
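The loop above can be sketched in a few lines of plain Python; `fit` and `evaluate` below are hypothetical callables standing in for the `Estimator` and `Evaluator`, not MLlib APIs:

```python
# Generic model-selection loop: fit one model per ParamMap per
# (training, test) pair, average the metric, keep the best params.
# `fit` and `evaluate` are hypothetical stand-ins, not MLlib APIs.

def select_model(splits, param_maps, fit, evaluate):
    best_params, best_score = None, float("-inf")
    for params in param_maps:
        # Average the evaluation metric over all (training, test) pairs.
        scores = [evaluate(fit(train, params), test)
                  for train, test in splits]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score

# Toy usage: "fit" just returns a slope, and "evaluate" is negative
# error against a target slope of 2.0 (higher is better).
splits = [([1, 2], [3]), ([2, 3], [1])]        # fake fold pairs
grid = [{"slope": s} for s in (1.0, 2.0, 3.0)]
fit = lambda train, p: p["slope"]
evaluate = lambda model, test: -abs(model - 2.0)
best, score = select_model(splits, grid, fit, evaluate)
print(best)   # {'slope': 2.0}
```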
+
+The `Evaluator` can be a [`RegressionEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator)
+for regression problems, a [`BinaryClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator)
+for binary data, or a [`MulticlassClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator)
+for multiclass problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName`
+method in each of these evaluators.
+
+To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility.
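Conceptually, the builder enumerates the Cartesian product of the candidate values for each parameter, producing one `ParamMap` per combination. A plain-Python equivalent, using strings in place of real `Param` objects:

```python
import itertools

# One dict per combination of candidate values -- the idea behind
# ParamGridBuilder.build(). Parameter names are plain strings here,
# not real spark.ml Param objects.

def build_grid(**candidates):
    names = list(candidates)
    return [dict(zip(names, combo))
            for combo in itertools.product(*candidates.values())]

grid = build_grid(numFeatures=[10, 100, 1000], regParam=[0.1, 0.01])
print(len(grid))  # 6 combinations: 3 x 2
```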
+
+# Cross-Validation
+
+`CrossValidator` begins by splitting the dataset into a set of *folds* which are used as separate training and test datasets. E.g., with `$k=3$` folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular `ParamMap`, `CrossValidator` computes the average evaluation metric for the 3 `Model`s produced by fitting the `Estimator` on the 3 different (training, test) dataset pairs.
+
+After identifying the best `ParamMap`, `CrossValidator` finally re-fits the `Estimator` using the best `ParamMap` and the entire dataset.
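The fold construction can be sketched as follows. This is a simplified illustration over a plain Python list; the real `CrossValidator` splits a distributed dataset, and this sketch omits shuffling:

```python
# Generate k (training, test) pairs from a dataset, as
# CrossValidator does conceptually: each fold serves once as the
# test set while the remaining folds form the training set.

def k_fold_pairs(data, k):
    fold_size = len(data) // k
    pairs = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        pairs.append((train, test))
    return pairs

data = list(range(9))
for train, test in k_fold_pairs(data, 3):
    print(len(train), len(test))   # 6 3 for each of the 3 pairs
```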
+
+## Example: model selection via cross-validation
+
+The following example demonstrates using `CrossValidator` to select from a grid of parameters.
+
+Note that cross-validation over a grid of parameters is expensive.
+E.g., in the example below, the parameter grid has 3 values for `hashingTF.numFeatures` and 2 values for `lr.regParam`, and `CrossValidator` uses 2 folds. This multiplies out to `$(3 \times 2) \times 2 = 12$` different models being trained.
+In realistic settings, it is common to try many more parameters and use more folds (`$k=3$` and `$k=10$` are typical).
+In other words, using `CrossValidator` can be very expensive.
+However, it is also a well-established method for choosing parameters, and it is more statistically sound than heuristic hand-tuning.
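The model count above is simply the grid size multiplied by the number of folds, which can be checked directly:

```python
# Number of models trained = product of candidate counts per
# parameter, times the number of cross-validation folds.
num_features_candidates = 3   # values tried for hashingTF.numFeatures
reg_param_candidates = 2      # values tried for lr.regParam
num_folds = 2

models_trained = num_features_candidates * reg_param_candidates * num_folds
print(models_trained)  # 12
```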
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala %}
+</div>
+
+<div data-lang="java">
+{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaCrossValidationExample.java %}
+</div>
+
+<div data-lang="python">
+
+{% include_example python/ml/cross_validator.py %}
+</div>
+
+</div>
+
+# Train-Validation Split
+
+In addition to `CrossValidator`, Spark also offers `TrainValidationSplit` for hyperparameter tuning.
+`TrainValidationSplit` evaluates each combination of parameters only once, as opposed to k times in
+the case of `CrossValidator`. It is therefore less expensive,
+but it will not produce as reliable results when the training dataset is not sufficiently large.
+
+Unlike `CrossValidator`, `TrainValidationSplit` creates a single (training, test) dataset pair.
+It splits the dataset into these two parts using the `trainRatio` parameter. For example, with `$trainRatio=0.75$`,
+`TrainValidationSplit` will generate a training and test dataset pair where 75% of the data is used for training and 25% for validation.
+
+Like `CrossValidator`, `TrainValidationSplit` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.
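For illustration, the single split can be sketched over a plain Python list (no shuffling; the real `TrainValidationSplit` operates on distributed datasets):

```python
# One (training, validation) pair derived from trainRatio, as
# TrainValidationSplit does conceptually: the first train_ratio
# fraction trains the model, the remainder validates it.

def train_validation_split(data, train_ratio):
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]

train, validation = train_validation_split(list(range(100)), 0.75)
print(len(train), len(validation))  # 75 25
```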
+
+## Example: model selection via train-validation split
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaTrainValidationSplitExample.java %}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/train_validation_split.py %}
+</div>
+
+</div>
diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md
index aaf8bd465c..a7b90de093 100644
--- a/docs/mllib-classification-regression.md
+++ b/docs/mllib-classification-regression.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Classification and Regression - spark.mllib
-displayTitle: Classification and Regression - spark.mllib
+title: Classification and Regression - RDD-based API
+displayTitle: Classification and Regression - RDD-based API
---
The `spark.mllib` package supports various methods for
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 073927c30b..d5f6ae379a 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Clustering - spark.mllib
-displayTitle: Clustering - spark.mllib
+title: Clustering - RDD-based API
+displayTitle: Clustering - RDD-based API
---
[Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) is an unsupervised learning problem whereby we aim to group subsets
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index 5c33292aaf..0f891a09a6 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Collaborative Filtering - spark.mllib
-displayTitle: Collaborative Filtering - spark.mllib
+title: Collaborative Filtering - RDD-based API
+displayTitle: Collaborative Filtering - RDD-based API
---
* Table of contents
diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md
index ef56aebbc3..7dd3c97a83 100644
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Data Types - MLlib
-displayTitle: Data Types - MLlib
+title: Data Types - RDD-based API
+displayTitle: Data Types - RDD-based API
---
* Table of contents
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 11f5de1fc9..0e753b8dd0 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Decision Trees - spark.mllib
-displayTitle: Decision Trees - spark.mllib
+title: Decision Trees - RDD-based API
+displayTitle: Decision Trees - RDD-based API
---
* Table of contents
diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md
index cceddce9f7..539cbc1b31 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Dimensionality Reduction - spark.mllib
-displayTitle: Dimensionality Reduction - spark.mllib
+title: Dimensionality Reduction - RDD-based API
+displayTitle: Dimensionality Reduction - RDD-based API
---
* Table of contents
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
index 5543262a89..e1984b6c8d 100644
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Ensembles - spark.mllib
-displayTitle: Ensembles - spark.mllib
+title: Ensembles - RDD-based API
+displayTitle: Ensembles - RDD-based API
---
* Table of contents
diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md
index c49bc4ff12..ac82f43cfb 100644
--- a/docs/mllib-evaluation-metrics.md
+++ b/docs/mllib-evaluation-metrics.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Evaluation Metrics - spark.mllib
-displayTitle: Evaluation Metrics - spark.mllib
+title: Evaluation Metrics - RDD-based API
+displayTitle: Evaluation Metrics - RDD-based API
---
* Table of contents
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 67c033e9e4..867be7f293 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Feature Extraction and Transformation - spark.mllib
-displayTitle: Feature Extraction and Transformation - spark.mllib
+title: Feature Extraction and Transformation - RDD-based API
+displayTitle: Feature Extraction and Transformation - RDD-based API
---
* Table of contents
diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md
index a7b55dc5e5..93e3f0b2d2 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Frequent Pattern Mining - spark.mllib
-displayTitle: Frequent Pattern Mining - spark.mllib
+title: Frequent Pattern Mining - RDD-based API
+displayTitle: Frequent Pattern Mining - RDD-based API
---
Mining frequent items, itemsets, subsequences, or other substructures is usually among the
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 17fd3e1edf..30112c72c9 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -1,32 +1,12 @@
---
layout: global
-title: MLlib
-displayTitle: Machine Learning Library (MLlib) Guide
-description: MLlib machine learning library overview for Spark SPARK_VERSION_SHORT
+title: "MLlib: RDD-based API"
+displayTitle: "MLlib: RDD-based API"
---
-MLlib is Spark's machine learning (ML) library.
-Its goal is to make practical machine learning scalable and easy.
-It consists of common learning algorithms and utilities, including classification, regression,
-clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization
-primitives and higher-level pipeline APIs.
-
-It divides into two packages:
-
-* [`spark.mllib`](mllib-guide.html#data-types-algorithms-and-utilities) contains the original API
- built on top of [RDDs](programming-guide.html#resilient-distributed-datasets-rdds).
-* [`spark.ml`](ml-guide.html) provides higher-level API
- built on top of [DataFrames](sql-programming-guide.html#dataframes) for constructing ML pipelines.
-
-Using `spark.ml` is recommended because with DataFrames the API is more versatile and flexible.
-But we will keep supporting `spark.mllib` along with the development of `spark.ml`.
-Users should be comfortable using `spark.mllib` features and expect more features coming.
-Developers should contribute new algorithms to `spark.ml` if they fit the ML pipeline concept well,
-e.g., feature extractors and transformers.
-
-We list major functionality from both below, with links to detailed guides.
-
-# spark.mllib: data types, algorithms, and utilities
+This page documents sections of the MLlib guide for the RDD-based API (the `spark.mllib` package).
+Please see the [MLlib Main Guide](ml-guide.html) for the DataFrame-based API (the `spark.ml` package),
+which is now the primary API for MLlib.
* [Data types](mllib-data-types.html)
* [Basic statistics](mllib-statistics.html)
@@ -65,192 +45,3 @@ We list major functionality from both below, with links to detailed guides.
* [stochastic gradient descent](mllib-optimization.html#stochastic-gradient-descent-sgd)
* [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs)
-# spark.ml: high-level APIs for ML pipelines
-
-* [Overview: estimators, transformers and pipelines](ml-guide.html)
-* [Extracting, transforming and selecting features](ml-features.html)
-* [Classification and regression](ml-classification-regression.html)
-* [Clustering](ml-clustering.html)
-* [Collaborative filtering](ml-collaborative-filtering.html)
-* [Advanced topics](ml-advanced.html)
-
-Some techniques are not available yet in spark.ml, most notably dimensionality reduction
-Users can seamlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`.
-
-# Dependencies
-
-MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
-[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
-If natives libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
-implementation will be used instead.
-
-Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
-proxies by default.
-To configure `netlib-java` / Breeze to use system optimised binaries, include
-`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
-project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
-platform's additional installation instructions.
-
-To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
-
-[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
- watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
-
-# Migration guide
-
-MLlib is under active development.
-The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
-and the migration guide below will explain all changes between releases.
-
-## From 1.6 to 2.0
-
-### Breaking changes
-
-There were several breaking changes in Spark 2.0, which are outlined below.
-
-**Linear algebra classes for DataFrame-based APIs**
-
-Spark's linear algebra dependencies were moved to a new project, `mllib-local`
-(see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)).
-As part of this change, the linear algebra classes were copied to a new package, `spark.ml.linalg`.
-The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` classes,
-leading to a few breaking changes, predominantly in various model classes
-(see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for a full list).
-
-**Note:** the RDD-based APIs in `spark.mllib` continue to depend on the previous package `spark.mllib.linalg`.
-
-_Converting vectors and matrices_
-
-While most pipeline components support backward compatibility for loading,
-some existing `DataFrames` and pipelines in Spark versions prior to 2.0, that contain vector or matrix
-columns, may need to be migrated to the new `spark.ml` vector and matrix types.
-Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to `spark.ml.linalg` types
-(and vice versa) can be found in `spark.mllib.util.MLUtils`.
-
-There are also utility methods available for converting single instances of
-vectors and matrices. Use the `asML` method on a `mllib.linalg.Vector` / `mllib.linalg.Matrix`
-for converting to `ml.linalg` types, and
-`mllib.linalg.Vectors.fromML` / `mllib.linalg.Matrices.fromML`
-for converting to `mllib.linalg` types.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-{% highlight scala %}
-import org.apache.spark.mllib.util.MLUtils
-
-// convert DataFrame columns
-val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
-val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
-// convert a single vector or matrix
-val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
-val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
-{% endhighlight %}
-
-Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further detail.
-</div>
-
-<div data-lang="java" markdown="1">
-
-{% highlight java %}
-import org.apache.spark.mllib.util.MLUtils;
-import org.apache.spark.sql.Dataset;
-
-// convert DataFrame columns
-Dataset<Row> convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF);
-Dataset<Row> convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF);
-// convert a single vector or matrix
-org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML();
-org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML();
-{% endhighlight %}
-
-Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail.
-</div>
-
-<div data-lang="python" markdown="1">
-
-{% highlight python %}
-from pyspark.mllib.util import MLUtils
-
-# convert DataFrame columns
-convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
-convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
-# convert a single vector or matrix
-mlVec = mllibVec.asML()
-mlMat = mllibMat.asML()
-{% endhighlight %}
-
-Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further detail.
-</div>
-</div>
-
-**Deprecated methods removed**
-
-Several deprecated methods were removed in the `spark.mllib` and `spark.ml` packages:
-
-* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator`
-* `weights` in `LinearRegression` and `LogisticRegression` in `spark.ml`
-* `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as `DeveloperApi`)
-* `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these functions are available on `RDD`s directly, and were marked as `DeveloperApi`)
-* `defaultStategy` in `mllib.tree.configuration.Strategy`
-* `build` in `mllib.tree.Node`
-* libsvm loaders for multiclass and load/save labeledData methods in `mllib.util.MLUtils`
-
-A full list of breaking changes can be found at [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810).
-
-### Deprecations and changes of behavior
-
-**Deprecations**
-
-Deprecations in the `spark.mllib` and `spark.ml` packages include:
-
-* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
- In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
-* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
- In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
- the `numTrees` parameter has been deprecated in favor of `getNumTrees` method.
-* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
- In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
- We move all functionality in overridden methods to the corresponding `transformSchema`.
-* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
- In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
- We encourage users to use `spark.ml.regression.LinearRegresson` and `spark.ml.classification.LogisticRegresson`.
-* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
- In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
-* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
- In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`.
-* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`.
-
-**Changes of behavior**
-
-Changes of behavior in the `spark.mllib` and `spark.ml` packages include:
-
-* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
- `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegresson` for binary classification now.
- This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
- * The intercept will not be regularized when training binary classification model with L1/L2 Updater.
- * If users set without regularization, training with or without feature scaling will return the same solution by the same convergence rate.
-* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
- In order to provide better and consistent result with `spark.ml.classification.LogisticRegresson`,
- the default value of `spark.mllib.classification.LogisticRegressionWithLBFGS`: `convergenceTol` has been changed from 1E-4 to 1E-6.
-* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
- Fix a bug of `PowerIterationClustering` which will likely change its result.
-* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
- `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
-* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
- `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
-* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
- `HashingTF` uses `MurmurHash3` as default hash algorithm in both `spark.ml` and `spark.mllib`.
-* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
- The `expectedType` argument for PySpark `Param` was removed.
-* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
- Some default `Param` values, which were mismatched between pipelines in Scala and Python, have been changed.
-* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
- `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic).
- The output buckets will differ for same input data and params.
-
-## Previous Spark versions
-
-Earlier migration guides are archived [on this page](mllib-migration-guides.html).
-
----
diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md
index 8ede4407d5..d90905a86a 100644
--- a/docs/mllib-isotonic-regression.md
+++ b/docs/mllib-isotonic-regression.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Isotonic regression - spark.mllib
-displayTitle: Regression - spark.mllib
+title: Isotonic regression - RDD-based API
+displayTitle: Regression - RDD-based API
---
## Isotonic regression
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 17d781ac23..6fcd3ae857 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Linear Methods - spark.mllib
-displayTitle: Linear Methods - spark.mllib
+title: Linear Methods - RDD-based API
+displayTitle: Linear Methods - RDD-based API
---
* Table of contents
diff --git a/docs/mllib-migration-guides.md b/docs/mllib-migration-guides.md
index 970c6697f4..ea6f93fcf6 100644
--- a/docs/mllib-migration-guides.md
+++ b/docs/mllib-migration-guides.md
@@ -1,159 +1,9 @@
---
layout: global
-title: Old Migration Guides - spark.mllib
-displayTitle: Old Migration Guides - spark.mllib
-description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
+title: Old Migration Guides - MLlib
+displayTitle: Old Migration Guides - MLlib
---
-The migration guide for the current Spark version is kept on the [MLlib Programming Guide main page](mllib-guide.html#migration-guide).
-
-## From 1.5 to 1.6
-
-There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
-deprecations and changes of behavior.
-
-Deprecations:
-
-* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
- In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
-* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
- In `spark.ml.classification.LogisticRegressionModel` and
- `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
- the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to
- algorithms.
-
-Changes of behavior:
-
-* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
- `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
- Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
- `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
- previous error); for small errors (`< 0.01`), it uses absolute error.
-* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
- `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
- tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
- behavior of the simpler `Tokenizer` transformer.
-
-## From 1.4 to 1.5
-
-In the `spark.mllib` package, there are no breaking API changes but several behavior changes:
-
-* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
- `RegressionMetrics.explainedVariance` returns the average regression sum of squares.
-* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` become
- sorted.
-* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
- convergence tolerance `1e-3`, and hence iterations might end earlier than 1.4.
-
-In the `spark.ml` package, there exists one breaking API change and one behavior change:
-
-* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
- from `Params.setDefault` due to a
- [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
-* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
- added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.
-
-## From 1.3 to 1.4
-
-In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
-
-* Gradient-Boosted Trees
- * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issues for users who wrote their own losses for GBTs.
- * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
-* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
-
-In the `spark.ml` package, several major API changes occurred, including:
-
-* `Param` and other APIs for specifying parameters
-* `uid` unique IDs for Pipeline components
-* Reorganization of certain classes
-
-Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list all changes here.
-However, since 1.4 `spark.ml` is no longer an alpha component, we will provide details on any API
-changes for future releases.
-
-## From 1.2 to 1.3
-
-In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
-
-* *(Breaking change)* In [`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the extraneous method `solveLeastSquares` has been removed. The `DeveloperApi` method `analyzeBlocks` was also removed.
-* *(Breaking change)* [`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method. To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`.
-* *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD) remains an Experimental component. In it, there were two changes:
- * The constructor taking arguments was removed in favor of a builder pattern using the default constructor plus parameter setter methods.
- * Variable `model` is no longer public.
-* *(Breaking change)* [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) remains an Experimental component. In it and its associated classes, there were several changes:
- * In `DecisionTree`, the deprecated class method `train` has been removed. (The object/static `train` methods remain.)
- * In `Strategy`, the `checkpointDir` parameter has been removed. Checkpointing is still supported, but the checkpoint directory must be set before calling tree and tree ensemble training.
-* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was a public API but is now private, declared `private[python]`. This was never meant for external use.
-* In linear regression (including Lasso and ridge regression), the squared loss is now divided by 2.
- So in order to produce the same result as in 1.2, the regularization parameter needs to be divided by 2 and the step size needs to be multiplied by 2.
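Two of the numeric changes above can be checked with plain Python, no Spark required (the function and variable names here are illustrative, not MLlib APIs): squaring `std` recovers the old `variance` values, and halving the squared loss is exactly offset by doubling the step size and halving the regularization parameter.

```python
import statistics

# (1) StandardScalerModel: the old `variance` is just the new `std` squared.
xs = [1.0, 2.0, 4.0]
assert abs(statistics.pstdev(xs) ** 2 - statistics.pvariance(xs)) < 1e-12

# (2) A hypothetical single-feature SGD step showing the loss rescaling.
def sgd_step(w, x, y, step, reg, half_loss):
    # Gradient of the (optionally halved) squared loss plus an L2 penalty.
    scale = 0.5 if half_loss else 1.0
    grad = scale * 2.0 * (w * x - y) * x + reg * w
    return w - step * grad

w, x, y = 0.5, 2.0, 0.3
# 1.2 convention: full squared loss, step size 0.1, regularization 0.4.
w_12 = sgd_step(w, x, y, step=0.1, reg=0.4, half_loss=False)
# 1.3 convention: halved loss; step multiplied by 2, regularization divided by 2.
w_13 = sgd_step(w, x, y, step=0.2, reg=0.2, half_loss=True)
assert abs(w_12 - w_13) < 1e-12  # identical update
```

The identity holds for every step, not just one: `2s * (r/2 * w + g/2)` equals `s * (r * w + g)` term by term, so the two parameterizations trace the same optimization path.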
-
-In the `spark.ml` package, the main API changes are from Spark SQL. We list the most important changes here:
-
-* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in Spark ML which used to use SchemaRDD now use DataFrame.
-* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._`.
-* Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
-
-Other changes were in `LogisticRegression`:
-
-* The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
-* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.
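The relationship between the old and new output columns can be sketched in plain Python (using a list in place of MLlib's `Vector`; the variable names are illustrative): the old `score` Double is simply the class-1.0 entry of the new per-class `probability` vector.

```python
# New representation: one probability per class, summing to 1.
probability = [0.3, 0.7]      # [P(class 0.0), P(class 1.0)] for a binary model

# Old representation: a single Double, the probability of class 1.0.
score = probability[1]

assert abs(sum(probability) - 1.0) < 1e-12
assert score == probability[1]
```

Keeping a full vector rather than a single Double is what allows the same column to support multiclass classification later without another schema change.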
-
-## From 1.1 to 1.2
-
-The only API changes in MLlib v1.2 are in
-[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
-which continues to be an experimental API in MLlib 1.2:
-
-1. *(Breaking change)* The Scala API for classification takes a named argument specifying the number
-of classes. In MLlib v1.1, this argument was called `numClasses` in Python and
-`numClassesForClassification` in Scala. In MLlib v1.2, the names are both set to `numClasses`.
-This `numClasses` parameter is specified either via
-[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
-or via the static `trainClassifier` and `trainRegressor` methods of
-[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree).
-
-2. *(Breaking change)* The API for
-[`Node`](api/scala/index.html#org.apache.spark.mllib.tree.model.Node) has changed.
-This should generally not affect user code, unless the user manually constructs decision trees
-(instead of using the `trainClassifier` or `trainRegressor` methods).
-The tree `Node` now includes more information, including the probability of the predicted label
-(for classification).
-
-3. Printing methods' output has changed. The `toString` (Scala/Java) and `__repr__` (Python) methods used to print the full model; they now print a summary. For the full model, use `toDebugString`.
-
-Examples in the Spark distribution and examples in the
-[Decision Trees Guide](mllib-decision-tree.html#examples) have been updated accordingly.
-
-## From 1.0 to 1.1
-
-The only API changes in MLlib v1.1 are in
-[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
-which continues to be an experimental API in MLlib 1.1:
-
-1. *(Breaking change)* Tree depth is now counted starting from 0 rather than 1, in order to match
-the tree implementations in
-[scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
-and in [rpart](http://cran.r-project.org/web/packages/rpart/index.html).
-In MLlib v1.0, a depth-1 tree had 1 leaf node, and a depth-2 tree had 1 root node and 2 leaf nodes.
-In MLlib v1.1, a depth-0 tree has 1 leaf node, and a depth-1 tree has 1 root node and 2 leaf nodes.
-This depth is specified by the `maxDepth` parameter in
-[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
-or via the static `trainClassifier` and `trainRegressor` methods of
-[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree).
-
-2. *(Non-breaking change)* We recommend using the newly added `trainClassifier` and `trainRegressor`
-methods to build a [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
-rather than using the old parameter class `Strategy`. These new training methods explicitly
-separate classification and regression, and they replace specialized parameter types with
-simple `String` types.
-
-Examples of the new, recommended `trainClassifier` and `trainRegressor` are given in the
-[Decision Trees Guide](mllib-decision-tree.html#examples).
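The off-by-one depth shift in item 1 above can be summarized numerically (these helpers are illustrative, not Spark APIs): a full binary tree described by `maxDepth = d` in v1.0 is the same shape as one described by `maxDepth = d - 1` in v1.1.

```python
# Maximum leaf count of a full binary tree under each depth convention.
def max_leaves_v10(depth):
    return 2 ** (depth - 1)   # v1.0: a depth-1 tree has 1 leaf node

def max_leaves_v11(depth):
    return 2 ** depth         # v1.1: a depth-0 tree has 1 leaf node

# The same tree shape is maxDepth = d in v1.0 and maxDepth = d - 1 in v1.1.
for d in range(1, 6):
    assert max_leaves_v10(d) == max_leaves_v11(d - 1)
```

So code migrating from v1.0 that wants identical trees should decrease its `maxDepth` setting by 1.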
-
-## From 0.9 to 1.0
-
-In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
-breaking changes. If your data is sparse, please store it in a sparse format instead of dense to
-take advantage of sparsity in both storage and computation. Details are described below.
+The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).
+Past migration guides are now stored at [ml-migration-guides.html](ml-migration-guides.html).
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
index d0d594af6a..7471d18a0d 100644
--- a/docs/mllib-naive-bayes.md
+++ b/docs/mllib-naive-bayes.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Naive Bayes - spark.mllib
-displayTitle: Naive Bayes - spark.mllib
+title: Naive Bayes - RDD-based API
+displayTitle: Naive Bayes - RDD-based API
---
[Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple
diff --git a/docs/mllib-optimization.md b/docs/mllib-optimization.md
index f90b66f8e2..eefd7dcf11 100644
--- a/docs/mllib-optimization.md
+++ b/docs/mllib-optimization.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Optimization - spark.mllib
-displayTitle: Optimization - spark.mllib
+title: Optimization - RDD-based API
+displayTitle: Optimization - RDD-based API
---
* Table of contents
diff --git a/docs/mllib-pmml-model-export.md b/docs/mllib-pmml-model-export.md
index 7f2347dc0b..d353090870 100644
--- a/docs/mllib-pmml-model-export.md
+++ b/docs/mllib-pmml-model-export.md
@@ -1,7 +1,7 @@
---
layout: global
-title: PMML model export - spark.mllib
-displayTitle: PMML model export - spark.mllib
+title: PMML model export - RDD-based API
+displayTitle: PMML model export - RDD-based API
---
* Table of contents
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index 329855e565..12797bd868 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Basic Statistics - spark.mllib
-displayTitle: Basic Statistics - spark.mllib
+title: Basic Statistics - RDD-based API
+displayTitle: Basic Statistics - RDD-based API
---
* Table of contents
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 2bc49120a0..888c12f186 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1571,7 +1571,7 @@ have changed from returning (key, list of values) pairs to (key, iterable of val
</div>
Migration guides are also available for [Spark Streaming](streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x),
-[MLlib](mllib-guide.html#migration-guide) and [GraphX](graphx-programming-guide.html#migrating-from-spark-091).
+[MLlib](ml-guide.html#migration-guide) and [GraphX](graphx-programming-guide.html#migrating-from-spark-091).
# Where to Go from Here
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 2ee3b80185..de82a064d1 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -15,7 +15,7 @@ like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex
algorithms expressed with high-level functions like `map`, `reduce`, `join` and `window`.
Finally, processed data can be pushed out to filesystems, databases,
and live dashboards. In fact, you can apply Spark's
-[machine learning](mllib-guide.html) and
+[machine learning](ml-guide.html) and
[graph processing](graphx-programming-guide.html) algorithms on data streams.
<p style="text-align: center;">
@@ -1673,7 +1673,7 @@ See the [DataFrames and SQL](sql-programming-guide.html) guide to learn more abo
***
## MLlib Operations
-You can also easily use machine learning algorithms provided by [MLlib](mllib-guide.html). First of all, there are streaming machine learning algorithms (e.g. [Streaming Linear Regression](mllib-linear-methods.html#streaming-linear-regression), [Streaming KMeans](mllib-clustering.html#streaming-k-means), etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the [MLlib](mllib-guide.html) guide for more details.
+You can also easily use machine learning algorithms provided by [MLlib](ml-guide.html). First of all, there are streaming machine learning algorithms (e.g. [Streaming Linear Regression](mllib-linear-methods.html#streaming-linear-regression), [Streaming KMeans](mllib-clustering.html#streaming-k-means), etc.) which can simultaneously learn from the streaming data and apply the model to it. Beyond these, for a much larger class of machine learning algorithms, you can train a model offline (i.e. on historical data) and then apply it online to streaming data. See the [MLlib](ml-guide.html) guide for more details.
***