author    Ameet Talwalkar <atalwalkar@gmail.com>    2014-08-12 17:15:21 -0700
committer Xiangrui Meng <meng@databricks.com>       2014-08-12 17:15:21 -0700
commit    c235b83e2782cce0626ecc403c0a67e442be52c1 (patch)
tree      30b4ada17cba016cc2a8a7f01f09b7bcb78fbace
parent    882da57a1c8c075a87909d516b169b624941a6ec (diff)
SPARK-2830 [MLlib]: re-organize mllib documentation
As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.

Author: Ameet Talwalkar <atalwalkar@gmail.com>

Closes #1908 from atalwalkar/master and squashes the following commits:

fe6938a [Ameet Talwalkar] made xiangruis suggested changes
840028b [Ameet Talwalkar] made xiangruis suggested changes
7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
-rw-r--r--  docs/mllib-basics.md                       117
-rw-r--r--  docs/mllib-classification-regression.md     37
-rw-r--r--  docs/mllib-clustering.md                     15
-rw-r--r--  docs/mllib-collaborative-filtering.md        21
-rw-r--r--  docs/mllib-dimensionality-reduction.md       44
-rw-r--r--  docs/mllib-feature-extraction.md             12
-rw-r--r--  docs/mllib-guide.md                          30
-rw-r--r--  docs/mllib-linear-methods.md                134
-rw-r--r--  docs/mllib-naive-bayes.md                    32
-rw-r--r--  docs/mllib-stats.md                          95
10 files changed, 317 insertions, 220 deletions
diff --git a/docs/mllib-basics.md b/docs/mllib-basics.md
index f9585251fa..8752df4129 100644
--- a/docs/mllib-basics.md
+++ b/docs/mllib-basics.md
@@ -9,17 +9,17 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
MLlib supports local vectors and matrices stored on a single machine,
as well as distributed matrices backed by one or more RDDs.
-In the current implementation, local vectors and matrices are simple data models
-to serve public interfaces. The underlying linear algebra operations are provided by
+Local vectors and local matrices are simple data models
+that serve as public interfaces. The underlying linear algebra operations are provided by
[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
-A training example used in supervised learning is called "labeled point" in MLlib.
+A training example used in supervised learning is called a "labeled point" in MLlib.
## Local vector
A local vector has integer-typed and 0-based indices and double-typed values, stored on a single
machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by
a double array representing its entry values, while a sparse vector is backed by two parallel
-arrays: indices and values. For example, a vector $(1.0, 0.0, 3.0)$ can be represented in dense
+arrays: indices and values. For example, a vector `(1.0, 0.0, 3.0)` can be represented in dense
format as `[1.0, 0.0, 3.0]` or in sparse format as `(3, [0, 2], [1.0, 3.0])`, where `3` is the size
of the vector.
@@ -44,8 +44,7 @@ val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
{% endhighlight %}
-***Note***
-
+***Note:***
Scala imports `scala.collection.immutable.Vector` by default, so you have to import
`org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
@@ -110,8 +109,8 @@ sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])),
A labeled point is a local vector, either dense or sparse, associated with a label/response.
In MLlib, labeled points are used in supervised learning algorithms.
We use a double to store a label, so we can use labeled points in both regression and classification.
-For binary classification, label should be either $0$ (negative) or $1$ (positive).
-For multiclass classification, labels should be class indices staring from zero: $0, 1, 2, \ldots$.
+For binary classification, a label should be either `0` (negative) or `1` (positive).
+For multiclass classification, labels should be class indices starting from zero: `0, 1, 2, ...`.
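For illustration, a minimal Scala sketch of creating labeled points with the `LabeledPoint` and `Vectors` factory methods (the per-language tabs below cover the same idea):

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// A labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
{% endhighlight %}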
<div class="codetabs">
@@ -172,7 +171,7 @@ neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
It is very common in practice to have sparse training data. MLlib supports reading training
examples stored in `LIBSVM` format, which is the default format used by
[`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
-[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/). It is a text format. Each line
+[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/). It is a text format in which each line
represents a labeled sparse feature vector using the following format:
~~~
@@ -226,7 +225,7 @@ examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
## Local matrix
A local matrix has integer-typed row and column indices and double-typed values, stored on a single
-machine. MLlib supports dense matrix, whose entry values are stored in a single double array in
+machine. MLlib supports dense matrices, whose entry values are stored in a single double array in
column major. For example, the following matrix `\[ \begin{pmatrix}
1.0 & 2.0 \\
3.0 & 4.0 \\
@@ -234,7 +233,6 @@ column major. For example, the following matrix `\[ \begin{pmatrix}
\end{pmatrix}
\]`
is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the matrix size `(3, 2)`.
-We are going to add sparse matrix in the next release.
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -242,7 +240,7 @@ We are going to add sparse matrix in the next release.
The base class of local matrices is
[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
implementation: [`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
-Sparse matrix will be added in the next release. We recommend using the factory methods implemented
+We recommend using the factory methods implemented
in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
matrices.
@@ -259,7 +257,7 @@ val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
The base class of local matrices is
[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide one
implementation: [`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html).
-Sparse matrix will be added in the next release. We recommend using the factory methods implemented
+We recommend using the factory methods implemented
in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local
matrices.
@@ -279,28 +277,30 @@ Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});
A distributed matrix has long-typed row and column indices and double-typed values, stored
distributively in one or more RDDs. It is very important to choose the right format to store large
and distributed matrices. Converting a distributed matrix to a different format may require a
-global shuffle, which is quite expensive. We implemented three types of distributed matrices in
-this release and will add more types in the future.
+global shuffle, which is quite expensive. Three types of distributed matrices have been implemented
+so far.
The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
matrix without meaningful row indices, e.g., a collection of feature vectors.
It is backed by an RDD of its rows, where each row is a local vector.
-We assume that the number of columns is not huge for a `RowMatrix`.
+We assume that the number of columns is not huge for a `RowMatrix` so that a single
+local vector can be reasonably communicated to the driver and can also be stored /
+operated on using a single node.
An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
-which can be used for identifying rows and joins.
-A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix) format,
+which can be used for identifying rows and executing joins.
+A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) format,
backed by an RDD of its entries.
***Note***
The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
-It is always error-prone to have non-deterministic RDDs.
+In general the use of non-deterministic RDDs can lead to errors.
### RowMatrix
A `RowMatrix` is a row-oriented distributed matrix without meaningful row indices, backed by an RDD
-of its rows, where each row is a local vector. This is similar to `data matrix` in the context of
-multivariate statistics. Since each row is represented by a local vector, the number of columns is
+of its rows, where each row is a local vector.
+Since each row is represented by a local vector, the number of columns is
limited by the integer range but it should be much smaller in practice.
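As a rough Scala sketch (assuming `rows` is an existing `RDD[Vector]` placeholder), a `RowMatrix` can be created and its size queried as follows:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val rows: RDD[Vector] = ... // an RDD of local vectors

// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()
{% endhighlight %}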
<div class="codetabs">
@@ -344,70 +344,10 @@ long n = mat.numCols();
</div>
</div>
-#### Multivariate summary statistics
-
-We provide column summary statistics for `RowMatrix`.
-If the number of columns is not large, say, smaller than 3000, you can also compute
-the covariance matrix as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
-number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
-which could be faster if the rows are sparse.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-[`RowMatrix#computeColumnSummaryStatistics`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
-[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
-which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
-total count.
-
-{% highlight scala %}
-import org.apache.spark.mllib.linalg.Matrix
-import org.apache.spark.mllib.linalg.distributed.RowMatrix
-import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
-
-val mat: RowMatrix = ... // a RowMatrix
-
-// Compute column summary statistics.
-val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
-println(summary.mean) // a dense vector containing the mean value for each column
-println(summary.variance) // column-wise variance
-println(summary.numNonzeros) // number of nonzeros in each column
-
-// Compute the covariance matrix.
-val cov: Matrix = mat.computeCovariance()
-{% endhighlight %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
-[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
-which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
-total count.
-
-{% highlight java %}
-import org.apache.spark.mllib.linalg.Matrix;
-import org.apache.spark.mllib.linalg.distributed.RowMatrix;
-import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
-
-RowMatrix mat = ... // a RowMatrix
-
-// Compute column summary statistics.
-MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
-System.out.println(summary.mean()); // a dense vector containing the mean value for each column
-System.out.println(summary.variance()); // column-wise variance
-System.out.println(summary.numNonzeros()); // number of nonzeros in each column
-
-// Compute the covariance matrix.
-Matrix cov = mat.computeCovariance();
-{% endhighlight %}
-</div>
-</div>
-
### IndexedRowMatrix
An `IndexedRowMatrix` is similar to a `RowMatrix` but with meaningful row indices. It is backed by
-an RDD of indexed rows, which each row is represented by its index (long-typed) and a local vector.
+an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.
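A minimal Scala sketch (assuming `rows` is an existing `RDD[IndexedRow]` placeholder) of creating an `IndexedRowMatrix` and dropping its row indices:

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.rdd.RDD

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows

// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Drop the row indices to get back a RowMatrix.
val rowMat: RowMatrix = mat.toRowMatrix()
{% endhighlight %}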
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -467,7 +407,7 @@ RowMatrix rowMat = mat.toRowMatrix();
A `CoordinateMatrix` is a distributed matrix backed by an RDD of its entries. Each entry is a tuple
of `(i: Long, j: Long, value: Double)`, where `i` is the row index, `j` is the column index, and
-`value` is the entry value. A `CoordinateMatrix` should be used only in the case when both
+`value` is the entry value. A `CoordinateMatrix` should be used only when both
dimensions of the matrix are huge and the matrix is very sparse.
<div class="codetabs">
@@ -477,9 +417,9 @@ A
[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
can be created from an `RDD[MatrixEntry]` instance, where
[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
-wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to a `IndexedRowMatrix`
-with sparse rows by calling `toIndexedRowMatrix`. In this release, we do not provide other
-computation for `CoordinateMatrix`.
+wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
+with sparse rows by calling `toIndexedRowMatrix`. Other computations for
+`CoordinateMatrix` are not currently supported.
{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
@@ -503,8 +443,9 @@ A
[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
can be created from a `JavaRDD<MatrixEntry>` instance, where
[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
-wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to a `IndexedRowMatrix`
-with sparse rows by calling `toIndexedRowMatrix`.
+wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
+with sparse rows by calling `toIndexedRowMatrix`. Other computations for
+`CoordinateMatrix` are not currently supported.
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md
new file mode 100644
index 0000000000..719cc95767
--- /dev/null
+++ b/docs/mllib-classification-regression.md
@@ -0,0 +1,37 @@
+---
+layout: global
+title: Classification and Regression - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Classification and Regression
+---
+
+MLlib supports various methods for
+[binary classification](http://en.wikipedia.org/wiki/Binary_classification),
+[multiclass
+classification](http://en.wikipedia.org/wiki/Multiclass_classification), and
+[regression analysis](http://en.wikipedia.org/wiki/Regression_analysis). The table below outlines
+the supported algorithms for each type of problem.
+
+<table class="table">
+ <thead>
+ <tr><th>Problem Type</th><th>Supported Methods</th></tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Binary Classification</td><td>linear SVMs, logistic regression, decision trees, naive Bayes</td>
+ </tr>
+ <tr>
+ <td>Multiclass Classification</td><td>decision trees, naive Bayes</td>
+ </tr>
+ <tr>
+ <td>Regression</td><td>linear least squares, Lasso, ridge regression, decision trees</td>
+ </tr>
+ </tbody>
+</table>
+
+More details for these methods can be found here:
+
+* [Linear models](mllib-linear-methods.html)
+ * [binary classification (SVMs, logistic regression)](mllib-linear-methods.html#binary-classification)
+ * [linear regression (least squares, Lasso, ridge)](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
+* [Decision trees](mllib-decision-tree.html)
+* [Naive Bayes](mllib-naive-bayes.html)
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 561de48910..dfd9cd5728 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -38,7 +38,7 @@ a given dataset, the algorithm returns the best clustering result).
<div class="codetabs">
<div data-lang="scala" markdown="1">
-Following code snippets can be executed in `spark-shell`.
+The following code snippets can be executed in `spark-shell`.
In the following example after loading and parsing data, we use the
[`KMeans`](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data
@@ -70,7 +70,7 @@ All of MLlib's methods use Java-friendly types, so you can import and call them
way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
calling `.rdd()` on your `JavaRDD` object. A standalone application example
-that is equivalent to the provided example in Scala is given bellow:
+that is equivalent to the provided example in Scala is given below:
{% highlight java %}
import org.apache.spark.api.java.*;
@@ -113,14 +113,15 @@ public class KMeansExample {
}
{% endhighlight %}
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* in your build file as
+a dependency.
</div>
<div data-lang="python" markdown="1">
-Following examples can be tested in the PySpark shell.
+The following examples can be tested in the PySpark shell.
In the following example after loading and parsing data, we use the KMeans object to cluster the
data into two clusters. The number of desired clusters is passed to the algorithm. We then compute
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index 0d28b5f7c8..ab10b2f01f 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -14,13 +14,13 @@ is commonly used for recommender systems. These techniques aim to fill in the
missing entries of a user-item association matrix. MLlib currently supports
model-based collaborative filtering, in which users and products are described
by a small set of latent factors that can be used to predict missing entries.
-In particular, we implement the [alternating least squares
+MLlib uses the [alternating least squares
(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
algorithm to learn these latent factors. The implementation in MLlib has the
following parameters:
* *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
-* *rank* is the number of latent factors in our model.
+* *rank* is the number of latent factors in the model.
* *iterations* is the number of iterations to run.
* *lambda* specifies the regularization parameter in ALS.
* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
@@ -86,8 +86,8 @@ val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
println("Mean Squared Error = " + MSE)
{% endhighlight %}
-If the rating matrix is derived from other source of information (i.e., it is inferred from
-other signals), you can use the trainImplicit method to get better results.
+If the rating matrix is derived from another source of information (e.g., it is inferred from
+other signals), you can use the `trainImplicit` method to get better results.
{% highlight scala %}
val alpha = 0.01
@@ -174,10 +174,11 @@ public class CollaborativeFiltering {
}
{% endhighlight %}
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* in your build file as
+a dependency.
</div>
<div data-lang="python" markdown="1">
@@ -219,5 +220,5 @@ model = ALS.trainImplicit(ratings, rank, numIterations, alpha = 0.01)
## Tutorial
-[AMP Camp](http://ampcamp.berkeley.edu/) provides a hands-on tutorial for
-[personalized movie recommendation with MLlib](http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html).
+The [training exercises](https://databricks-training.s3.amazonaws.com/index.html) from the Spark Summit 2014 include a hands-on tutorial for
+[personalized movie recommendation with MLlib](https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html).
diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md
index 8e434998c1..065d646496 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -9,9 +9,9 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
[Dimensionality reduction](http://en.wikipedia.org/wiki/Dimensionality_reduction) is the process
of reducing the number of variables under consideration.
-It is used to extract latent features from raw and noisy features,
+It can be used to extract latent features from raw and noisy features
or compress data while maintaining the structure.
-In this release, we provide preliminary support for dimensionality reduction on tall-and-skinny matrices.
+MLlib provides support for dimensionality reduction on tall-and-skinny matrices.
## Singular value decomposition (SVD)
@@ -30,17 +30,17 @@ where
* $V$ is an orthonormal matrix, whose columns are called right singular vectors.
For large matrices, usually we don't need the complete factorization but only the top singular
-values and its associated singular vectors. This can save storage, and more importantly, de-noise
+values and their associated singular vectors. This can save storage, de-noise
and recover the low-rank structure of the matrix.
-If we keep the top $k$ singular values, then the dimensions of the return will be:
+If we keep the top $k$ singular values, then the dimensions of the resulting low-rank matrix will be:
* `$U$`: `$m \times k$`,
* `$\Sigma$`: `$k \times k$`,
* `$V$`: `$n \times k$`.
-In this release, we provide SVD computation to row-oriented matrices that have only a few columns,
-say, less than $1000$, but many rows, which we call *tall-and-skinny*.
+MLlib provides SVD functionality for row-oriented matrices that have only a few columns,
+say, less than $1000$, but many rows, i.e., *tall-and-skinny* matrices.
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -58,15 +58,10 @@ val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
{% endhighlight %}
-Same code applies to `IndexedRowMatrix`.
-The only difference that the `U` matrix becomes an `IndexedRowMatrix`.
+The same code applies to `IndexedRowMatrix` if `U` is defined as an
+`IndexedRowMatrix`.
</div>
<div data-lang="java" markdown="1">
-In order to run the following standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
-
{% highlight java %}
import java.util.LinkedList;
@@ -104,8 +99,16 @@ public class SVD {
}
}
{% endhighlight %}
-Same code applies to `IndexedRowMatrix`.
-The only difference that the `U` matrix becomes an `IndexedRowMatrix`.
+
+The same code applies to `IndexedRowMatrix` if `U` is defined as an
+`IndexedRowMatrix`.
+
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* in your build file as
+a dependency.
+
</div>
</div>
@@ -116,7 +119,7 @@ statistical method to find a rotation such that the first coordinate has the lar
possible, and each succeeding coordinate in turn has the largest variance possible. The columns of
the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.
-In this release, we implement PCA for tall-and-skinny matrices stored in row-oriented format.
+MLlib supports PCA for tall-and-skinny matrices stored in row-oriented format.
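As a sketch of the Scala API (with `mat` standing in as a placeholder tall-and-skinny `RowMatrix`), the principal components can be computed as a local matrix and used to project the rows:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ... // a tall-and-skinny RowMatrix

// Compute the top 10 principal components as a local dense matrix.
val pc: Matrix = mat.computePrincipalComponents(10)

// Project the rows of mat onto the space spanned by the principal components.
val projected: RowMatrix = mat.multiply(pc)
{% endhighlight %}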
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -180,9 +183,10 @@ public class PCA {
}
{% endhighlight %}
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* in your build file as
+a dependency.
</div>
</div>
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
new file mode 100644
index 0000000000..21453cb9cd
--- /dev/null
+++ b/docs/mllib-feature-extraction.md
@@ -0,0 +1,12 @@
+---
+layout: global
+title: Feature Extraction - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Feature Extraction
+---
+
+* Table of contents
+{:toc}
+
+## Word2Vec
+
+## TFIDF
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 95ee6bc968..23d5a0c460 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -3,18 +3,19 @@ layout: global
title: Machine Learning Library (MLlib)
---
-MLlib is a Spark implementation of some common machine learning algorithms and utilities,
+MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
including classification, regression, clustering, collaborative
-filtering, dimensionality reduction, as well as underlying optimization primitives:
+filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:
-* [Basics](mllib-basics.html)
- * data types
+* [Data types](mllib-basics.html)
+* [Basic statistics](mllib-stats.html)
+ * data generators
+ * stratified sampling
* summary statistics
-* Classification and regression
- * [linear support vector machine (SVM)](mllib-linear-methods.html#linear-support-vector-machine-svm)
- * [logistic regression](mllib-linear-methods.html#logistic-regression)
- * [linear least squares, Lasso, and ridge regression](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
- * [decision tree](mllib-decision-tree.html)
+ * hypothesis testing
+* [Classification and regression](mllib-classification-regression.html)
+ * [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html)
+ * [decision trees](mllib-decision-tree.html)
* [naive Bayes](mllib-naive-bayes.html)
* [Collaborative filtering](mllib-collaborative-filtering.html)
* alternating least squares (ALS)
@@ -23,17 +24,18 @@ filtering, dimensionality reduction, as well as underlying optimization primitiv
* [Dimensionality reduction](mllib-dimensionality-reduction.html)
* singular value decomposition (SVD)
* principal component analysis (PCA)
-* [Optimization](mllib-optimization.html)
+* [Feature extraction and transformation](mllib-feature-extraction.html)
+* [Optimization (developer)](mllib-optimization.html)
* stochastic gradient descent
* limited-memory BFGS (L-BFGS)
-MLlib is a new component under active development.
+MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
-and we will provide migration guide between releases.
+and the migration guide below will explain all changes between releases.
# Dependencies
-MLlib uses linear algebra packages [Breeze](http://www.scalanlp.org/), which depends on
+MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java), and
[jblas](https://github.com/mikiobraun/jblas).
`netlib-java` and `jblas` depend on native Fortran routines.
@@ -56,7 +58,7 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
breaking changes. If your data is sparse, please store it in a sparse format instead of dense to
-take advantage of sparsity in both storage and computation.
+take advantage of sparsity in both storage and computation. Details are described below.
<div class="codetabs">
<div data-lang="scala" markdown="1">
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 254201147e..e504cd7f0f 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -33,24 +33,24 @@ the task of finding a minimizer of a convex function `$f$` that depends on a var
Formally, we can write this as the optimization problem `$\min_{\wv \in\R^d} \; f(\wv)$`, where
the objective function is of the form
`\begin{equation}
- f(\wv) :=
- \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) +
- \lambda\, R(\wv_i)
+ f(\wv) := \lambda\, R(\wv) +
+ \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)
\label{eq:regPrimal}
\ .
\end{equation}`
Here the vectors `$\x_i\in\R^d$` are the training data examples, for `$1\le i\le n$`, and
`$y_i\in\R$` are their corresponding labels, which we want to predict.
We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function of $\wv^T x$ and $y$.
-Several MLlib's classification and regression algorithms fall into this category,
+Several of MLlib's classification and regression algorithms fall into this category,
and are discussed here.
The objective function `$f$` has two parts:
-the loss that measures the error of the model on the training data,
-and the regularizer that measures the complexity of the model.
-The loss function `$L(\wv;.)$` must be a convex function in `$\wv$`.
-The fixed regularization parameter `$\lambda \ge 0$` (`regParam` in the code) defines the trade-off
-between the two goals of small loss and small model complexity.
+the regularizer that controls the complexity of the model,
+and the loss that measures the error of the model on the training data.
+The loss function `$L(\wv;.)$` is typically a convex function in `$\wv$`. The
+fixed regularization parameter `$\lambda \ge 0$` (`regParam` in the code)
+defines the trade-off between the two goals of minimizing the loss (i.e.,
+training error) and minimizing model complexity (i.e., to avoid overfitting).
### Loss functions
@@ -80,10 +80,10 @@ methods MLlib supports:
### Regularizers
-The purpose of the [regularizer](http://en.wikipedia.org/wiki/Regularization_(mathematics)) is to
-encourage simple models, by punishing the complexity of the model `$\wv$`, in order to e.g. avoid
-over-fitting.
-We support the following regularizers in MLlib:
+The purpose of the
+[regularizer](http://en.wikipedia.org/wiki/Regularization_(mathematics)) is to
+encourage simple models and avoid overfitting. We support the following
+regularizers in MLlib:
<table class="table">
<thead>
@@ -106,27 +106,28 @@ Here `$\mathrm{sign}(\wv)$` is the vector consisting of the signs (`$\pm1$`) of
of `$\wv$`.
L2-regularized problems are generally easier to solve than L1-regularized due to smoothness.
-However, L1 regularization can help promote sparsity in weights, leading to simpler models, which is
-also used for feature selection. It is not recommended to train models without any regularization,
+However, L1 regularization can help promote sparsity in weights, leading to smaller and more interpretable models; the latter can be useful for feature selection.
+It is not recommended to train models without any regularization,
especially when the number of training examples is small.
## Binary classification
-[Binary classification](http://en.wikipedia.org/wiki/Binary_classification) is to divide items into
-two categories: positive and negative. MLlib supports two linear methods for binary classification:
-linear support vector machine (SVM) and logistic regression. The training data set is represented
-by an RDD of [LabeledPoint](mllib-data-types.html) in MLlib. Note that, in the mathematical
-formulation, a training label $y$ is either $+1$ (positive) or $-1$ (negative), which is convenient
-for the formulation. *However*, the negative label is represented by $0$ in MLlib instead of $-1$,
-to be consistent with multiclass labeling.
+[Binary classification](http://en.wikipedia.org/wiki/Binary_classification)
+aims to divide items into two categories: positive and negative. MLlib
+supports two linear methods for binary classification: linear support vector
+machines (SVMs) and logistic regression. For both methods, MLlib supports
+L1 and L2 regularized variants. The training data set is represented by an RDD
+of [LabeledPoint](mllib-data-types.html) in MLlib. Note that, in the
+mathematical formulation in this guide, a training label $y$ is denoted as
+either $+1$ (positive) or $-1$ (negative), which is convenient for the
+formulation. *However*, the negative label is represented by $0$ in MLlib
+instead of $-1$, to be consistent with multiclass labeling.
-### Linear support vector machine (SVM)
+### Linear support vector machines (SVMs)
The [linear SVM](http://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM)
-has become a standard choice for large-scale classification tasks.
-The name "linear SVM" is actually ambiguous.
-By "linear SVM", we mean specifically the linear method with the loss function in formulation
-`$\eqref{eq:regPrimal}$` given by the hinge loss
+is a standard method for large-scale classification tasks. It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss function in the formulation given by the hinge loss:
+
`\[
L(\wv;\x,y) := \max \{0, 1-y \wv^T \x \}.
\]`
@@ -134,39 +135,44 @@ By default, linear SVMs are trained with an L2 regularization.
We also support alternative L1 regularization. In this case,
the problem becomes a [linear program](http://en.wikipedia.org/wiki/Linear_programming).
-Linear SVM algorithm outputs a SVM model, which makes predictions based on the value of $\wv^T \x$.
-By the default, if $\wv^T \x \geq 0$, the outcome is positive, or negative otherwise.
-However, quite often in practice, the default threshold $0$ is not a good choice.
-The threshold should be determined via model evaluation.
+The linear SVM algorithm outputs an SVM model. Given a new data point,
+denoted by $\x$, the model makes predictions based on the value of $\wv^T \x$.
+By default, if $\wv^T \x \geq 0$ then the outcome is positive, and negative
+otherwise.
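A minimal Scala sketch of this workflow, assuming `training` and `testPoint` are placeholder values (the Examples section below gives complete programs):

{% highlight scala %}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val training: RDD[LabeledPoint] = ... // training data
val testPoint: LabeledPoint = ...     // a held-out example

// Train a linear SVM with SGD and the default L2 regularization.
val model = SVMWithSGD.train(training, 100)

// The prediction is based on the sign of w^T x.
val prediction = model.predict(testPoint.features)
{% endhighlight %}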
### Logistic regression
[Logistic regression](http://en.wikipedia.org/wiki/Logistic_regression) is widely used to predict a
-binary response. It is a linear method with the loss function in formulation
-`$\eqref{eq:regPrimal}$` given by the logistic loss
+binary response.
+It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss
+function in the formulation given by the logistic loss:
`\[
L(\wv;\x,y) := \log(1+\exp( -y \wv^T \x)).
\]`
-Logistic regression algorithm outputs a logistic regression model, which makes predictions by
+The logistic regression algorithm outputs a logistic regression model. Given a
+new data point, denoted by $\x$, the model makes predictions by
applying the logistic function
`\[
\mathrm{f}(z) = \frac{1}{1 + e^{-z}}
\]`
where $z = \wv^T \x$.
-By default, if $\mathrm{f}(\wv^T x) > 0.5$, the outcome is positive, or negative otherwise.
-For the same reason mentioned above, quite often in practice, this default threshold is not a good choice.
-The threshold should be determined via model evaluation.
+By default, if $\mathrm{f}(\wv^T \x) > 0.5$, the outcome is positive, and
+negative otherwise. Unlike linear SVMs, the raw output of the logistic regression
+model, $\mathrm{f}(z)$, has a probabilistic interpretation (i.e., the probability
+that $\x$ is positive).
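A corresponding Scala sketch, again with `training` and `testPoint` as placeholders:

{% highlight scala %}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val training: RDD[LabeledPoint] = ... // training data
val testPoint: LabeledPoint = ...     // a held-out example

// Train a logistic regression model with SGD.
val model = LogisticRegressionWithSGD.train(training, 100)

// predict() thresholds f(w^T x) at 0.5 and returns 0.0 or 1.0.
val prediction = model.predict(testPoint.features)
{% endhighlight %}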
### Evaluation metrics
-MLlib supports common evaluation metrics for binary classification (not available in Python). This
+MLlib supports common evaluation metrics for binary classification (not available in PySpark).
+This
includes precision, recall, [F-measure](http://en.wikipedia.org/wiki/F1_score),
[receiver operating characteristic (ROC)](http://en.wikipedia.org/wiki/Receiver_operating_characteristic),
precision-recall curve, and
[area under the curves (AUC)](http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).
-Among the metrics, area under ROC is commonly used to compare models and precision/recall/F-measure
-can help determine the threshold to use.
+AUC is commonly used to compare the performance of various models while
+precision/recall/F-measure can help determine the appropriate threshold to use
+for prediction purposes.
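As a sketch of how these metrics are typically computed in Scala, assuming `scoreAndLabels` is a placeholder `RDD[(Double, Double)]` of raw model scores paired with true labels:

{% highlight scala %}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

val scoreAndLabels: RDD[(Double, Double)] = ... // (raw score, true label) pairs

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println("Area under ROC = " + metrics.areaUnderROC())
println("Area under PR  = " + metrics.areaUnderPR())

// Precision at each candidate threshold, useful for choosing a prediction cutoff.
val precisionByThreshold: RDD[(Double, Double)] = metrics.precisionByThreshold()
{% endhighlight %}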
### Examples
@@ -233,8 +239,7 @@ svmAlg.optimizer.
val modelL1 = svmAlg.run(training)
{% endhighlight %}
-Similarly, you can use replace `SVMWithSGD` by
-[`LogisticRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD).
+[`LogisticRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD) can be used in a similar fashion to `SVMWithSGD`.
</div>
@@ -318,10 +323,11 @@ svmAlg.optimizer()
final SVMModel modelL1 = svmAlg.run(training.rdd());
{% endhighlight %}
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* in your build file as
+a dependency.
</div>
<div data-lang="python" markdown="1">
@@ -354,24 +360,22 @@ print("Training Error = " + str(trainErr))
## Linear least squares, Lasso, and ridge regression
-Linear least squares is a family of linear methods with the loss function in formulation
-`$\eqref{eq:regPrimal}$` given by the squared loss
+Linear least squares is the most common formulation for regression problems.
+It is a linear method as described above in equation `$\eqref{eq:regPrimal}$`, with the loss
+function in the formulation given by the squared loss:
`\[
L(\wv;\x,y) := \frac{1}{2} (\wv^T \x - y)^2.
\]`
-Depending on the regularization type, we call the method
-[*ordinary least squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or simply
-[*linear least squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) if there
-is no regularization, [*ridge regression*](http://en.wikipedia.org/wiki/Ridge_regression) if L2
-regularization is used, and [*Lasso*](http://en.wikipedia.org/wiki/Lasso_(statistics)) if L1
-regularization is used. This average loss $\frac{1}{n} \sum_{i=1}^n (\wv^T x_i - y_i)^2$ is also
+Various related regression methods are derived by using different types of regularization:
+[*ordinary least squares*](http://en.wikipedia.org/wiki/Ordinary_least_squares) or
+[*linear least squares*](http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)) uses
+no regularization; [*ridge regression*](http://en.wikipedia.org/wiki/Ridge_regression) uses L2
+regularization; and [*Lasso*](http://en.wikipedia.org/wiki/Lasso_(statistics)) uses L1
+regularization. For all of these models, the average loss or training error, $\frac{1}{n} \sum_{i=1}^n (\wv^T x_i - y_i)^2$, is
known as the [mean squared error](http://en.wikipedia.org/wiki/Mean_squared_error).
-Note that the squared loss is sensitive to outliers.
-Regularization or a robust alternative (e.g., $\ell_1$ regression) is usually necessary in practice.
-
### Examples
<div class="codetabs">
@@ -379,7 +383,7 @@ Regularization or a robust alternative (e.g., $\ell_1$ regression) is usually ne
<div data-lang="scala" markdown="1">
The following example demonstrates how to load training data and parse it as an RDD of LabeledPoint.
The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
-values. We compute the Mean Squared Error at the end to evaluate
+values. We compute the mean squared error at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
{% highlight scala %}
@@ -407,9 +411,8 @@ val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
{% endhighlight %}
-Similarly you can use
[`RidgeRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
-and [`LassoWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.LassoWithSGD).
+and [`LassoWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.LassoWithSGD) can be used in a similar fashion to `LinearRegressionWithSGD`.
</div>
@@ -479,16 +482,17 @@ public class LinearRegression {
}
{% endhighlight %}
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* in your build file as
+a dependency.
</div>
<div data-lang="python" markdown="1">
The following example demonstrates how to load training data and parse it as an RDD of LabeledPoint.
The example then uses LinearRegressionWithSGD to build a simple linear model to predict label
-values. We compute the Mean Squared Error at the end to evaluate
+values. We compute the mean squared error at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
{% highlight python %}
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
index b1650c83c9..86d94aebd9 100644
--- a/docs/mllib-naive-bayes.md
+++ b/docs/mllib-naive-bayes.md
@@ -4,23 +4,23 @@ title: Naive Bayes - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Naive Bayes
---
-Naive Bayes is a simple multiclass classification algorithm with the assumption of independence
-between every pair of features. Naive Bayes can be trained very efficiently. Within a single pass to
-the training data, it computes the conditional probability distribution of each feature given label,
-and then it applies Bayes' theorem to compute the conditional probability distribution of label
-given an observation and use it for prediction. For more details, please visit the Wikipedia page
-[Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier).
-
-In MLlib, we implemented multinomial naive Bayes, which is typically used for document
-classification. Within that context, each observation is a document, each feature represents a term,
-whose value is the frequency of the term. For its formulation, please visit the Wikipedia page
-[Multinomial Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
-or the section
-[Naive Bayes text classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
-from the book Introduction to Information
-Retrieval. [Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
+[Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple
+multiclass classification algorithm with the assumption of independence between
+every pair of features. Naive Bayes can be trained very efficiently. Within a
+single pass to the training data, it computes the conditional probability
+distribution of each feature given label, and then it applies Bayes' theorem to
+compute the conditional probability distribution of label given an observation
+and use it for prediction.
+
+MLlib supports [multinomial naive
+Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes),
+which is typically used for [document
+classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).
+Within that context, each observation is a document and each
+feature represents a term whose value is the frequency of the term.
+[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
setting the parameter $\lambda$ (default to $1.0$). For document classification, the input feature
-vectors are usually sparse. Please supply sparse vectors as input to take advantage of
+vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of
sparsity. Since the training data is only used once, it is not necessary to cache it.
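A minimal Scala sketch, assuming `training` is a placeholder `RDD[LabeledPoint]` of term-frequency vectors and `testDoc` a held-out document:

{% highlight scala %}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val training: RDD[LabeledPoint] = ... // term-frequency feature vectors, ideally sparse
val testDoc: LabeledPoint = ...       // a held-out document

// Train a multinomial naive Bayes model with additive smoothing lambda = 1.0.
val model = NaiveBayes.train(training, lambda = 1.0)

// Predict the most likely class for the test document.
val prediction = model.predict(testDoc.features)
{% endhighlight %}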
## Examples
diff --git a/docs/mllib-stats.md b/docs/mllib-stats.md
new file mode 100644
index 0000000000..ca9ef46c15
--- /dev/null
+++ b/docs/mllib-stats.md
@@ -0,0 +1,95 @@
+---
+layout: global
+title: Statistics Functionality - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality
+---
+
+* Table of contents
+{:toc}
+
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+## Data Generators
+
+## Stratified Sampling
+
+## Summary Statistics
+
+### Multivariate summary statistics
+
+We provide column summary statistics for `RowMatrix` (note: this functionality is not currently supported in `IndexedRowMatrix` or `CoordinateMatrix`).
+If the number of columns is not large, e.g., on the order of thousands, then the
+covariance matrix can also be computed as a local matrix, which requires $\mathcal{O}(n^2)$ storage where $n$ is the
+number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows,
+and the computation is faster if the rows are sparse.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`computeColumnSummaryStatistics()`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
+[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
+which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
+total count.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
+
+val mat: RowMatrix = ... // a RowMatrix
+
+// Compute column summary statistics.
+val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
+println(summary.mean) // a dense vector containing the mean value for each column
+println(summary.variance) // column-wise variance
+println(summary.numNonzeros) // number of nonzeros in each column
+
+// Compute the covariance matrix.
+val cov: Matrix = mat.computeCovariance()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
+[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
+which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
+total count.
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Matrix;
+import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
+
+RowMatrix mat = ... // a RowMatrix
+
+// Compute column summary statistics.
+MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
+System.out.println(summary.mean()); // a dense vector containing the mean value for each column
+System.out.println(summary.variance()); // column-wise variance
+System.out.println(summary.numNonzeros()); // number of nonzeros in each column
+
+// Compute the covariance matrix.
+Matrix cov = mat.computeCovariance();
+{% endhighlight %}
+</div>
+</div>
+
+
+## Hypothesis Testing