From 26d35f3fd942761b0adecd1a720e1fa834db4de9 Mon Sep 17 00:00:00 2001
From: Xiangrui Meng
Date: Tue, 22 Apr 2014 11:20:47 -0700
Subject: [SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0

Preview: http://54.82.240.23:4000/mllib-guide.html

Table of contents:

* Basics
  * Data types
  * Summary statistics
* Classification and regression
  * linear support vector machine (SVM)
  * logistic regression
  * linear least squares, Lasso, and ridge regression
  * decision tree
  * naive Bayes
* Collaborative Filtering
  * alternating least squares (ALS)
* Clustering
  * k-means
* Dimensionality reduction
  * singular value decomposition (SVD)
  * principal component analysis (PCA)
* Optimization
  * stochastic gradient descent
  * limited-memory BFGS (L-BFGS)

Author: Xiangrui Meng

Closes #422 from mengxr/mllib-doc and squashes the following commits:

944e3a9 [Xiangrui Meng] merge master
f9fda28 [Xiangrui Meng] minor
9474065 [Xiangrui Meng] add alpha to ALS examples
928e630 [Xiangrui Meng] initialization_mode -> initializationMode
5bbff49 [Xiangrui Meng] add imports to labeled point examples
c17440d [Xiangrui Meng] fix python nb example
28f40dc [Xiangrui Meng] remove localhost:4000
369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
7dc95cc [Xiangrui Meng] update linear methods
053ad8a [Xiangrui Meng] add links to go back to the main page
abbbf7e [Xiangrui Meng] update ALS argument names
648283e [Xiangrui Meng] level down statistics
14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
8cd2441 [Xiangrui Meng] minor updates
186ab07 [Xiangrui Meng] update section names
6568d65 [Xiangrui Meng] update toc, level up lr and svm
162ee12 [Xiangrui Meng] rename section names
5c1e1b1 [Xiangrui Meng] minor
8aeaba1 [Xiangrui Meng] wrap long lines
6ce6a6f [Xiangrui Meng] add summary statistics to toc
5760045 [Xiangrui Meng] claim beta
cc604bf [Xiangrui Meng] remove classification and regression
92747b3 [Xiangrui Meng] make section titles consistent
e605dd6 [Xiangrui Meng] add LIBSVM loader
f639674 [Xiangrui Meng] add python section to migration guide
c82ffb4 [Xiangrui Meng] clean optimization
31660eb [Xiangrui Meng] update linear algebra and stat
0a40837 [Xiangrui Meng] first pass over linear methods
1fc8271 [Xiangrui Meng] update toc
906ed0a [Xiangrui Meng] add a python example to naive bayes
5f0a700 [Xiangrui Meng] update collaborative filtering
656d416 [Xiangrui Meng] update mllib-clustering
86e143a [Xiangrui Meng] remove data types section from main page
8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
d1b5cbf [Xiangrui Meng] merge master
72e4804 [Xiangrui Meng] one pass over tree guide
64f8995 [Xiangrui Meng] move decision tree guide to a separate file
9fca001 [Xiangrui Meng] add first version of linear algebra guide
53c9552 [Xiangrui Meng] update dependencies
f316ec2 [Xiangrui Meng] add migration guide
f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
182460f [Xiangrui Meng] add guide for naive Bayes
137fd1d [Xiangrui Meng] re-organize toc
a61e434 [Xiangrui Meng] update mllib's toc
---
 docs/mllib-basics.md | 476 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 476 insertions(+)
 create mode 100644 docs/mllib-basics.md

diff --git a/docs/mllib-basics.md b/docs/mllib-basics.md
new file mode 100644
index 0000000000..710ce1721f
--- /dev/null
+++ b/docs/mllib-basics.md
@@ -0,0 +1,476 @@
+---
+layout: global
+title: MLlib - Basics
+---
+
+* Table of contents
+{:toc}
+
+MLlib supports 
local vectors and matrices stored on a single machine,
+as well as distributed matrices backed by one or more RDDs.
+In the current implementation, local vectors and matrices are simple data models
+that serve as public interfaces. The underlying linear algebra operations are provided by
+[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
+A training example used in supervised learning is called a "labeled point" in MLlib.
+
+## Local vector
+
+A local vector has integer-typed and 0-based indices and double-typed values, stored on a single
+machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by
+a double array representing its entry values, while a sparse vector is backed by two parallel
+arrays: indices and values. For example, a vector $(1.0, 0.0, 3.0)$ can be represented in dense
+format as `[1.0, 0.0, 3.0]` or in sparse format as `(3, [0, 2], [1.0, 3.0])`, where `3` is the size
+of the vector.
+
+
+
+The base class of local vectors is
+[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
+implementations: [`DenseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseVector) and
+[`SparseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
+using the factory methods implemented in
+[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+// Create a dense vector (1.0, 0.0, 3.0).
+val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
+// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
+val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
+// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
+val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
+{% endhighlight %}
+
+***Note***
+
+Scala imports `scala.collection.immutable.Vector` by default, so you have to import
+`org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
+
+ +
+
+The base class of local vectors is
+[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
+implementations: [`DenseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseVector) and
+[`SparseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
+using the factory methods implemented in
+[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors.
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+// Create a dense vector (1.0, 0.0, 3.0).
+Vector dv = Vectors.dense(1.0, 0.0, 3.0);
+// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
+Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
+{% endhighlight %}
+ +
+MLlib recognizes the following types as dense vectors:
+
+* NumPy's [`array`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html)
+* Python's list, e.g., `[1, 2, 3]`
+
+and the following as sparse vectors:
+
+* MLlib's [`SparseVector`](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html)
+* SciPy's
+  [`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
+  with a single column
+
+We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
+in [`Vectors`](api/pyspark/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.
+
+{% highlight python %}
+import numpy as np
+import scipy.sparse as sps
+from pyspark.mllib.linalg import Vectors
+
+# Use a NumPy array as a dense vector.
+dv1 = np.array([1.0, 0.0, 3.0])
+# Use a Python list as a dense vector.
+dv2 = [1.0, 0.0, 3.0]
+# Create a SparseVector.
+sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
+# Use a single-column SciPy csc_matrix as a sparse vector.
+sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))
+{% endhighlight %}
+
+
+
+## Labeled point
+
+A labeled point is a local vector, either dense or sparse, associated with a label/response.
+In MLlib, labeled points are used in supervised learning algorithms.
+We use a double to store a label, so we can use labeled points in both regression and classification.
+For binary classification, the label should be either $0$ (negative) or $1$ (positive).
+For multiclass classification, labels should be class indices starting from zero: $0, 1, 2, \ldots$.
+
+ +
+ +A labeled point is represented by the case class +[`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint). + +{% highlight scala %} +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.regression.LabeledPoint + +// Create a labeled point with a positive label and a dense feature vector. +val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)) + +// Create a labeled point with a negative label and a sparse feature vector. +val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))) +{% endhighlight %} +
+ +
+
+A labeled point is represented by
+[`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint).
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.mllib.regression.LabeledPoint;
+
+// Create a labeled point with a positive label and a dense feature vector.
+LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));
+
+// Create a labeled point with a negative label and a sparse feature vector.
+LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
+{% endhighlight %}
+ +
+ +A labeled point is represented by +[`LabeledPoint`](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html). + +{% highlight python %} +from pyspark.mllib.linalg import SparseVector +from pyspark.mllib.regression import LabeledPoint + +# Create a labeled point with a positive label and a dense feature vector. +pos = LabeledPoint(1.0, [1.0, 0.0, 3.0]) + +# Create a labeled point with a negative label and a sparse feature vector. +neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0])) +{% endhighlight %} +
+
+ +***Sparse data*** + +It is very common in practice to have sparse training data. MLlib supports reading training +examples stored in `LIBSVM` format, which is the default format used by +[`LIBSVM`](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and +[`LIBLINEAR`](http://www.csie.ntu.edu.tw/~cjlin/liblinear/). It is a text format. Each line +represents a labeled sparse feature vector using the following format: + +~~~ +label index1:value1 index2:value2 ... +~~~ + +where the indices are one-based and in ascending order. +After loading, the feature indices are converted to zero-based. + +
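+
+For example, a point with label `1` whose first feature is `0.5` and third feature is `2.0`
+(all other features zero) is written as the following line; note the one-based indices:
+
+~~~
+1 1:0.5 3:2.0
+~~~
+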
+
+ +[`MLUtils.loadLibSVMData`](api/mllib/index.html#org.apache.spark.mllib.util.MLUtils$) reads training +examples stored in LIBSVM format. + +{% highlight scala %} +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.rdd.RDD + +val training: RDD[LabeledPoint] = MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt") +{% endhighlight %} +
+ +
+[`MLUtils.loadLibSVMData`](api/mllib/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
+examples stored in LIBSVM format.
+
+{% highlight java %}
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.util.MLUtils;
+import org.apache.spark.rdd.RDD;
+
+RDD<LabeledPoint> training = MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt");
+{% endhighlight %}
+
+
+## Local matrix
+
+A local matrix has integer-typed row and column indices and double-typed values, stored on a single
+machine. MLlib supports dense matrices, whose entry values are stored in a single double array in
+column-major order. For example, the following matrix `\[ \begin{pmatrix}
+1.0 & 2.0 \\
+3.0 & 4.0 \\
+5.0 & 6.0
+\end{pmatrix}
+\]`
+is stored in the one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the matrix size `(3, 2)`.
+Sparse matrices will be added in the next release.
+
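+
+To make the column-major layout concrete, here is a minimal Scala sketch (the `values`, `m`, and
+`entry` names are ours for illustration, not part of the MLlib API): entry $(i, j)$ of an
+$m \times n$ matrix lives at offset `i + j * m` in the backing array.
+
+{% highlight scala %}
+// Backing array of the 3-by-2 matrix above, laid out column by column.
+val values = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)
+val m = 3 // number of rows
+
+// Entry (i, j) sits at offset i + j * m in the array.
+def entry(i: Int, j: Int): Double = values(i + j * m)
+
+println(entry(0, 0)) // 1.0
+println(entry(2, 1)) // 6.0
+{% endhighlight %}
+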
+
+
+The base class of local matrices is
+[`Matrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
+implementation: [`DenseMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
+Sparse matrices will be added in the next release. We recommend using the factory methods implemented
+in [`Matrices`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrices$) to create local
+matrices.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.{Matrix, Matrices}
+
+// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
+val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
+{% endhighlight %}
+ +
+
+The base class of local matrices is
+[`Matrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
+implementation: [`DenseMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
+Sparse matrices will be added in the next release. We recommend using the factory methods implemented
+in [`Matrices`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrices$) to create local
+matrices.
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Matrix;
+import org.apache.spark.mllib.linalg.Matrices;
+
+// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
+Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});
+{% endhighlight %}
+ +
+
+## Distributed matrix
+
+A distributed matrix has long-typed row and column indices and double-typed values, distributed
+across one or more RDDs. It is very important to choose the right format to store large
+distributed matrices. Converting a distributed matrix to a different format may require a
+global shuffle, which is quite expensive. We implemented three types of distributed matrices in
+this release and will add more types in the future.
+
+***Note***
+
+The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
+Non-deterministic RDDs are generally error-prone.
+
+### RowMatrix
+
+A `RowMatrix` is a row-oriented distributed matrix without meaningful row indices, backed by an RDD
+of its rows, where each row is a local vector. This is similar to a data matrix in the context of
+multivariate statistics. Since each row is represented by a local vector, the number of columns is
+limited by the integer range, but in practice it should be much smaller.
+
+
+
+A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
+created from an `RDD[Vector]` instance. Then we can compute its column summary statistics.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.rdd.RDD
+
+val rows: RDD[Vector] = ... // an RDD of local vectors
+// Create a RowMatrix from an RDD[Vector].
+val mat: RowMatrix = new RowMatrix(rows)
+
+// Get its size.
+val m = mat.numRows()
+val n = mat.numCols()
+{% endhighlight %}
+ +
+
+A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
+created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+
+JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
+// Create a RowMatrix from a JavaRDD<Vector>.
+RowMatrix mat = new RowMatrix(rows.rdd());
+
+// Get its size.
+long m = mat.numRows();
+long n = mat.numCols();
+{% endhighlight %}
+
+
+#### Multivariate summary statistics
+
+We provide column summary statistics for `RowMatrix`.
+If the number of columns is not large, say, smaller than 3000, you can also compute
+the covariance matrix as a local matrix, which requires $\mathcal{O}(n^2)$ storage, where $n$ is the
+number of columns. The total CPU time is $\mathcal{O}(m n^2)$, where $m$ is the number of rows;
+the computation can be faster if the rows are sparse.
+
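+
+As a rough sense of scale for these bounds: with $n = 3000$ columns, the covariance matrix has
+$n^2 = 9 \times 10^6$ entries, i.e., about 72 MB stored as 8-byte doubles, which still fits
+comfortably on a single machine; the cost grows quadratically in the number of columns beyond that.
+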
+
+
+`RowMatrix#computeColumnSummaryStatistics` returns an instance of
+[`MultivariateStatisticalSummary`](api/mllib/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
+which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
+total count.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
+
+val mat: RowMatrix = ... // a RowMatrix
+
+// Compute column summary statistics.
+val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
+println(summary.mean) // a dense vector containing the mean value for each column
+println(summary.variance) // column-wise variance
+println(summary.numNonzeros) // number of nonzeros in each column
+
+// Compute the covariance matrix.
+val cov: Matrix = mat.computeCovariance()
+{% endhighlight %}
+
+
+### IndexedRowMatrix
+
+An `IndexedRowMatrix` is similar to a `RowMatrix` but with meaningful row indices. It is backed by
+an RDD of indexed rows, where each row is represented by its index (long-typed) and a local vector.
+
+
+
+An
+[`IndexedRowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
+can be created from an `RDD[IndexedRow]` instance, where
+[`IndexedRow`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
+wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
+its row indices.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
+import org.apache.spark.rdd.RDD
+
+val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
+// Create an IndexedRowMatrix from an RDD[IndexedRow].
+val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
+
+// Get its size.
+val m = mat.numRows()
+val n = mat.numCols()
+
+// Drop its row indices.
+val rowMat: RowMatrix = mat.toRowMatrix()
+{% endhighlight %}
+ +
+
+An
+[`IndexedRowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
+can be created from a `JavaRDD<IndexedRow>` instance, where
+[`IndexedRow`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
+wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
+its row indices.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.linalg.distributed.IndexedRow;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
+import org.apache.spark.mllib.linalg.distributed.RowMatrix;
+
+JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
+// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
+IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
+
+// Get its size.
+long m = mat.numRows();
+long n = mat.numCols();
+
+// Drop its row indices.
+RowMatrix rowMat = mat.toRowMatrix();
+{% endhighlight %}
+
+### CoordinateMatrix
+
+A `CoordinateMatrix` is a distributed matrix backed by an RDD of its entries. Each entry is a tuple
+of `(i: Long, j: Long, value: Double)`, where `i` is the row index, `j` is the column index, and
+`value` is the entry value. A `CoordinateMatrix` should be used only when both
+dimensions of the matrix are huge and the matrix is very sparse.
+
+
+
+A
+[`CoordinateMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
+can be created from an `RDD[MatrixEntry]` instance, where
+[`MatrixEntry`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
+wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
+with sparse rows by calling `toIndexedRowMatrix`. In this release, we do not provide other
+computations for `CoordinateMatrix`.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
+import org.apache.spark.rdd.RDD
+
+val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
+// Create a CoordinateMatrix from an RDD[MatrixEntry].
+val mat: CoordinateMatrix = new CoordinateMatrix(entries)
+
+// Get its size.
+val m = mat.numRows()
+val n = mat.numCols()
+
+// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
+val indexedRowMatrix = mat.toIndexedRowMatrix()
+{% endhighlight %}
+ +
+
+A
+[`CoordinateMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
+can be created from a `JavaRDD<MatrixEntry>` instance, where
+[`MatrixEntry`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
+wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
+with sparse rows by calling `toIndexedRowMatrix`.
+
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
+import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
+
+JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
+// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
+CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());
+
+// Get its size.
+long m = mat.numRows();
+long n = mat.numCols();
+
+// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
+IndexedRowMatrix indexedRowMatrix = mat.toIndexedRowMatrix();
+{% endhighlight %}
+