Diffstat (limited to 'docs/mllib-guide.md')
-rw-r--r-- | docs/mllib-guide.md | 172
1 file changed, 115 insertions(+), 57 deletions(-)
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 0963a99881..c49f857d07 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -3,63 +3,121 @@ layout: global
 title: Machine Learning Library (MLlib)
 ---
+MLlib is a Spark implementation of some common machine learning algorithms and utilities,
+including classification, regression, clustering, collaborative
+filtering, and dimensionality reduction, as well as underlying optimization primitives:
-MLlib is a Spark implementation of some common machine learning (ML)
-functionality, as well associated tests and data generators. MLlib
-currently supports four common types of machine learning problem settings,
-namely classification, regression, clustering and collaborative filtering,
-as well as an underlying gradient descent optimization primitive and several
-linear algebra methods.
-
-# Available Methods
-The following links provide a detailed explanation of the methods and usage examples for each of them:
-
-* <a href="mllib-classification-regression.html">Classification and Regression</a>
-  * Binary Classification
-    * SVM (L1 and L2 regularized)
-    * Logistic Regression (L1 and L2 regularized)
-  * Linear Regression
-    * Least Squares
-    * Lasso
-    * Ridge Regression
-  * Decision Tree (for classification and regression)
-* <a href="mllib-clustering.html">Clustering</a>
-  * k-Means
-* <a href="mllib-collaborative-filtering.html">Collaborative Filtering</a>
-  * Matrix Factorization using Alternating Least Squares
-* <a href="mllib-optimization.html">Optimization</a>
-  * Gradient Descent and Stochastic Gradient Descent
-* <a href="mllib-linear-algebra.html">Linear Algebra</a>
-  * Singular Value Decomposition
-  * Principal Component Analysis
-
-# Data Types
-
-Most MLlib algorithms operate on RDDs containing vectors. In Java and Scala, the
-[Vector](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) class is used to
-represent vectors. You can create either dense or sparse vectors using the
-[Vectors](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) factory.
-
-In Python, MLlib can take the following vector types:
-
-* [NumPy](http://www.numpy.org) arrays
-* Standard Python lists (e.g. `[1, 2, 3]`)
-* The MLlib [SparseVector](api/python/pyspark.mllib.linalg.SparseVector-class.html) class
-* [SciPy sparse matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html)
-
-For efficiency, we recommend using NumPy arrays over lists, and using the
-[CSC format](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
-for SciPy matrices, or MLlib's own SparseVector class.
-
-Several other simple data types are used throughout the library, e.g. the LabeledPoint
-class ([Java/Scala](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint),
-[Python](api/python/pyspark.mllib.regression.LabeledPoint-class.html)) for labeled data.
-
-# Dependencies
-MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself
-depends on native Fortran routines. You may need to install the
-[gfortran runtime library](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries)
-if it is not already present on your nodes. MLlib will throw a linking error if it cannot
-detect these libraries automatically.
+* [Basics](mllib-basics.html)
+  * data types
+  * summary statistics
+* Classification and regression
+  * [linear support vector machine (SVM)](mllib-linear-methods.html#linear-support-vector-machine-svm)
+  * [logistic regression](mllib-linear-methods.html#logistic-regression)
+  * [linear least squares, Lasso, and ridge regression](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
+  * [decision tree](mllib-decision-tree.html)
+  * [naive Bayes](mllib-naive-bayes.html)
+* [Collaborative filtering](mllib-collaborative-filtering.html)
+  * alternating least squares (ALS)
+* [Clustering](mllib-clustering.html)
+  * k-means
+* [Dimensionality reduction](mllib-dimensionality-reduction.html)
+  * singular value decomposition (SVD)
+  * principal component analysis (PCA)
+* [Optimization](mllib-optimization.html)
+  * stochastic gradient descent
+  * limited-memory BFGS (L-BFGS)
+
+MLlib is currently a *beta* component under active development.
+The APIs may change in future releases, and we will provide a migration guide between releases.
+
+## Dependencies
+
+MLlib uses the linear algebra packages [Breeze](http://www.scalanlp.org/), which depends on
+[netlib-java](https://github.com/fommil/netlib-java), and
+[jblas](https://github.com/mikiobraun/jblas).
+`netlib-java` and `jblas` depend on native Fortran routines.
+You need to install the
+[gfortran runtime library](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries) if it is not
+already present on your nodes. MLlib will throw a linking error if it cannot detect these libraries
+automatically. Due to licensing issues, we do not include `netlib-java`'s native libraries in MLlib's
+dependency set. If no native library is available at runtime, you will see a warning message. To
+use native libraries from `netlib-java`, please include the artifact
+`com.github.fommil.netlib:all:1.1.2` as a dependency of your project or build your own (see
+[instructions](https://github.com/fommil/netlib-java/blob/master/README.md#machine-optimised-system-libraries)). To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
+
+---
+
+## Migration guide
+
+### From 0.9 to 1.0
+
+In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
+breaking changes. If your data is sparse, please store it in a sparse format instead of dense to
+take advantage of sparsity in both storage and computation.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+We used to represent a feature vector by `Array[Double]`, which is replaced by
+[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) in v1.0. Algorithms that used
+to accept `RDD[Array[Double]]` now take
+`RDD[Vector]`. [`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint)
+is now a wrapper of `(Double, Vector)` instead of `(Double, Array[Double])`. Converting
+`Array[Double]` to `Vector` is straightforward:
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+val array: Array[Double] = ... // a double array
+val vector: Vector = Vectors.dense(array) // a dense vector
+{% endhighlight %}
+
+[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to
+create sparse vectors.
+
+*Note*: Scala imports `scala.collection.immutable.Vector` by default, so you have to import
+`org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+We used to represent a feature vector by `double[]`, which is replaced by
+[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) in v1.0. Algorithms that used
+to accept `RDD<double[]>` now take
+`RDD<Vector>`. [`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint)
+is now a wrapper of `(double, Vector)` instead of `(double, double[])`. Converting `double[]` to
+`Vector` is straightforward:
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+double[] array = ... // a double array
+Vector vector = Vectors.dense(array); // a dense vector
+{% endhighlight %}
+
+[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to
+create sparse vectors.
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+We used to represent a labeled feature vector in a NumPy array, where the first entry corresponds to
+the label and the rest are features. This representation is replaced by the class
+[`LabeledPoint`](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html), which takes both
+dense and sparse feature vectors.
+
+{% highlight python %}
+from pyspark.mllib.linalg import SparseVector
+from pyspark.mllib.regression import LabeledPoint
+
+# Create a labeled point with a positive label and a dense feature vector.
+pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
+
+# Create a labeled point with a negative label and a sparse feature vector.
+neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
+{% endhighlight %}
+</div>
+</div>
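Editor's note on the `SparseVector(size, indices, values)` form used in the patch's Python example: it stores only the nonzero entries of a vector, as parallel index and value arrays. A minimal plain-Python sketch (illustrative only, not the pyspark class; the helper name `to_dense` is made up here) of how such a triple expands back to a dense vector:

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Mirrors SparseVector(3, [0, 2], [1.0, 3.0]) from the example above.
print(to_dense(3, [0, 2], [1.0, 3.0]))  # → [1.0, 0.0, 3.0]
```

This is why the docs recommend sparse formats for sparse data: only `len(indices)` entries are stored and touched, instead of all `size` of them.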