Diffstat (limited to 'docs/mllib-guide.md')
-rw-r--r-- docs/mllib-guide.md 172
1 files changed, 115 insertions, 57 deletions
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 0963a99881..c49f857d07 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -3,63 +3,121 @@ layout: global
title: Machine Learning Library (MLlib)
---
+MLlib is a Spark implementation of some common machine learning algorithms and utilities,
+including classification, regression, clustering, collaborative
+filtering, and dimensionality reduction, as well as underlying optimization primitives:
-MLlib is a Spark implementation of some common machine learning (ML)
-functionality, as well associated tests and data generators. MLlib
-currently supports four common types of machine learning problem settings,
-namely classification, regression, clustering and collaborative filtering,
-as well as an underlying gradient descent optimization primitive and several
-linear algebra methods.
-
-# Available Methods
-The following links provide a detailed explanation of the methods and usage examples for each of them:
-
-* <a href="mllib-classification-regression.html">Classification and Regression</a>
- * Binary Classification
- * SVM (L1 and L2 regularized)
- * Logistic Regression (L1 and L2 regularized)
- * Linear Regression
- * Least Squares
- * Lasso
- * Ridge Regression
- * Decision Tree (for classification and regression)
-* <a href="mllib-clustering.html">Clustering</a>
- * k-Means
-* <a href="mllib-collaborative-filtering.html">Collaborative Filtering</a>
- * Matrix Factorization using Alternating Least Squares
-* <a href="mllib-optimization.html">Optimization</a>
- * Gradient Descent and Stochastic Gradient Descent
-* <a href="mllib-linear-algebra.html">Linear Algebra</a>
- * Singular Value Decomposition
- * Principal Component Analysis
-
-# Data Types
-
-Most MLlib algorithms operate on RDDs containing vectors. In Java and Scala, the
-[Vector](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) class is used to
-represent vectors. You can create either dense or sparse vectors using the
-[Vectors](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) factory.
-
-In Python, MLlib can take the following vector types:
-
-* [NumPy](http://www.numpy.org) arrays
-* Standard Python lists (e.g. `[1, 2, 3]`)
-* The MLlib [SparseVector](api/python/pyspark.mllib.linalg.SparseVector-class.html) class
-* [SciPy sparse matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html)
-
-For efficiency, we recommend using NumPy arrays over lists, and using the
-[CSC format](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
-for SciPy matrices, or MLlib's own SparseVector class.
-
-Several other simple data types are used throughout the library, e.g. the LabeledPoint
-class ([Java/Scala](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint),
-[Python](api/python/pyspark.mllib.regression.LabeledPoint-class.html)) for labeled data.
-
-# Dependencies
-MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself
-depends on native Fortran routines. You may need to install the
-[gfortran runtime library](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries)
-if it is not already present on your nodes. MLlib will throw a linking error if it cannot
-detect these libraries automatically.
+* [Basics](mllib-basics.html)
+ * data types
+ * summary statistics
+* Classification and regression
+ * [linear support vector machine (SVM)](mllib-linear-methods.html#linear-support-vector-machine-svm)
+ * [logistic regression](mllib-linear-methods.html#logistic-regression)
+ * [linear least squares, Lasso, and ridge regression](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
+ * [decision tree](mllib-decision-tree.html)
+ * [naive Bayes](mllib-naive-bayes.html)
+* [Collaborative filtering](mllib-collaborative-filtering.html)
+ * alternating least squares (ALS)
+* [Clustering](mllib-clustering.html)
+ * k-means
+* [Dimensionality reduction](mllib-dimensionality-reduction.html)
+ * singular value decomposition (SVD)
+ * principal component analysis (PCA)
+* [Optimization](mllib-optimization.html)
+ * stochastic gradient descent
+ * limited-memory BFGS (L-BFGS)
+
+MLlib is currently a *beta* component under active development.
+The APIs may change in future releases, and we will provide a migration guide between releases.
+
+## Dependencies
+
+MLlib uses two linear algebra packages: [Breeze](http://www.scalanlp.org/), which depends on
+[netlib-java](https://github.com/fommil/netlib-java), and
+[jblas](https://github.com/mikiobraun/jblas).
+`netlib-java` and `jblas` depend on native Fortran routines.
+You need to install the
+[gfortran runtime library](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries) if it is not
+already present on your nodes. MLlib will throw a linking error if it cannot detect these libraries
+automatically. Due to licensing issues, we do not include `netlib-java`'s native libraries in MLlib's
+dependency set. If no native library is available at runtime, you will see a warning message. To
+use native libraries from `netlib-java`, include the artifact
+`com.github.fommil.netlib:all:1.1.2` as a dependency of your project, or build your own (see
+[instructions](https://github.com/fommil/netlib-java/blob/master/README.md#machine-optimised-system-libraries)).
To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
+
+---
+
+## Migration guide
+
+### From 0.9 to 1.0
+
+In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
+breaking changes. If your data is sparse, please store it in a sparse format instead of dense to
+take advantage of sparsity in both storage and computation.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+We used to represent a feature vector by `Array[Double]`, which is replaced by
+[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) in v1.0. Algorithms that used
+to accept `RDD[Array[Double]]` now take
+`RDD[Vector]`. [`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint)
+is now a wrapper of `(Double, Vector)` instead of `(Double, Array[Double])`. Converting
+`Array[Double]` to `Vector` is straightforward:
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+val array: Array[Double] = ... // a double array
+val vector: Vector = Vectors.dense(array) // a dense vector
+{% endhighlight %}
+
+[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to create sparse vectors.
+
+*Note*: Scala imports `scala.collection.immutable.Vector` by default, so you have to import `org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+We used to represent a feature vector by `double[]`, which is replaced by
+[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) in v1.0. Algorithms that used
+to accept `RDD<double[]>` now take
+`RDD<Vector>`. [`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint)
+is now a wrapper of `(double, Vector)` instead of `(double, double[])`. Converting `double[]` to
+`Vector` is straightforward:
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+double[] array = ...; // a double array
+Vector vector = Vectors.dense(array); // a dense vector
+{% endhighlight %}
+
+[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to
+create sparse vectors.
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+We used to represent a labeled feature vector as a NumPy array, where the first entry corresponds to
+the label and the rest are features. This representation is replaced by the class
+[`LabeledPoint`](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html), which takes both
+dense and sparse feature vectors.
+
+{% highlight python %}
+from pyspark.mllib.linalg import SparseVector
+from pyspark.mllib.regression import LabeledPoint
+
+# Create a labeled point with a positive label and a dense feature vector.
+pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
+
+# Create a labeled point with a negative label and a sparse feature vector.
+neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
+{% endhighlight %}
+</div>
+</div>
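
The Python migration described in the diff above (label-first rows replaced by `LabeledPoint`) can be sketched in plain Python. This is only an illustration under stated assumptions: `split_labeled_row` and `to_sparse_parts` are hypothetical helper names, not part of MLlib, and the sketch deliberately avoids importing PySpark so that it runs on its own. It shows how a 0.9-style row breaks down into the `label`, dense `features`, and sparse `(size, indices, values)` pieces that the v1.0 constructors take.

```python
# Hypothetical helpers (not part of MLlib) illustrating the 0.9 -> 1.0 data layout change.


def split_labeled_row(row):
    """Split a 0.9-style row (label first, features after) into (label, features)."""
    values = [float(x) for x in row]
    if not values:
        raise ValueError("row must contain at least a label")
    return values[0], values[1:]


def to_sparse_parts(features):
    """Return (size, indices, values) describing the nonzero entries of a dense list."""
    indices = [i for i, v in enumerate(features) if v != 0.0]
    values = [features[i] for i in indices]
    return len(features), indices, values


# Old representation: label 1.0 followed by the features [1.0, 0.0, 3.0].
old_row = [1.0, 1.0, 0.0, 3.0]

label, features = split_labeled_row(old_row)
size, indices, values = to_sparse_parts(features)

# With PySpark available, these pieces map onto the classes used in the diff:
#   LabeledPoint(label, features)                             # dense features
#   LabeledPoint(label, SparseVector(size, indices, values))  # sparse features
```

For mostly-zero feature vectors, the sparse form stores only the nonzero entries, which is the storage and computation benefit the migration guide refers to.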