path: root/docs/mllib-guide.md
author    Xiangrui Meng <meng@databricks.com>    2014-04-22 11:20:47 -0700
committer Patrick Wendell <pwendell@gmail.com>   2014-04-22 11:20:47 -0700
commit  26d35f3fd942761b0adecd1a720e1fa834db4de9 (patch)
tree    16e57e2ff01e7cd2d7a1a3c1f3bf98c9cf98a082 /docs/mllib-guide.md
parent  bf9d49b6d1f668b49795c2d380ab7d64ec0029da (diff)
[SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0
Preview: http://54.82.240.23:4000/mllib-guide.html

Table of contents:

* Basics
  * Data types
  * Summary statistics
* Classification and regression
  * linear support vector machine (SVM)
  * logistic regression
  * linear least squares, Lasso, and ridge regression
  * decision tree
  * naive Bayes
* Collaborative Filtering
  * alternating least squares (ALS)
* Clustering
  * k-means
* Dimensionality reduction
  * singular value decomposition (SVD)
  * principal component analysis (PCA)
* Optimization
  * stochastic gradient descent
  * limited-memory BFGS (L-BFGS)

Author: Xiangrui Meng <meng@databricks.com>

Closes #422 from mengxr/mllib-doc and squashes the following commits:

944e3a9 [Xiangrui Meng] merge master
f9fda28 [Xiangrui Meng] minor
9474065 [Xiangrui Meng] add alpha to ALS examples
928e630 [Xiangrui Meng] initialization_mode -> initializationMode
5bbff49 [Xiangrui Meng] add imports to labeled point examples
c17440d [Xiangrui Meng] fix python nb example
28f40dc [Xiangrui Meng] remove localhost:4000
369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
7dc95cc [Xiangrui Meng] update linear methods
053ad8a [Xiangrui Meng] add links to go back to the main page
abbbf7e [Xiangrui Meng] update ALS argument names
648283e [Xiangrui Meng] level down statistics
14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
8cd2441 [Xiangrui Meng] minor updates
186ab07 [Xiangrui Meng] update section names
6568d65 [Xiangrui Meng] update toc, level up lr and svm
162ee12 [Xiangrui Meng] rename section names
5c1e1b1 [Xiangrui Meng] minor
8aeaba1 [Xiangrui Meng] wrap long lines
6ce6a6f [Xiangrui Meng] add summary statistics to toc
5760045 [Xiangrui Meng] claim beta
cc604bf [Xiangrui Meng] remove classification and regression
92747b3 [Xiangrui Meng] make section titles consistent
e605dd6 [Xiangrui Meng] add LIBSVM loader
f639674 [Xiangrui Meng] add python section to migration guide
c82ffb4 [Xiangrui Meng] clean optimization
31660eb [Xiangrui Meng] update linear algebra and stat
0a40837 [Xiangrui Meng] first pass over linear methods
1fc8271 [Xiangrui Meng] update toc
906ed0a [Xiangrui Meng] add a python example to naive bayes
5f0a700 [Xiangrui Meng] update collaborative filtering
656d416 [Xiangrui Meng] update mllib-clustering
86e143a [Xiangrui Meng] remove data types section from main page
8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
d1b5cbf [Xiangrui Meng] merge master
72e4804 [Xiangrui Meng] one pass over tree guide
64f8995 [Xiangrui Meng] move decision tree guide to a separate file
9fca001 [Xiangrui Meng] add first version of linear algebra guide
53c9552 [Xiangrui Meng] update dependencies
f316ec2 [Xiangrui Meng] add migration guide
f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
182460f [Xiangrui Meng] add guide for naive Bayes
137fd1d [Xiangrui Meng] re-organize toc
a61e434 [Xiangrui Meng] update mllib's toc
Diffstat (limited to 'docs/mllib-guide.md')
-rw-r--r--  docs/mllib-guide.md  |  172
1 file changed, 115 insertions(+), 57 deletions(-)
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 0963a99881..c49f857d07 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -3,63 +3,121 @@ layout: global
title: Machine Learning Library (MLlib)
---
+MLlib is a Spark implementation of some common machine learning algorithms and utilities,
+including classification, regression, clustering, collaborative
+filtering, dimensionality reduction, as well as underlying optimization primitives:
-MLlib is a Spark implementation of some common machine learning (ML)
-functionality, as well associated tests and data generators. MLlib
-currently supports four common types of machine learning problem settings,
-namely classification, regression, clustering and collaborative filtering,
-as well as an underlying gradient descent optimization primitive and several
-linear algebra methods.
-
-# Available Methods
-The following links provide a detailed explanation of the methods and usage examples for each of them:
-
-* <a href="mllib-classification-regression.html">Classification and Regression</a>
- * Binary Classification
- * SVM (L1 and L2 regularized)
- * Logistic Regression (L1 and L2 regularized)
- * Linear Regression
- * Least Squares
- * Lasso
- * Ridge Regression
- * Decision Tree (for classification and regression)
-* <a href="mllib-clustering.html">Clustering</a>
- * k-Means
-* <a href="mllib-collaborative-filtering.html">Collaborative Filtering</a>
- * Matrix Factorization using Alternating Least Squares
-* <a href="mllib-optimization.html">Optimization</a>
- * Gradient Descent and Stochastic Gradient Descent
-* <a href="mllib-linear-algebra.html">Linear Algebra</a>
- * Singular Value Decomposition
- * Principal Component Analysis
-
-# Data Types
-
-Most MLlib algorithms operate on RDDs containing vectors. In Java and Scala, the
-[Vector](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) class is used to
-represent vectors. You can create either dense or sparse vectors using the
-[Vectors](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) factory.
-
-In Python, MLlib can take the following vector types:
-
-* [NumPy](http://www.numpy.org) arrays
-* Standard Python lists (e.g. `[1, 2, 3]`)
-* The MLlib [SparseVector](api/python/pyspark.mllib.linalg.SparseVector-class.html) class
-* [SciPy sparse matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html)
-
-For efficiency, we recommend using NumPy arrays over lists, and using the
-[CSC format](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
-for SciPy matrices, or MLlib's own SparseVector class.
-
-Several other simple data types are used throughout the library, e.g. the LabeledPoint
-class ([Java/Scala](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint),
-[Python](api/python/pyspark.mllib.regression.LabeledPoint-class.html)) for labeled data.
-
-# Dependencies
-MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself
-depends on native Fortran routines. You may need to install the
-[gfortran runtime library](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries)
-if it is not already present on your nodes. MLlib will throw a linking error if it cannot
-detect these libraries automatically.
+* [Basics](mllib-basics.html)
+ * data types
+ * summary statistics
+* Classification and regression
+ * [linear support vector machine (SVM)](mllib-linear-methods.html#linear-support-vector-machine-svm)
+ * [logistic regression](mllib-linear-methods.html#logistic-regression)
+ * [linear least squares, Lasso, and ridge regression](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
+ * [decision tree](mllib-decision-tree.html)
+ * [naive Bayes](mllib-naive-bayes.html)
+* [Collaborative filtering](mllib-collaborative-filtering.html)
+ * alternating least squares (ALS)
+* [Clustering](mllib-clustering.html)
+ * k-means
+* [Dimensionality reduction](mllib-dimensionality-reduction.html)
+ * singular value decomposition (SVD)
+ * principal component analysis (PCA)
+* [Optimization](mllib-optimization.html)
+ * stochastic gradient descent
+ * limited-memory BFGS (L-BFGS)
+
+MLlib is currently a *beta* component under active development.
+The APIs may change in future releases, and we will provide a migration guide between releases.
+
+## Dependencies
+
+MLlib uses two linear algebra packages: [Breeze](http://www.scalanlp.org/), which depends on
+[netlib-java](https://github.com/fommil/netlib-java), and
+[jblas](https://github.com/mikiobraun/jblas).
+Both `netlib-java` and `jblas` depend on native Fortran routines.
+You need to install the
+[gfortran runtime library](https://github.com/mikiobraun/jblas/wiki/Missing-Libraries) if it is not
+already present on your nodes. MLlib will throw a linking error if it cannot detect these libraries
+automatically. Due to license issues, we do not include `netlib-java`'s native libraries in MLlib's
+dependency set. If no native library is available at runtime, you will see a warning message. To
+use native libraries from `netlib-java`, please include artifact
+`com.github.fommil.netlib:all:1.1.2` as a dependency of your project or build your own (see
+[instructions](https://github.com/fommil/netlib-java/blob/master/README.md#machine-optimised-system-libraries)).
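+
+As a sketch, with sbt those Maven coordinates translate to a library dependency like the
+following (the exact build setup depends on your project; consult the `netlib-java` README
+for machine-optimised variants):
+
+{% highlight scala %}
+// build.sbt (sketch): pull in netlib-java's native library bundle
+libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2"
+{% endhighlight %}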
To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
+
+---
+
+## Migration guide
+
+### From 0.9 to 1.0
+
+In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
+breaking changes. If your data is sparse, please store it in a sparse format instead of dense to
+take advantage of sparsity in both storage and computation.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+We used to represent a feature vector by `Array[Double]`, which is replaced by
+[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) in v1.0. Algorithms that used
+to accept `RDD[Array[Double]]` now take
+`RDD[Vector]`. [`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint)
+is now a wrapper of `(Double, Vector)` instead of `(Double, Array[Double])`. Converting
+`Array[Double]` to `Vector` is straightforward:
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+val array: Array[Double] = ... // a double array
+val vector: Vector = Vectors.dense(array) // a dense vector
+{% endhighlight %}
+
+[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to create sparse vectors.
+
+*Note*. Scala imports `scala.collection.immutable.Vector` by default, so you have to import `org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+We used to represent a feature vector by `double[]`, which is replaced by
+[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) in v1.0. Algorithms that used
+to accept `RDD<double[]>` now take
+`RDD<Vector>`. [`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint)
+is now a wrapper of `(double, Vector)` instead of `(double, double[])`. Converting `double[]` to
+`Vector` is straightforward:
+
+{% highlight java %}
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.Vectors;
+
+double[] array = ... // a double array
+Vector vector = Vectors.dense(array); // a dense vector
+{% endhighlight %}
+
+[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to
+create sparse vectors.
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+We used to represent a labeled feature vector in a NumPy array, where the first entry corresponds to
+the label and the rest are features. This representation is replaced by class
+[`LabeledPoint`](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html), which takes both
+dense and sparse feature vectors.
+
+{% highlight python %}
+from pyspark.mllib.linalg import SparseVector
+from pyspark.mllib.regression import LabeledPoint
+
+# Create a labeled point with a positive label and a dense feature vector.
+pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
+
+# Create a labeled point with a negative label and a sparse feature vector.
+neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
+{% endhighlight %}
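+
+To make the sparse form concrete: a `SparseVector(size, indices, values)` stores only the
+nonzero entries. The plain-Python sketch below (no `pyspark` required; the helper
+`sparse_to_dense` is illustrative only, not part of the MLlib API) expands such a triple into
+the dense list it represents:
+
+{% highlight python %}
+# Expand a (size, indices, values) sparse triple into a dense list.
+# SparseVector(3, [0, 2], [1.0, 3.0]) represents [1.0, 0.0, 3.0]:
+# entry 0 is 1.0, entry 2 is 3.0, and the remaining entry is 0.0.
+def sparse_to_dense(size, indices, values):
+    dense = [0.0] * size
+    for i, v in zip(indices, values):
+        dense[i] = v
+    return dense
+
+print(sparse_to_dense(3, [0, 2], [1.0, 3.0]))  # [1.0, 0.0, 3.0]
+{% endhighlight %}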
+</div>
+</div>