aboutsummaryrefslogtreecommitdiff
path: root/docs/mllib-guide.md
diff options
context:
space:
mode:
Diffstat (limited to 'docs/mllib-guide.md')
-rw-r--r--docs/mllib-guide.md27
1 files changed, 25 insertions, 2 deletions
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 1ac5cc13db..4236b0c8b6 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -7,8 +7,9 @@ title: Machine Learning Library (MLlib)
MLlib is a Spark implementation of some common machine learning (ML)
functionality, as well associated tests and data generators. MLlib
currently supports four common types of machine learning problem settings,
-namely, binary classification, regression, clustering and collaborative
-filtering, as well as an underlying gradient descent optimization primitive.
+namely classification, regression, clustering and collaborative filtering,
+as well as an underlying gradient descent optimization primitive and several
+linear algebra methods.
# Available Methods
The following links provide a detailed explanation of the methods and usage examples for each of them:
@@ -32,6 +33,28 @@ The following links provide a detailed explanation of the methods and usage exam
* Singular Value Decomposition
* Principal Component Analysis
+# Data Types
+
+Most MLlib algorithms operate on RDDs containing vectors. In Java and Scala, the
+[Vector](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) class is used to
+represent vectors. You can create either dense or sparse vectors using the
+[Vectors](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) factory.
+
+In Python, MLlib can take the following vector types:
+
+* [NumPy](http://www.numpy.org) arrays
+* Standard Python lists (e.g. `[1, 2, 3]`)
+* The MLlib [SparseVector](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html) class
+* [SciPy sparse matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html)
+
+For efficiency, we recommend using NumPy arrays over lists, and using the
+[CSC format](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
+for SciPy matrices, or MLlib's own SparseVector class.
+
+Several other simple data types are used throughout the library, e.g. the LabeledPoint
+class ([Java/Scala](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint),
+[Python](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html)) for labeled data.
+
# Dependencies
MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself
depends on native Fortran routines. You may need to install the