Diffstat (limited to 'docs/mllib-guide.md')
-rw-r--r-- | docs/mllib-guide.md | 27 |
1 files changed, 25 insertions, 2 deletions
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 1ac5cc13db..4236b0c8b6 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -7,8 +7,9 @@ title: Machine Learning Library (MLlib)
 MLlib is a Spark implementation of some common machine learning (ML)
 functionality, as well associated tests and data generators. MLlib
 currently supports four common types of machine learning problem settings,
-namely, binary classification, regression, clustering and collaborative
-filtering, as well as an underlying gradient descent optimization primitive.
+namely classification, regression, clustering and collaborative filtering,
+as well as an underlying gradient descent optimization primitive and several
+linear algebra methods.
 
 # Available Methods
 The following links provide a detailed explanation of the methods and usage examples for each of them:
@@ -32,6 +33,28 @@ The following links provide a detailed explanation of the methods and usage exam
 * Singular Value Decomposition
 * Principal Component Analysis
 
+# Data Types
+
+Most MLlib algorithms operate on RDDs containing vectors. In Java and Scala, the
+[Vector](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) class is used to
+represent vectors. You can create either dense or sparse vectors using the
+[Vectors](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) factory.
+
+In Python, MLlib can take the following vector types:
+
+* [NumPy](http://www.numpy.org) arrays
+* Standard Python lists (e.g. `[1, 2, 3]`)
+* The MLlib [SparseVector](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html) class
+* [SciPy sparse matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html)
+
+For efficiency, we recommend using NumPy arrays over lists, and using the
+[CSC format](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
+for SciPy matrices, or MLlib's own SparseVector class.
+
+Several other simple data types are used throughout the library, e.g. the LabeledPoint
+class ([Java/Scala](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint),
+[Python](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html)) for labeled data.
+
 # Dependencies
 MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself
 depends on native Fortran routines. You may need to install the
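
Reader's note on the Data Types text added in this diff: the dense and sparse vectors it mentions hold the same values and differ only in how entries are stored. The sketch below is plain Python using only the standard library. The helpers `sparse_from_dense`, `to_dense`, and `sparse_dot` are hypothetical names chosen for illustration, not MLlib or PySpark functions, and the `(size, indices, values)` layout is an assumption about a typical sparse vector rather than a description of how MLlib's SparseVector is implemented.

```python
# Illustrative only: hypothetical helpers, not part of MLlib or PySpark.

def sparse_from_dense(dense):
    """Return (size, indices, values), keeping only the nonzero entries."""
    indices = [i for i, x in enumerate(dense) if x != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(sparse):
    """Expand a (size, indices, values) triple back into a plain list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def sparse_dot(sparse, dense):
    """Dot product that visits only the stored (nonzero) entries."""
    size, indices, values = sparse
    if size != len(dense):
        raise ValueError("dimension mismatch")
    return sum(v * dense[i] for i, v in zip(indices, values))


dense_vec = [0.0, 1.0, 0.0, 5.5]             # dense form: every entry stored
sparse_vec = sparse_from_dense(dense_vec)    # sparse form: nonzeros only
weights = [2.0, 3.0, 4.0, 2.0]
result = sparse_dot(sparse_vec, weights)     # 1.0 * 3.0 + 5.5 * 2.0
```

Storing only the nonzero entries saves space and work when most entries are zero; for data that is mostly nonzero, the dense form is usually the simpler choice.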