author    Ameet Talwalkar <atalwalkar@gmail.com>    2013-09-05 21:06:50 -0700
committer Ameet Talwalkar <atalwalkar@gmail.com>    2013-09-05 21:06:50 -0700
commit    d52edfa7539af4e8d439282bccdc81b3ea657f10 (patch)
tree      ff8cee34106acb6551603731b35245d84d7665bd /docs/mllib-guide.md
parent    a54786678fb4f3663f90ecc063af3b3028aff650 (diff)
updated content
Diffstat (limited to 'docs/mllib-guide.md')
-rw-r--r--    docs/mllib-guide.md    148
1 file changed, 147 insertions(+), 1 deletion(-)
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index c897f8b36c..bb896c0897 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -3,4 +3,150 @@ layout: global
title: Machine Learning Library (MLlib)
---
-Coming soon.
+MLlib is a Spark implementation of some common ML functionality, as well as
+associated unit tests and data generators. MLlib currently supports four
+common types of machine learning problem settings, namely binary
+classification, regression, clustering and collaborative filtering, as well as
+an underlying gradient descent optimization primitive. This guide outlines the
+functionality supported in MLlib and provides an example of invoking it.
+
+# Binary Classification
+
+Binary classification is a supervised learning problem in which we want to
+classify entities into one of two distinct categories or labels, e.g.,
+predicting whether or not emails are spam. This problem involves executing a
+learning *algorithm* on a set of *labeled* examples, i.e., a set of entities
+represented via (numerical) features along with underlying category labels.
+The algorithm returns a trained *model* that can predict the label for new
+entities for which the underlying label is unknown.
+
+MLlib currently supports two standard model families for binary classification,
+namely [Linear Support Vector Machines
+(SVMs)](http://en.wikipedia.org/wiki/Support_vector_machine) and [Logistic
+Regression](http://en.wikipedia.org/wiki/Logistic_regression), along with [L1
+and L2 regularized](http://en.wikipedia.org/wiki/Regularization_(mathematics))
+variants of each model family. The training algorithms all leverage an
+underlying gradient descent primitive (described
+[below](#gradient-descent-primitive)), and take as input a regularization
+parameter (*regParam*) along with various parameters associated with gradient
+descent (*stepSize*, *numIterations*, *miniBatchFraction*).
+
+The following code snippet illustrates how to load a sample dataset, execute a
+training algorithm on this training data, and make predictions with the
+resulting model to compute the training error.
+
+    import org.apache.spark.SparkContext
+    import org.apache.spark.mllib.classification.SVMWithSGD
+    import org.apache.spark.mllib.regression.LabeledPoint
+
+    // Load and parse the data file. Each line holds a 0/1 label followed by
+    // space-separated numerical features. A SparkContext `sc` is assumed to
+    // exist already (e.g., in spark-shell).
+    val data = sc.textFile("sample_wiki_ngrams.txt")
+    val parsedData = data.map { line =>
+      val parts = line.split(' ')
+      LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
+    }
+
+    // Run the training algorithm, configuring the parameters of the
+    // underlying gradient descent optimizer.
+    val svmAlg = new SVMWithSGD()
+    svmAlg.optimizer.setNumIterations(200)
+      .setStepSize(1.0)
+      .setRegParam(0.1)
+      .setMiniBatchFraction(1.0)
+    val model = svmAlg.run(parsedData)
+
+    // Evaluate the model on the training examples and compute the training error.
+    val labelAndPreds = parsedData.map { point =>
+      val prediction = model.predict(point.features)
+      (point.label, prediction)
+    }
+    val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
+    println("trainError = " + trainErr)
+
+The `SVMWithSGD` algorithm performs L2 regularization by default. If we want to
+train an L1 regularized variant of the SVM instead, we can swap in a different
+updater:
+
+    import org.apache.spark.mllib.optimization.L1Updater
+
+    svmAlg.optimizer.setUpdater(new L1Updater)
+    val modelL1 = svmAlg.run(parsedData)
+
+# Linear Regression
+
+Linear regression is another classical supervised learning setting. In this
+problem, each entity is associated with a real-valued label (as opposed to a
+binary label as in binary classification), and we want to predict labels as
+closely as possible given numerical features representing entities. MLlib
+supports linear regression as well as L1
+([lasso](http://en.wikipedia.org/wiki/Lasso_(statistics)#Lasso_method)) and L2
+([ridge](http://en.wikipedia.org/wiki/Ridge_regression)) regularized variants.
+The regression algorithms in MLlib also leverage the underlying gradient
+descent primitive (described [below](#gradient-descent-primitive)), and have
+the same parameters as the binary classification algorithms described above.
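+
+Since the regression algorithms share the interface shown in the binary
+classification example, a lasso model can be trained in much the same way. The
+following is a minimal sketch (it assumes the `LassoWithSGD` class and reuses
+`parsedData` from above, now treating its labels as real-valued):
+
+    import org.apache.spark.mllib.regression.LassoWithSGD
+
+    // Train an L1 regularized (lasso) linear model.
+    val lassoAlg = new LassoWithSGD()
+    lassoAlg.optimizer.setNumIterations(200)
+      .setStepSize(1.0)
+      .setRegParam(0.1)
+      .setMiniBatchFraction(1.0)
+    val lassoModel = lassoAlg.run(parsedData)
+
+    // Compute the mean squared training error.
+    val MSE = parsedData.map { p =>
+      val err = p.label - lassoModel.predict(p.features)
+      err * err
+    }.reduce(_ + _) / parsedData.count
+    println("training MSE = " + MSE)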
+
+# Clustering
+
+Clustering is an unsupervised learning problem whereby we aim to group subsets
+of entities with one another based on some notion of similarity. Clustering is
+often used for exploratory analysis and/or as a component of a hierarchical
+supervised learning pipeline (in which distinct classifiers or regression
+models are trained for each cluster). MLlib supports
+[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, arguably
+the most commonly used clustering algorithm, which groups the data points into
+*k* clusters. The implementation in MLlib has the following parameters (a short
+usage sketch follows the list):
+
+* *k* is the number of clusters.
+* *maxIterations* is the maximum number of iterations to run.
+* *initializationMode* specifies either random initialization or
+initialization via a parallelized variant of the
+[k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method.
+* *runs* is the number of times to run the k-means algorithm (k-means is not
+guaranteed to find a globally optimal solution, and when run multiple times on
+a given dataset, the algorithm returns the best clustering result).
+* *initializationSteps* determines the number of steps in the k-means++ algorithm.
+* *epsilon* determines the distance threshold within which we consider k-means to have converged.
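+
+As a usage sketch (assuming the `KMeans.train` helper and its model's
+`computeCost` method, and an illustratively named text file of space-separated
+numerical vectors), one might write:
+
+    import org.apache.spark.mllib.clustering.KMeans
+
+    // Parse each line into an array of doubles.
+    val points = sc.textFile("kmeans_data.txt")
+      .map(line => line.split(' ').map(_.toDouble))
+
+    // Cluster the points into 2 clusters, using at most 20 iterations.
+    val clusters = KMeans.train(points, 2, 20)
+
+    // Sum of squared distances of points to their nearest center, a common
+    // measure of clustering quality.
+    val cost = clusters.computeCost(points)
+    println("Cost = " + cost)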
+
+# Collaborative Filtering
+
+[Collaborative
+filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
+is commonly used for recommender systems. These techniques aim to fill in the
+missing entries of a user-product association matrix. MLlib currently supports
+model-based collaborative filtering, in which users and products are described
+by a small set of latent factors that can be used to predict missing entries.
+In particular, we implement the [alternating least squares
+(ALS)](http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf)
+algorithm to learn these latent factors. The implementation in MLlib has the
+following parameters:
+
+* *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
+* *rank* is the number of latent factors in our model.
+* *iterations* is the number of iterations to run.
+* *lambda* specifies the regularization parameter in ALS.
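+
+A minimal sketch of training a factorization model (assuming the `ALS.train`
+helper, a `Rating(user, product, rating)` case class, and a file of
+comma-separated user,product,rating triples) might look like:
+
+    import org.apache.spark.mllib.recommendation.{ALS, Rating}
+
+    // Parse each line into a Rating(user, product, rating).
+    val ratings = sc.textFile("ratings.csv").map { line =>
+      val Array(user, product, rating) = line.split(',')
+      Rating(user.toInt, product.toInt, rating.toDouble)
+    }
+
+    // Learn rank 10 latent factors using 20 iterations of ALS with
+    // regularization parameter lambda = 0.01.
+    val alsModel = ALS.train(ratings, 10, 20, 0.01)
+
+    // Predict the rating a given user would assign to a given product.
+    val r = ratings.first()
+    val predicted = alsModel.predict(r.user, r.product)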
+
+# Gradient Descent Primitive
+
+[Gradient descent](http://en.wikipedia.org/wiki/Gradient_descent) (along with
+stochastic variants thereof) is a first-order optimization method that is
+well-suited for large-scale and distributed computation. Gradient descent
+methods aim to find a local minimum of a function by iteratively taking steps
+in the direction of the negative gradient of the function at the current point,
+i.e., the current parameter value. Gradient descent is included as a low-level
+primitive in MLlib, upon which various ML algorithms are developed, and has the
+following parameters:
+
+* *gradient* is a class that computes the stochastic gradient of the function
+being optimized, i.e., with respect to a single training example, at the
+current parameter value. MLlib includes gradient classes for common loss
+functions, e.g., hinge, logistic, least-squares. The gradient class takes as
+input a training example, its label, and the current parameter value.
+* *updater* is a class that updates weights in each iteration of gradient
+descent. MLlib includes updaters for cases without regularization, as well as
+L1 and L2 regularizers.
+* *stepSize* is a scalar value denoting the initial step size for gradient
+descent. All updaters in MLlib use a step size at the t-th step equal to
+`stepSize / sqrt(t)`.
+* *numIterations* is the number of iterations to run.
+* *regParam* is the regularization parameter when using L1 or L2 regularization.
+* *miniBatchFraction* is the fraction of the data used to compute the gradient
+at each iteration.
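+
+In MLlib these parameters are set through the `optimizer` member of the
+algorithms built on this primitive, as in the SVM example above. A sketch
+(assuming `LogisticRegressionWithSGD` follows the same pattern, and reusing
+`parsedData` from above):
+
+    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
+    import org.apache.spark.mllib.optimization.SquaredL2Updater
+
+    // Configure every gradient descent parameter explicitly. The gradient
+    // itself is fixed by the choice of algorithm (logistic loss here).
+    val lrAlg = new LogisticRegressionWithSGD()
+    lrAlg.optimizer.setNumIterations(200)
+      .setStepSize(1.0)            // initial step size; decays as stepSize / sqrt(t)
+      .setRegParam(0.1)            // regularization parameter for the L2 updater
+      .setMiniBatchFraction(0.1)   // use 10% of the data per gradient computation
+      .setUpdater(new SquaredL2Updater)
+    val lrModel = lrAlg.run(parsedData)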