From d52edfa7539af4e8d439282bccdc81b3ea657f10 Mon Sep 17 00:00:00 2001
From: Ameet Talwalkar
Date: Thu, 5 Sep 2013 21:06:50 -0700
Subject: updated content

---
 docs/mllib-guide.md | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 147 insertions(+), 1 deletion(-)

(limited to 'docs/mllib-guide.md')

diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index c897f8b36c..bb896c0897 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -3,4 +3,150 @@
 layout: global
 title: Machine Learning Library (MLlib)
 ---
-Coming soon.

MLlib is a Spark implementation of some common machine learning (ML)
functionality, as well as associated unit tests and data generators. MLlib
currently supports four common types of machine learning problem settings,
namely binary classification, regression, clustering and collaborative
filtering, as well as an underlying gradient descent optimization primitive.
This guide outlines the functionality supported in MLlib and provides examples
of invoking it.

# Binary Classification

Binary classification is a supervised learning problem in which we want to
classify entities into one of two distinct categories or labels, e.g.,
predicting whether or not emails are spam. This problem involves executing a
learning *Algorithm* on a set of *labeled* examples, i.e., a set of entities
represented via (numerical) features along with underlying category labels.
The algorithm returns a trained *Model* that can predict the label for new
entities for which the underlying label is unknown.

MLlib currently supports two standard model families for binary classification,
namely [Linear Support Vector Machines
(SVMs)](http://en.wikipedia.org/wiki/Support_vector_machine) and [Logistic
Regression](http://en.wikipedia.org/wiki/Logistic_regression), along with [L1
and L2 regularized](http://en.wikipedia.org/wiki/Regularization_(mathematics))
variants of each model family. The training algorithms all leverage an
underlying gradient descent primitive (described
[below](#gradient-descent-primitive)), and take as input a regularization
parameter (*regParam*) along with various parameters associated with gradient
descent (*stepSize*, *numIterations*, *miniBatchFraction*).

The following code snippet illustrates how to load a sample dataset, execute a
training algorithm on this training data, and make predictions with the
resulting model to compute the training error.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Load and parse the data file
    val data = sc.textFile("sample_wiki_ngrams.txt")
    val parsedData = data.map(line => {
      val parts = line.split(' ')
      LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
    })

    // Run the training algorithm to build the model
    val svmAlg = new SVMWithSGD()
    svmAlg.optimizer.setNumIterations(200)
      .setStepSize(1.0)
      .setRegParam(0.1)
      .setMiniBatchFraction(1.0)
    val model = svmAlg.run(parsedData)

    // Evaluate the model on training examples and compute the training error
    val labelAndPreds = parsedData.map(r => {
      val prediction = model.predict(r.features)
      (r.label, prediction)
    })
    val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
    println("trainError = " + trainErr)

The `SVMWithSGD` algorithm performs L2 regularization by default. If we
instead want to train an L1 regularized variant of the SVM, we can swap in the
L1 updater as follows:

    import org.apache.spark.mllib.optimization.L1Updater

    svmAlg.optimizer.setUpdater(new L1Updater)
    val modelL1 = svmAlg.run(parsedData)

# Linear Regression

Linear regression is another classical supervised learning setting. In this
problem, each entity is associated with a real-valued label (as opposed to a
binary label as in binary classification), and we want to predict labels as
closely as possible given numerical features representing entities. MLlib
supports linear regression as well as L1
([lasso](http://en.wikipedia.org/wiki/Lasso_(statistics)#Lasso_method)) and L2
([ridge](http://en.wikipedia.org/wiki/Ridge_regression)) regularized variants.
The regression algorithms in MLlib also leverage the underlying gradient
descent primitive (described [below](#gradient-descent-primitive)), and have
the same parameters as the binary classification algorithms described above.

# Clustering

Clustering is an unsupervised learning problem whereby we aim to group subsets
of entities with one another based on some notion of similarity. Clustering is
often used for exploratory analysis and/or as a component of a hierarchical
supervised learning pipeline (in which distinct classifiers or regression
models are trained for each cluster). MLlib supports
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, arguably
the most commonly used clustering approach, which groups the data points into
*k* clusters. The implementation in MLlib has the following parameters (a
usage sketch follows the list):

* *k* is the number of clusters.
* *maxIterations* is the maximum number of iterations to run.
* *initializationMode* specifies either random initialization or
initialization via a parallelized variant of the
[k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method.
* *runs* is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
* *initializationSteps* determines the number of steps in the k-means++ algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
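As a rough usage sketch, the snippet below clusters a dataset of
space-separated numeric vectors into two groups and reports the within-set sum
of squared errors. The `KMeans.train` helper, the `computeCost` method on the
resulting model, and the `kmeans_data.txt` file name are assumptions made here
for illustration rather than part of the original guide, so the details may
differ from the actual API.

    import org.apache.spark.mllib.clustering.KMeans

    // Load and parse the data (each line holds space-separated features);
    // the file name is a placeholder.
    val kmeansData = sc.textFile("kmeans_data.txt")
    val kmeansParsed = kmeansData.map(_.split(' ').map(_.toDouble)).cache()

    // Cluster the data into two classes using k-means
    val numClusters = 2
    val numKMeansIterations = 20
    val clusters = KMeans.train(kmeansParsed, numClusters, numKMeansIterations)

    // Evaluate clustering quality via the within-set sum of squared errors
    val wssse = clusters.computeCost(kmeansParsed)
    println("Within Set Sum of Squared Errors = " + wssse)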
# Collaborative Filtering

[Collaborative
filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
is commonly used for recommender systems. These techniques aim to fill in the
missing entries of a user-product association matrix. MLlib currently supports
model-based collaborative filtering, in which users and products are described
by a small set of latent factors that can be used to predict missing entries.
In particular, we implement the [alternating least squares
(ALS)](http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf)
algorithm to learn these latent factors. The implementation in MLlib has the
following parameters (a usage sketch follows the list):

* *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
* *rank* is the number of latent factors in our model.
* *iterations* is the number of iterations to run.
* *lambda* specifies the regularization parameter in ALS.
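The following sketch illustrates how ALS training might be invoked: it builds a
low-rank factorization model from ratings stored as comma-separated
`user,product,rating` triples and then predicts a single missing entry. The
`ALS.train` helper, the `Rating` class, the `predict` method, and the
`als_ratings.txt` file name are assumptions for illustration and may not match
the actual API exactly.

    import org.apache.spark.mllib.recommendation.ALS
    import org.apache.spark.mllib.recommendation.Rating

    // Load and parse ratings of the form "user,product,rating";
    // the file name is a placeholder.
    val ratingsData = sc.textFile("als_ratings.txt")
    val ratings = ratingsData.map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // Build a recommendation model with 10 latent factors
    val rank = 10
    val numALSIterations = 20
    val lambda = 0.01
    val alsModel = ALS.train(ratings, rank, numALSIterations, lambda)

    // Predict the (missing) rating that user 1 would give product 2
    val predictedRating = alsModel.predict(1, 2)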
# Gradient Descent Primitive

[Gradient descent](http://en.wikipedia.org/wiki/Gradient_descent) (along with
stochastic variants thereof) is a first-order optimization method that is
well-suited for large-scale and distributed computation. Gradient descent
methods aim to find a local minimum of a function by iteratively taking steps
in the direction of the negative gradient of the function at the current point,
i.e., the current parameter value. Gradient descent is included as a low-level
primitive in MLlib, upon which various ML algorithms are developed, and has the
following parameters (a configuration sketch follows the list):

* *gradient* is a class that computes the stochastic gradient of the function
being optimized, i.e., with respect to a single training example, at the
current parameter value. MLlib includes gradient classes for common loss
functions, e.g., hinge, logistic, least-squares. The gradient class takes as
input a training example, its label, and the current parameter value.
* *updater* is a class that updates weights in each iteration of gradient
descent. MLlib includes updaters for cases without regularization, as well as
L1 and L2 regularizers.
* *stepSize* is a scalar value denoting the initial step size for gradient
descent. All updaters in MLlib use a step size at the t-th step equal to
stepSize / sqrt(t).
* *numIterations* is the number of iterations to run.
* *regParam* is the regularization parameter when using L1 or L2 regularization.
* *miniBatchFraction* is the fraction of the data used to compute the gradient
at each iteration.
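To make the mapping between these parameters and the training APIs concrete,
the following sketch revisits the SVM example from the binary classification
section and notes which optimizer setter controls which parameter. It assumes
the same `parsedData` RDD built earlier, and the parameter values simply mirror
the earlier snippet; the gradient itself is supplied implicitly by the
algorithm (hinge loss in the case of `SVMWithSGD`).

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.optimization.L1Updater

    // Configure the gradient descent primitive through the algorithm's optimizer
    val alg = new SVMWithSGD()
    alg.optimizer
      .setNumIterations(200)      // numIterations
      .setStepSize(1.0)           // stepSize, decayed as stepSize / sqrt(t)
      .setRegParam(0.1)           // regParam
      .setMiniBatchFraction(1.0)  // miniBatchFraction
      .setUpdater(new L1Updater)  // updater, here the L1 regularizer
    val tunedModel = alg.run(parsedData)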