path: root/docs/mllib-guide.md
author    Ameet Talwalkar <atalwalkar@gmail.com>  2013-09-08 17:39:08 -0700
committer Ameet Talwalkar <atalwalkar@gmail.com>  2013-09-08 17:39:08 -0700
commit    5ac62dbbd0d604d699017a5956f3c79172e09896 (patch)
tree      357788ae5735d9aadf4ce998c993e1345a5117f1 /docs/mllib-guide.md
parent    d52edfa7539af4e8d439282bccdc81b3ea657f10 (diff)
updates based on comments to PR
Diffstat (limited to 'docs/mllib-guide.md')
-rw-r--r--  docs/mllib-guide.md | 132
1 file changed, 83 insertions, 49 deletions
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index bb896c0897..35850bdc95 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -3,13 +3,13 @@ layout: global
title: Machine Learning Library (MLlib)
---
-MLlib is a Spark implementation of some common ML functionality, as well
-associated unit tests and data generators. MLlib currently supports four
-common types of machine learning problem settings, namely, binary
-classification, regression, clustering and collaborative filtering, as well as an
-underlying gradient descent optimization primitive. This guide will outline
-the functionality supported in MLlib and also provides an example of invoking
-MLlib.
+MLlib is a Spark implementation of some common machine learning (ML)
+functionality, as well as associated unit tests and data generators. MLlib
+currently supports four common types of machine learning problem settings,
+namely, binary classification, regression, clustering and collaborative
+filtering, as well as an underlying gradient descent optimization primitive.
+This guide outlines the functionality supported in MLlib and also provides
+an example of invoking MLlib.
# Binary Classification
@@ -33,43 +33,67 @@ parameter (*regParam*) along with various parameters associated with gradient
descent (*stepSize*, *numIterations*, *miniBatchFraction*).
The following code snippet illustrates how to load a sample dataset, execute a
-training algorithm on this training data, and to make predictions with the
-resulting model to compute the training error.
-
- import org.apache.spark.SparkContext
- import org.apache.spark.mllib.classification.SVMWithSGD
- import org.apache.spark.mllib.regression.LabeledPoint
-
- // Load and parse the data file
- val data = sc.textFile("sample_wiki_ngrams.txt")
- val parsedData = data.map(line => {
- val parts = line.split(' ')
- LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
- })
-
- // Run training algorithm
- val svmAlg = new SVMWithSGD()
- svmAlg.optimizer.setNumIterations(200)
- .setStepSize(1.0)
- .setRegParam(0.1)
- .setMiniBatchFraction(1.0)
- val model = svmAlg.run(parsedData)
-
- // Evaluate model on training examples and compute training error
- val labelAndPreds = parsedData.map(r => {
- val prediction = model.predict(r.features)
- (r.label, prediction)
- })
- val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
- println("trainError = " + trainErr)
-
-The `SVMWithSGD` algorithm performs L2 regularization by default,
-and if we want to generate an L1 regularized variant of SVMs, we can do the
-following:
-
- import org.apache.spark.mllib.optimization.L1Updater
- svmAlg.optimizer.setUpdater(new L1Updater)
- val modelL1 = svmAlg.run(parsedData)
+training algorithm on this training data using a static method in the algorithm
+object, and make predictions with the resulting model to compute the training
+error.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.classification.SVMWithSGD
+import org.apache.spark.mllib.regression.LabeledPoint
+
+// Load and parse the data file
+val data = sc.textFile("sample_wiki_ngrams.txt")
+val parsedData = data.map(line => {
+ val parts = line.split(' ')
+ LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
+})
+
+// Run training algorithm
+val stepSizeVal = 1.0
+val regParamVal = 0.1
+val numIterationsVal = 200
+val miniBatchFractionVal = 1.0
+val model = SVMWithSGD.train(
+ parsedData,
+ numIterationsVal,
+ stepSizeVal,
+ regParamVal,
+ miniBatchFractionVal)
+
+// Evaluate model on training examples and compute training error
+val labelAndPreds = parsedData.map(r => {
+ val prediction = model.predict(r.features)
+ (r.label, prediction)
+})
+val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
+println("trainError = " + trainErr)
+{% endhighlight %}
+
+The `SVMWithSGD` algorithm performs L2 regularization by default. If we want to
+configure this algorithm to generate an L1 regularized variant of SVMs, we can
+use the builder design pattern as follows:
+
+{% highlight scala %}
+import org.apache.spark.mllib.optimization.L1Updater
+
+val svmAlg = new SVMWithSGD()
+svmAlg.optimizer.setNumIterations(200)
+ .setStepSize(1.0)
+ .setRegParam(0.1)
+ .setMiniBatchFraction(1.0)
+svmAlg.optimizer.setUpdater(new L1Updater)
+val modelL1 = svmAlg.run(parsedData)
+{% endhighlight %}
+
+Both of the code snippets above can be executed in `spark-shell` to generate a
+classifier for the provided dataset. Moreover, note that static methods and
+builder patterns, similar to the ones displayed above, are available for all
+algorithms in MLlib.
+
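+As an illustrative sketch, the analogous static method could be used to train a
+logistic regression classifier on the same `parsedData` RDD (assuming
+`LogisticRegressionWithSGD` exposes a `train(input, numIterations)` overload;
+the labels would need to be 0/1 for logistic regression):
+
+{% highlight scala %}
+import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
+
+// Train a logistic regression model with a fixed number of iterations,
+// leaving the remaining gradient descent parameters at their defaults.
+val lrModel = LogisticRegressionWithSGD.train(parsedData, 200)
+{% endhighlight %}
+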
+[SVMWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.SVMWithSGD)
+
+[LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD)
# Linear Regression
@@ -84,28 +108,34 @@ The regression algorithms in MLlib also leverage the underlying gradient
descent primitive (described [below](#gradient-descent-primitive)), and have
the same parameters as the binary classification algorithms described above.
+[RidgeRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
+
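+As a brief sketch, ridge regression could be invoked in the same way as the
+classification algorithms above (the file name and parameter values here are
+illustrative, and the `train` signature is assumed to mirror `SVMWithSGD.train`):
+
+{% highlight scala %}
+import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}
+
+// Parse whitespace-separated "label feature1 feature2 ..." rows.
+val regData = sc.textFile("sample_regression_data.txt").map { line =>
+  val parts = line.split(' ')
+  LabeledPoint(parts(0).toDouble, parts.tail.map(_.toDouble).toArray)
+}
+
+// Arguments: numIterations, stepSize, regParam, miniBatchFraction
+val regModel = RidgeRegressionWithSGD.train(regData, 100, 1.0, 0.1, 1.0)
+{% endhighlight %}
+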
# Clustering
Clustering is an unsupervised learning problem whereby we aim to group subsets
of entities with one another based on some notion of similarity. Clustering is
-often used for exploratary analysis and/or as a component of a hierarchical
+often used for exploratory analysis and/or as a component of a hierarchical
supervised learning pipeline (in which distinct classifiers or regression
models are trained for each cluster). MLlib supports
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, arguably
the most commonly used clustering approach that clusters the data points into
-*k* clusters. The implementation in MLlib has the following parameters:
+*k* clusters. The MLlib implementation includes a parallelized
+variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
+called [k-means\|\|](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
+The implementation in MLlib has the following parameters:
* *k* is the number of clusters.
* *maxIterations* is the maximum number of iterations to run.
* *initializationMode* specifies either random initialization or
-initialization via a parallelized variant of the
-[k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method.
+initialization via k-means\|\|.
* *runs* is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
-* *initializiationSteps* determines the number of steps in the k-means++ algorithm.
+* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
+[KMeans](api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans)
+
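+A brief k-means sketch (the file name and parameter values are illustrative,
+and a `KMeans.train(data, k, maxIterations)` helper matching the parameters
+above is assumed):
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.KMeans
+
+// Parse each line into a data point, i.e. an array of double-valued features.
+val points = sc.textFile("kmeans_data.txt").map(_.split(' ').map(_.toDouble))
+
+// Cluster the points into 2 clusters using at most 20 iterations.
+val clusters = KMeans.train(points, 2, 20)
+{% endhighlight %}
+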
# Collaborative Filtering
[Collaborative
@@ -124,6 +154,8 @@ following parameters:
* *iterations* is the number of iterations to run.
* *lambda* specifies the regularization parameter in ALS.
+[ALS](api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS)
+
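+A brief ALS sketch (the file name, its comma-separated `user,item,rating`
+format, and the parameter values are illustrative, and an
+`ALS.train(ratings, rank, iterations, lambda)` overload matching the
+parameters above is assumed):
+
+{% highlight scala %}
+import org.apache.spark.mllib.recommendation.{ALS, Rating}
+
+// Parse "user,item,rating" lines into Rating objects.
+val ratings = sc.textFile("als_data.txt").map { line =>
+  val Array(user, item, rate) = line.split(',')
+  Rating(user.toInt, item.toInt, rate.toDouble)
+}
+
+// rank = 10, iterations = 20, lambda = 0.01
+val alsModel = ALS.train(ratings, 10, 20, 0.01)
+{% endhighlight %}
+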
# Gradient Descent Primitive
[Gradient descent](http://en.wikipedia.org/wiki/Gradient_descent) (along with
@@ -150,3 +182,5 @@ stepSize / sqrt(t).
* *regParam* is the regularization parameter when using L1 or L2 regularization.
* *miniBatchFraction* is the fraction of the data used to compute the gradient
at each iteration.
+
+[GradientDescent](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
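+
+These parameters can be set directly on the optimizer that each algorithm
+exposes, as in the builder-pattern snippet from the binary classification
+section; a brief sketch (parameter values are illustrative):
+
+{% highlight scala %}
+import org.apache.spark.mllib.classification.SVMWithSGD
+
+// Configure the gradient descent parameters described above on the
+// optimizer exposed by an algorithm object, then train as usual.
+val alg = new SVMWithSGD()
+alg.optimizer.setNumIterations(200)
+  .setStepSize(1.0)
+  .setRegParam(0.1)
+  .setMiniBatchFraction(1.0)
+val sgdModel = alg.run(parsedData)
+{% endhighlight %}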