author    Martin Jaggi <m.jaggi@gmail.com>    2014-02-08 11:39:13 -0800
committer Patrick Wendell <pwendell@gmail.com>    2014-02-08 11:39:13 -0800
commit    fabf1749995103841e6a3975892572f376ee48d0 (patch)
tree      a9c03486cce6cc4f390405f33266a31861ebe3d4 /docs/mllib-classification-regression.md
parent    3a9d82cc9e85accb5c1577cf4718aa44c8d5038c (diff)
Merge pull request #552 from martinjaggi/master. Closes #552.
TeX formulas in the documentation using MathJax, and splitting the MLlib documentation by techniques.

See JIRA https://spark-project.atlassian.net/browse/MLLIB-19 and https://github.com/shivaram/spark/compare/mathjax

Author: Martin Jaggi <m.jaggi@gmail.com>

== Merge branch commits ==

commit 0364bfabbfc347f917216057a20c39b631842481
Author: Martin Jaggi <m.jaggi@gmail.com>
Date:   Fri Feb 7 03:19:38 2014 +0100

    minor polishing, as suggested by @pwendell

commit dcd2142c164b2f602bf472bb152ad55bae82d31a
Author: Martin Jaggi <m.jaggi@gmail.com>
Date:   Thu Feb 6 18:04:26 2014 +0100

    enabling inline latex formulas with $.$
    same mathjax configuration as used in math.stackexchange.com
    sample usage in the linear algebra (SVD) documentation

commit bbafafd2b497a5acaa03a140bb9de1fbb7d67ffa
Author: Martin Jaggi <m.jaggi@gmail.com>
Date:   Thu Feb 6 17:31:29 2014 +0100

    split MLlib documentation by techniques and linked from the main mllib-guide.md site

commit d1c5212b93c67436543c2d8ddbbf610fdf0a26eb
Author: Martin Jaggi <m.jaggi@gmail.com>
Date:   Thu Feb 6 16:59:43 2014 +0100

    enable mathjax formula in the .md documentation files
    code by @shivaram

commit d73948db0d9bc36296054e79fec5b1a657b4eab4
Author: Martin Jaggi <m.jaggi@gmail.com>
Date:   Thu Feb 6 16:57:23 2014 +0100

    minor update on how to compile the documentation
Diffstat (limited to 'docs/mllib-classification-regression.md')
-rw-r--r--  docs/mllib-classification-regression.md  206
1 file changed, 206 insertions, 0 deletions
diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md
new file mode 100644
index 0000000000..edb9338907
--- /dev/null
+++ b/docs/mllib-classification-regression.md
@@ -0,0 +1,206 @@
+---
+layout: global
+title: MLlib - Classification and Regression
+---
+
+* Table of contents
+{:toc}
+
+
+# Binary Classification
+
+Binary classification is a supervised learning problem in which we want to
+classify entities into one of two distinct categories or labels, e.g.,
+predicting whether or not emails are spam. This problem involves executing a
+learning *algorithm* on a set of *labeled* examples, i.e., a set of entities
+represented via (numerical) features along with underlying category labels.
+The algorithm returns a trained *model* that can predict labels for new
+entities whose underlying labels are unknown.
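+
+In MLlib, a labeled example is represented as a `LabeledPoint`, which pairs a
+numeric label with an array of feature values, as in the usage examples further
+below. A minimal sketch (the feature values here are made up):
+
+{% highlight scala %}
+import org.apache.spark.mllib.regression.LabeledPoint
+
+// Label 1.0 (e.g., spam) with three hypothetical numerical features
+val example = LabeledPoint(1.0, Array(0.0, 2.6, 1.0))
+{% endhighlight %}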
+
+MLlib currently supports two standard model families for binary classification,
+namely [Linear Support Vector Machines
+(SVMs)](http://en.wikipedia.org/wiki/Support_vector_machine) and [Logistic
+Regression](http://en.wikipedia.org/wiki/Logistic_regression), along with [L1
+and L2 regularized](http://en.wikipedia.org/wiki/Regularization_(mathematics))
+variants of each model family. The training algorithms all leverage an
+underlying gradient descent primitive (described
+[below](#gradient-descent-primitive)), and take as input a regularization
+parameter (*regParam*) along with various parameters associated with gradient
+descent (*stepSize*, *numIterations*, *miniBatchFraction*).
+
+Available algorithms for binary classification:
+
+* [SVMWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.SVMWithSGD)
+* [LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD)
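+
+As a minimal sketch of how the parameters above map onto the training API,
+assuming a `train` overload that accepts them all explicitly (see the API links
+above for the exact signatures):
+
+{% highlight scala %}
+import org.apache.spark.mllib.classification.SVMWithSGD
+
+// `data` is an RDD[LabeledPoint], prepared as in the usage examples below;
+// the parameter values here are hypothetical.
+val numIterations = 100
+val stepSize = 1.0
+val regParam = 0.01
+val miniBatchFraction = 1.0
+val model = SVMWithSGD.train(data, numIterations, stepSize, regParam, miniBatchFraction)
+{% endhighlight %}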
+
+# Linear Regression
+
+Linear regression is another classical supervised learning setting. In this
+problem, each entity is associated with a real-valued label (as opposed to a
+binary label as in binary classification), and we want to predict labels as
+closely as possible given numerical features representing entities. MLlib
+supports linear regression as well as L1
+([lasso](http://en.wikipedia.org/wiki/Lasso_(statistics)#Lasso_method)) and L2
+([ridge](http://en.wikipedia.org/wiki/Ridge_regression)) regularized variants.
+The regression algorithms in MLlib also leverage the underlying gradient
+descent primitive (described [below](#gradient-descent-primitive)), and have
+the same parameters as the binary classification algorithms described above.
+
+Available algorithms for linear regression:
+
+* [LinearRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD)
+* [RidgeRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
+* [LassoWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LassoWithSGD)
+
+Behind the scenes, all of the above methods use the SGD implementation from
+MLlib's gradient descent primitive; see the
+<a href="mllib-optimization.html">optimization</a> section:
+
+* [GradientDescent](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
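+
+Each of these algorithm objects exposes its underlying `GradientDescent`
+instance through an `optimizer` field, so the gradient descent parameters can
+also be set directly. A minimal sketch, assuming `LinearRegressionWithSGD`
+exposes the same zero-argument constructor and `optimizer` handle that
+`SVMWithSGD` does in the customization example further below:
+
+{% highlight scala %}
+import org.apache.spark.mllib.regression.LinearRegressionWithSGD
+
+val lrAlg = new LinearRegressionWithSGD()
+// Tune the underlying GradientDescent optimizer directly
+lrAlg.optimizer
+  .setNumIterations(100)
+  .setStepSize(0.5)
+  .setMiniBatchFraction(1.0)
+{% endhighlight %}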
+
+
+# Usage in Scala
+
+The following code snippets can be executed in `spark-shell`.
+
+## Binary Classification
+
+The following code snippet illustrates how to load a sample dataset, execute a
+training algorithm on this training data using a static method in the algorithm
+object, and make predictions with the resulting model to compute the training
+error.
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.classification.SVMWithSGD
+import org.apache.spark.mllib.regression.LabeledPoint
+
+// Load and parse the data file
+val data = sc.textFile("mllib/data/sample_svm_data.txt")
+val parsedData = data.map { line =>
+ val parts = line.split(' ')
+ LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
+}
+
+// Run training algorithm to build the model
+val numIterations = 20
+val model = SVMWithSGD.train(parsedData, numIterations)
+
+// Evaluate model on training examples and compute training error
+val labelAndPreds = parsedData.map { point =>
+ val prediction = model.predict(point.features)
+ (point.label, prediction)
+}
+val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
+println("Training Error = " + trainErr)
+{% endhighlight %}
+
+
+The `SVMWithSGD.train()` method by default performs L2 regularization with the
+regularization parameter set to 1.0. If we want to configure this algorithm, we
+can customize `SVMWithSGD` further by creating a new object directly and
+calling setter methods. All other MLlib algorithms support customization in
+this way as well. For example, the following code produces an L1 regularized
+variant of SVMs with regularization parameter set to 0.1, and runs the training
+algorithm for 200 iterations.
+
+{% highlight scala %}
+import org.apache.spark.mllib.optimization.L1Updater
+
+val svmAlg = new SVMWithSGD()
+svmAlg.optimizer.setNumIterations(200)
+ .setRegParam(0.1)
+ .setUpdater(new L1Updater)
+val modelL1 = svmAlg.run(parsedData)
+{% endhighlight %}
+
+## Linear Regression
+The following example demonstrates how to load training data and parse it as an
+RDD of LabeledPoint. The example then uses LinearRegressionWithSGD to build a
+simple linear model to predict label values. We compute the Mean Squared Error
+at the end to evaluate the
+[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
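+
+With MathJax enabled on these pages, the quantity computed in the snippet can
+be stated precisely: for $n$ training points with labels $y_i$ and predictions
+$\hat{y}_i$, the training MSE is $\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$.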
+
+{% highlight scala %}
+import org.apache.spark.mllib.regression.LinearRegressionWithSGD
+import org.apache.spark.mllib.regression.LabeledPoint
+
+// Load and parse the data
+val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
+val parsedData = data.map { line =>
+ val parts = line.split(',')
+ LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
+}
+
+// Building the model
+val numIterations = 20
+val model = LinearRegressionWithSGD.train(parsedData, numIterations)
+
+// Evaluate model on training examples and compute training error
+val valuesAndPreds = parsedData.map { point =>
+ val prediction = model.predict(point.features)
+ (point.label, prediction)
+}
+val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.reduce(_ + _) / valuesAndPreds.count
+println("Training Mean Squared Error = " + MSE)
+{% endhighlight %}
+
+
+Similarly, you can use `RidgeRegressionWithSGD` and `LassoWithSGD` and compare the training
+[Mean Squared Errors](http://en.wikipedia.org/wiki/Mean_squared_error).
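+
+A minimal sketch of such a comparison, assuming `RidgeRegressionWithSGD` and
+`LassoWithSGD` provide the same two-argument `train` method used above:
+
+{% highlight scala %}
+import org.apache.spark.mllib.regression.{RidgeRegressionWithSGD, LassoWithSGD}
+
+// Reuses parsedData and numIterations from the linear regression example
+val ridgeModel = RidgeRegressionWithSGD.train(parsedData, numIterations)
+val lassoModel = LassoWithSGD.train(parsedData, numIterations)
+
+// Training MSE for each model, computed as above
+val ridgeMSE = parsedData.map { p =>
+  math.pow(p.label - ridgeModel.predict(p.features), 2)
+}.reduce(_ + _) / parsedData.count
+val lassoMSE = parsedData.map { p =>
+  math.pow(p.label - lassoModel.predict(p.features), 2)
+}.reduce(_ + _) / parsedData.count
+{% endhighlight %}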
+
+
+# Usage in Java
+
+All of MLlib's methods use Java-friendly types, so you can import and call them from Java the
+same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
+Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
+calling `.rdd()` on your `JavaRDD` object.
+
+# Usage in Python
+The following examples can be tested in the PySpark shell.
+
+## Binary Classification
+The following example shows how to load a sample dataset, build a logistic regression model,
+and make predictions with the resulting model to compute the training error.
+
+{% highlight python %}
+from pyspark.mllib.classification import LogisticRegressionWithSGD
+from numpy import array
+
+# Load and parse the data
+data = sc.textFile("mllib/data/sample_svm_data.txt")
+parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
+
+# Build the model
+model = LogisticRegressionWithSGD.train(parsedData)
+
+# Evaluate the model on training data
+labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
+        model.predict(point.take(range(1, point.size)))))
+trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
+print("Training Error = " + str(trainErr))
+{% endhighlight %}
+
+## Linear Regression
+The following example demonstrates how to load training data and parse it as an
+RDD of LabeledPoint. The example then uses LinearRegressionWithSGD to build a
+simple linear model to predict label values. We compute the Mean Squared Error
+at the end to evaluate the
+[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
+
+{% highlight python %}
+from pyspark.mllib.regression import LinearRegressionWithSGD
+from numpy import array
+
+# Load and parse the data
+data = sc.textFile("mllib/data/ridge-data/lpsa.data")
+parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')]))
+
+# Build the model
+model = LinearRegressionWithSGD.train(parsedData)
+
+# Evaluate the model on training data
+valuesAndPreds = parsedData.map(lambda point: (point.item(0),
+ model.predict(point.take(range(1, point.size)))))
+MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y)/valuesAndPreds.count()
+print("Mean Squared Error = " + str(MSE))
+{% endhighlight %}