path: root/docs/mllib-linear-methods.md
author     freeman <the.freeman.lab@gmail.com>    2014-08-19 18:07:42 -0700
committer  Xiangrui Meng <meng@databricks.com>    2014-08-19 18:07:42 -0700
commit     c7252b0097cfacd36f17357d195b12a59e503b35 (patch)
tree       e0085ea38c57b42288e8ffcea61cae1dc595c729 /docs/mllib-linear-methods.md
parent     1870dbaa5591883e61b2173d064c1a67e871b0f5 (diff)
[SPARK-3112][MLLIB] Add documentation and example for StreamingLR
Added a documentation section on StreamingLR to the ``MLlib - Linear Methods`` guide, including a worked example. mengxr tdas

Author: freeman <the.freeman.lab@gmail.com>

Closes #2047 from freeman-lab/streaming-lr-docs and squashes the following commits:

568d250 [freeman] Tweaks to wording / formatting
05a1139 [freeman] Added documentation and example for StreamingLR
Diffstat (limited to 'docs/mllib-linear-methods.md')
-rw-r--r--  docs/mllib-linear-methods.md  75
1 file changed, 75 insertions(+), 0 deletions(-)
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index e504cd7f0f..9137f9dc1b 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -518,6 +518,81 @@ print("Mean Squared Error = " + str(MSE))
</div>
</div>
+## Streaming linear regression
+
+When data arrive in a streaming fashion, it is useful to fit regression models online,
+updating the parameters of the model as new data arrive. MLlib currently supports
+streaming linear regression using ordinary least squares. The fitting is similar
+to that performed offline, except that fitting occurs on each batch of data, so the
+model is continually updated to reflect the data from the stream.
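+
+Roughly speaking, each batch is used to take one or more stochastic gradient descent steps
+on the least-squares objective, starting from the current weights. Ignoring scaling constants,
+a single step with step size `$\alpha$` over a batch `$B_t$` has the form
+
+`\[ \mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \sum_{(\mathbf{x}, y) \in B_t} \left( \mathbf{w}_t^T \mathbf{x} - y \right) \mathbf{x} \]`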
+
+### Examples
+
+The following example demonstrates how to load training and testing data from two different
+input streams of text files, parse the streams as labeled points, fit a linear regression model
+online to the first stream, and make predictions on the second stream.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+First, we import the necessary classes for parsing our input data and creating the model.
+
+{% highlight scala %}
+
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
+
+{% endhighlight %}
+
+Then we make input streams for training and testing data. We assume a StreamingContext `ssc`
+has already been created; see the [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing)
+for more info. For this example, we use labeled points in both the training and testing streams,
+but in practice you will likely want to use unlabeled vectors for test data.
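+
+If you do not already have a streaming context, a minimal sketch of creating one follows; the
+application name, local master, and one-second batch interval here are illustrative assumptions,
+not requirements.
+
+{% highlight scala %}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.streaming.{Seconds, StreamingContext}
+
+// Illustrative configuration: a local master and a 1-second batch interval.
+val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingLinearRegressionExample")
+val ssc = new StreamingContext(conf, Seconds(1))
+
+{% endhighlight %}
+
+With `ssc` in hand, the training and testing streams are created from directories of text files: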
+
+{% highlight scala %}
+
+val trainingData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse)
+val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)
+
+{% endhighlight %}
+
+We create our model by initializing the weights to zero:
+
+{% highlight scala %}
+
+val numFeatures = 3
+val model = new StreamingLinearRegressionWithSGD()
+ .setInitialWeights(Vectors.zeros(numFeatures))
+
+{% endhighlight %}
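+
+The step size and the number of gradient descent iterations run per batch can also be configured.
+The sketch below only illustrates the setters with arbitrary values, not recommendations; the rest
+of this example continues to use the simple `model` defined above.
+
+{% highlight scala %}
+
+// Optional tuning; the specific values here are purely illustrative.
+val tunedModel = new StreamingLinearRegressionWithSGD()
+  .setInitialWeights(Vectors.zeros(numFeatures))
+  .setStepSize(0.5)
+  .setNumIterations(10)
+
+{% endhighlight %}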
+
+Now we register the streams for training and testing and start the job.
+Printing predictions alongside true labels lets us easily see the result.
+
+{% highlight scala %}
+
+model.trainOn(trainingData)
+model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
+
+ssc.start()
+ssc.awaitTermination()
+
+{% endhighlight %}
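+
+If the test stream carried plain feature vectors rather than labeled points, predictions could
+instead be made with `predictOn` before starting the context. A minimal sketch, where
+`/unlabeled/data/dir` is a hypothetical directory of files whose lines are vectors such as
+`[1.0,2.0,3.0]`:
+
+{% highlight scala %}
+
+// Hypothetical stream of unlabeled feature vectors, one vector per line.
+val unlabeledData = ssc.textFileStream("/unlabeled/data/dir").map(Vectors.parse)
+model.predictOn(unlabeledData).print()
+
+{% endhighlight %}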
+
+We can now save text files with data to the training or testing folders.
+Each line should be a data point formatted as `(y,[x1,x2,x3])`, where `y` is the label
+and `x1,x2,x3` are the features. Any time a text file is placed in `/training/data/dir`,
+the model will update. Any time a text file is placed in `/testing/data/dir`, you will see predictions.
+As you feed more data to the training directory, the predictions
+will get better!
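+
+For example, a line such as `(2.5,[1.0,2.0,3.0])` describes a point with label 2.5 and features
+(1.0, 2.0, 3.0). The snippet below is purely illustrative and shows how such a line is parsed:
+
+{% highlight scala %}
+
+// A single line in the expected (y,[x1,x2,x3]) format, parsed into a LabeledPoint.
+val point = LabeledPoint.parse("(2.5,[1.0,2.0,3.0])")
+// point.label is 2.5; point.features is the dense vector [1.0,2.0,3.0].
+
+{% endhighlight %}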
+
+</div>
+
+</div>
+
+
## Implementation (developer)
Behind the scene, MLlib implements a simple distributed version of stochastic gradient descent