path: root/docs/mllib-guide.md
author Ameet Talwalkar <atalwalkar@gmail.com> 2013-09-08 19:21:30 -0700
committer Ameet Talwalkar <atalwalkar@gmail.com> 2013-09-08 19:21:30 -0700
commit 81a8bd46acb682c47481d9bbb170685f9d2b0e02 (patch)
tree 065640f9fab8fc803edd6d41a5aaa86b5e74f9fc /docs/mllib-guide.md
parent bf280c8b0faa542061a42f9ea56e93380b6d37f8 (diff)
response to PR comments
Diffstat (limited to 'docs/mllib-guide.md')
-rw-r--r-- docs/mllib-guide.md | 55
1 file changed, 30 insertions(+), 25 deletions(-)
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 35850bdc95..1a629994cc 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -43,26 +43,20 @@ import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
// Load and parse the data file
-val data = sc.textFile("sample_wiki_ngrams.txt")
+val data = sc.textFile("mllib/data/sample_svm_data.txt")
val parsedData = data.map(line => {
val parts = line.split(' ')
LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
})
// Run training algorithm
-val stepSizeVal = 1.0
-val regParamVal = 0.1
-val numIterationsVal = 200
-val miniBatchFractionVal = 1.0
+val numIterations = 20
val model = SVMWithSGD.train(
parsedData,
- numIterationsVal,
- stepSizeVal,
- regParamVal,
- miniBatchFractionVal)
+ numIterations)
// Evaluate model on training examples and compute training error
-val labelAnPreds = parsedData.map(r => {
+val labelAndPreds = parsedData.map(r => {
val prediction = model.predict(r.features)
(r.label, prediction)
})
@@ -70,30 +64,31 @@ val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedDa
println("trainError = " + trainErr)
{% endhighlight %}
-The `SVMWithSGD` algorithm performs L2 regularization by default. If we want to
-configure this algorithm to generate an L1 regularized variant of SVMs, we can
-use the builder design pattern as follows:
+The `SVMWithSGD.train()` method by default performs L2 regularization with the
+regularization parameter set to 1.0. If we want to configure this algorithm, we
+can customize `SVMWithSGD` further by creating a new object directly and
+calling setter methods. All other MLlib algorithms support customization in
+this way as well. For example, the following code produces an L1 regularized
+variant of SVMs with regularization parameter set to 0.1, and runs the training
+algorithm for 200 iterations.
{% highlight scala %}
import org.apache.spark.mllib.optimization.L1Updater
val svmAlg = new SVMWithSGD()
svmAlg.optimizer.setNumIterations(200)
- .setStepSize(1.0)
.setRegParam(0.1)
- .setMiniBatchFraction(1.0)
-svmAlg.optimizer.setUpdater(new L1Updater)
+ .setUpdater(new L1Updater)
val modelL1 = svmAlg.run(parsedData)
{% endhighlight %}
Both of the code snippets above can be executed in `spark-shell` to generate a
-classifier for the provided dataset. Moreover, note that static methods and
-builder patterns, similar to the ones displayed above, are available for all
-algorithms in MLlib.
+classifier for the provided dataset.
-[SVMWithSGD](`api/mllib/index.html#org.apache.spark.mllib.classification.SVMWithSGD`)
+Available algorithms for binary classification:
-[LogisticRegressionWithSGD](`api/mllib/index.html#org.apache.spark.mllib.classification.LogistictRegressionWithSGD`)
+* [SVMWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.SVMWithSGD)
+* [LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD)
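
As a minimal sketch, the second algorithm in this list can be invoked the same way as `SVMWithSGD` above (reusing the `parsedData` RDD and `numIterations` value from that example):

{% highlight scala %}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Sketch: train a logistic regression classifier with default parameters,
// reusing parsedData and numIterations from the SVM example above.
val lrModel = LogisticRegressionWithSGD.train(parsedData, numIterations)
val lrPreds = parsedData.map(p => (p.label, lrModel.predict(p.features)))
{% endhighlight %}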
# Linear Regression
@@ -108,7 +103,11 @@ The regression algorithms in MLlib also leverage the underlying gradient
descent primitive (described [below](#gradient-descent-primitive)), and have
the same parameters as the binary classification algorithms described above.
-[RidgeRegressionWithSGD](`api/mllib/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD`)
+Available algorithms for linear regression:
+
+* [LinearRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD)
+* [RidgeRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
+* [LassoWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LassoWithSGD)
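
A minimal sketch of least-squares regression, assuming an RDD of `LabeledPoint`s parsed as in the classification example above (the iteration count is illustrative, not a recommended setting):

{% highlight scala %}
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Sketch: fit a least-squares linear model on the parsedData RDD from
// the classification example, then predict on the training points.
val regModel = LinearRegressionWithSGD.train(parsedData, 100)
val regPreds = parsedData.map(p => regModel.predict(p.features))
{% endhighlight %}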
# Clustering
@@ -134,7 +133,9 @@ a given dataset, the algorithm returns the best clustering result).
* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
-[KMeans](`api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans`)
+Available algorithms for clustering:
+
+* [KMeans](api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans)
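
A minimal sketch of invoking k-means follows; the file name is illustrative, and the input is assumed to be whitespace-delimited numeric features, one point per line:

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans

// Sketch: parse dense points and cluster them into 2 clusters,
// running at most 20 iterations.
val points = sc.textFile("kmeans_data.txt")
  .map(line => line.split(' ').map(_.toDouble))
val clusters = KMeans.train(points, 2, 20)
{% endhighlight %}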
# Collaborative Filtering
@@ -154,7 +155,9 @@ following parameters:
* *iterations* is the number of iterations to run.
* *lambda* specifies the regularization parameter in ALS.
-[ALS](`api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS`)
+Available algorithms for collaborative filtering:
+
+* [ALS](api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS)
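
A minimal sketch of training an ALS model follows; the file name, the "user,product,rating" line format, and the parameter values (rank 10, 20 iterations, lambda 0.01) are illustrative assumptions:

{% highlight scala %}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Sketch: parse "user,product,rating" lines into Rating objects and
// factorize with rank 10, 20 iterations, and lambda = 0.01.
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}
val alsModel = ALS.train(ratings, 10, 20, 0.01)
{% endhighlight %}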
# Gradient Descent Primitive
@@ -183,4 +186,6 @@ stepSize / sqrt(t).
* *miniBatchFraction* is the fraction of the data used to compute the gradient
at each iteration.
-[GradientDescent](`api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent`)
+Available algorithms for gradient descent:
+
+* [GradientDescent](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
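
These parameters are not usually set on `GradientDescent` directly; as the SVM customization example above shows, they can be tuned through an algorithm's `optimizer` field. A minimal sketch (the parameter values are illustrative):

{% highlight scala %}
import org.apache.spark.mllib.classification.SVMWithSGD

// Sketch: tune the underlying gradient descent primitive through the
// optimizer field, as in the SVM customization example above.
val alg = new SVMWithSGD()
alg.optimizer.setNumIterations(200)
  .setStepSize(1.0)
  .setMiniBatchFraction(1.0)
val tunedModel = alg.run(parsedData)
{% endhighlight %}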