From 81a8bd46acb682c47481d9bbb170685f9d2b0e02 Mon Sep 17 00:00:00 2001
From: Ameet Talwalkar
Date: Sun, 8 Sep 2013 19:21:30 -0700
Subject: response to PR comments

---
 docs/mllib-guide.md | 55 +++++++++++++++++++++++++++++------------------------
 1 file changed, 30 insertions(+), 25 deletions(-)

(limited to 'docs/mllib-guide.md')

diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 35850bdc95..1a629994cc 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -43,26 +43,20 @@ import org.apache.spark.mllib.classification.SVMWithSGD
 import org.apache.spark.mllib.regression.LabeledPoint
 
 // Load and parse the data file
-val data = sc.textFile("sample_wiki_ngrams.txt")
+val data = sc.textFile("mllib/data/sample_svm_data.txt")
 val parsedData = data.map(line => {
   val parts = line.split(' ')
   LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
 })
 
 // Run training algorithm
-val stepSizeVal = 1.0
-val regParamVal = 0.1
-val numIterationsVal = 200
-val miniBatchFractionVal = 1.0
+val numIterations = 20
 val model = SVMWithSGD.train(
   parsedData,
-  numIterationsVal,
-  stepSizeVal,
-  regParamVal,
-  miniBatchFractionVal)
+  numIterations)
 
 // Evaluate model on training examples and compute training error
-val labelAnPreds = parsedData.map(r => {
+val labelAndPreds = parsedData.map(r => {
   val prediction = model.predict(r.features)
   (r.label, prediction)
 })
@@ -70,30 +64,31 @@ val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedDa
 println("trainError = " + trainErr)
 {% endhighlight %}
 
-The `SVMWithSGD` algorithm performs L2 regularization by default. If we want to
-configure this algorithm to generate an L1 regularized variant of SVMs, we can
-use the builder design pattern as follows:
+The `SVMWithSGD.train()` method by default performs L2 regularization with the
+regularization parameter set to 1.0. If we want to configure this algorithm, we
+can customize `SVMWithSGD` further by creating a new object directly and
+calling setter methods. All other MLlib algorithms support customization in
+this way as well. For example, the following code produces an L1 regularized
+variant of SVMs with regularization parameter set to 0.1, and runs the training
+algorithm for 200 iterations.
 
 {% highlight scala %}
 import org.apache.spark.mllib.optimization.L1Updater
 
 val svmAlg = new SVMWithSGD()
 svmAlg.optimizer.setNumIterations(200)
-  .setStepSize(1.0)
   .setRegParam(0.1)
-  .setMiniBatchFraction(1.0)
-svmAlg.optimizer.setUpdater(new L1Updater)
+  .setUpdater(new L1Updater)
 val modelL1 = svmAlg.run(parsedData)
 {% endhighlight %}
 
 Both of the code snippets above can be executed in `spark-shell` to generate a
-classifier for the provided dataset. Moreover, note that static methods and
-builder patterns, similar to the ones displayed above, are available for all
-algorithms in MLlib.
+classifier for the provided dataset.
 
-[SVMWithSGD](`api/mllib/index.html#org.apache.spark.mllib.classification.SVMWithSGD`)
+Available algorithms for binary classification:
 
-[LogisticRegressionWithSGD](`api/mllib/index.html#org.apache.spark.mllib.classification.LogistictRegressionWithSGD`)
+* [SVMWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.SVMWithSGD)
+* [LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD)
 
 # Linear Regression
 
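The new paragraph in the hunk above claims that all MLlib algorithms can be
customized via setter methods. As an illustrative sketch (not part of the
patch), here is that pattern applied to `LogisticRegressionWithSGD`, assuming
it exposes the same no-arg constructor and public `optimizer` field as
`SVMWithSGD` does in this version of MLlib; `parsedData` is the RDD built in
the first snippet:

{% highlight scala %}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Sketch only: assumes LogisticRegressionWithSGD, like SVMWithSGD, exposes a
// public `optimizer` whose setters (setNumIterations, setStepSize, ...) chain.
val lrAlg = new LogisticRegressionWithSGD()
lrAlg.optimizer.setNumIterations(200)
  .setStepSize(1.0)
  .setMiniBatchFraction(1.0)
val lrModel = lrAlg.run(parsedData)
{% endhighlight %}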
@@ -108,7 +103,11 @@ The regression algorithms in MLlib also leverage the underlying gradient
 descent primitive (described [below](#gradient-descent-primitive)), and have
 the same parameters as the binary classification algorithms described above.
 
-[RidgeRegressionWithSGD](`api/mllib/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD`)
+Available algorithms for linear regression:
+
+* [LinearRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD)
+* [RidgeRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
+* [LassoWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LassoWithSGD)
 
 # Clustering
 
@@ -134,7 +133,9 @@ a given dataset, the algorithm returns the best clustering result).
 * *initializiationSteps* determines the number of steps in the k-means\|\| algorithm.
 * *epsilon* determines the distance threshold within which we consider k-means to have converged.
 
-[KMeans](`api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans`)
+Available algorithms for clustering:
+
+* [KMeans](api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans)
 
 # Collaborative Filtering
 
@@ -154,7 +155,9 @@ following parameters:
 * *iterations* is the number of iterations to run.
 * *lambda* specifies the regularization parameter in ALS.
 
-[ALS](`api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS`)
+Available algorithms for collaborative filtering:
+
+* [ALS](api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS)
 
 # Gradient Descent Primitive
 
@@ -183,4 +186,6 @@ stepSize / sqrt(t).
 * *miniBatchFraction* is the fraction of the data used to compute the gradient at
 each iteration.
 
-[GradientDescent](`api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent`)
+Available algorithms for gradient descent:
+
+* [GradientDescent](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
-- 
cgit v1.2.3
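The clustering and collaborative filtering hunks above list parameters but show
no invocation. As a minimal sketch (not part of the patch) of how the
`KMeans.train` and `ALS.train` entry points linked above are typically called
from `spark-shell` in this version of MLlib, with hypothetical input files and
the parameter orders (data, k, maxIterations) and (ratings, rank, iterations,
lambda):

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical input: one point per line as space-separated doubles.
// Cluster into k = 2 clusters with at most 20 iterations of k-means.
val points = sc.textFile("kmeans_data.txt")
  .map(_.split(' ').map(_.toDouble))
val clusters = KMeans.train(points, 2, 20)

// Hypothetical input: "user,product,rating" per line.
// Factorize with rank = 10, 20 iterations, regularization lambda = 0.01.
val ratings = sc.textFile("als_ratings.txt").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}
val alsModel = ALS.train(ratings, 10, 20, 0.01)
{% endhighlight %}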