author    Joseph K. Bradley <joseph@databricks.com>  2015-02-25 16:13:17 -0800
committer Xiangrui Meng <meng@databricks.com>        2015-02-25 16:13:17 -0800
commit    d20559b157743981b9c09e286f2aaff8cbefab59 (patch)
tree      6d92015c1ae6b05c725860685351f86b8c4ed6af /docs
parent    46a044a36a2aff1306f7f677e952ce253ddbefac (diff)
[SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT
* Add GradientBoostedTrees Python examples to ML guide
  * I ran these in the pyspark shell, and they worked.
* Add save/load to examples in ML guide
* Added note to python docs about predict,transform not working within RDD actions,transformations in some cases (See SPARK-5981)

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:

c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide. Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
6d81c3e [Joseph K. Bradley] completed python GBT examples
9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide. Added GBT examples to ML guide
Diffstat (limited to 'docs')
-rw-r--r--  docs/mllib-classification-regression.md |  9
-rw-r--r--  docs/mllib-collaborative-filtering.md   |  9
-rw-r--r--  docs/mllib-decision-tree.md             | 20
-rw-r--r--  docs/mllib-ensembles.md                 | 94
-rw-r--r--  docs/mllib-linear-methods.md            | 21
-rw-r--r--  docs/mllib-naive-bayes.md               | 10
6 files changed, 155 insertions(+), 8 deletions(-)
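The SPARK-5981 caveat in the commit message is why each new Python example below calls `predict` on an RDD of feature vectors rather than per point inside a transformation: PySpark's tree and ALS models wrap JVM objects that are only reachable from the driver. A minimal sketch of the two patterns, with hypothetical `model` and `testData` names:

{% highlight python %}
# Problematic in some cases (see SPARK-5981): predict() on a JVM-backed
# model inside an RDD transformation runs on workers, where the model's
# Py4J gateway to the JVM does not exist.
# labelsAndPreds = testData.map(lambda lp: (lp.label, model.predict(lp.features)))

# Pattern used by the examples in this commit: predict on the whole RDD
# of feature vectors from the driver side, then zip with the true labels.
predictions = model.predict(testData.map(lambda lp: lp.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
{% endhighlight %}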
diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md
index 5b9b4dd83b..8e91d62f4a 100644
--- a/docs/mllib-classification-regression.md
+++ b/docs/mllib-classification-regression.md
@@ -17,13 +17,13 @@ the supported algorithms for each type of problem.
</thead>
<tbody>
<tr>
- <td>Binary Classification</td><td>linear SVMs, logistic regression, decision trees, naive Bayes</td>
+ <td>Binary Classification</td><td>linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes</td>
</tr>
<tr>
- <td>Multiclass Classification</td><td>decision trees, naive Bayes</td>
+ <td>Multiclass Classification</td><td>decision trees, random forests, naive Bayes</td>
</tr>
<tr>
- <td>Regression</td><td>linear least squares, Lasso, ridge regression, decision trees, isotonic regression</td>
+ <td>Regression</td><td>linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression</td>
</tr>
</tbody>
</table>
@@ -34,5 +34,8 @@ More details for these methods can be found here:
* [binary classification (SVMs, logistic regression)](mllib-linear-methods.html#binary-classification)
* [linear regression (least squares, Lasso, ridge)](mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression)
* [Decision trees](mllib-decision-tree.html)
+* [Ensembles of decision trees](mllib-ensembles.html)
+ * [random forests](mllib-ensembles.html#random-forests)
+ * [gradient-boosted trees](mllib-ensembles.html#gradient-boosted-trees-gbts)
* [Naive Bayes](mllib-naive-bayes.html)
* [Isotonic regression](mllib-isotonic-regression.html)
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index ef18cec937..935cd8dad3 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -66,6 +66,7 @@ recommendation model by measuring the Mean Squared Error of rating prediction.
{% highlight scala %}
import org.apache.spark.mllib.recommendation.ALS
+import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
@@ -95,6 +96,9 @@ val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
err * err
}.mean()
println("Mean Squared Error = " + MSE)
+
+model.save("myModelPath")
+val sameModel = MatrixFactorizationModel.load("myModelPath")
{% endhighlight %}
If the rating matrix is derived from another source of information (e.g., it is inferred from
@@ -181,6 +185,9 @@ public class CollaborativeFiltering {
}
).rdd()).mean();
System.out.println("Mean Squared Error = " + MSE);
+
+ model.save("myModelPath");
+ MatrixFactorizationModel sameModel = MatrixFactorizationModel.load("myModelPath");
}
}
{% endhighlight %}
@@ -191,6 +198,8 @@ In the following example we load rating data. Each row consists of a user, a pro
We use the default ALS.train() method which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.recommendation import ALS, Rating
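
# A sketch of the remainder of this example, which the diff truncates at
# the import above (assumes the Spark 1.3-era pyspark.mllib API and the
# `sc` SparkContext from the pyspark shell):
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using explicit-feedback ALS.
model = ALS.train(ratings, rank=10, iterations=20)

# Evaluate: predict all (user, product) pairs and compare with the
# observed ratings via Mean Squared Error.
testdata = ratings.map(lambda r: (r[0], r[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()
print("Mean Squared Error = " + str(MSE))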
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 6675133a81..4695d1cde4 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -194,6 +194,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -221,6 +222,9 @@ val labelAndPreds = testData.map { point =>
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification tree model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = DecisionTreeModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -279,10 +283,16 @@ Double testErr =
}).count() / testData.count();
System.out.println("Test Error: " + testErr);
System.out.println("Learned classification tree model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+DecisionTreeModel sameModel = DecisionTreeModel.load("myModelPath");
{% endhighlight %}
</div>
<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
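from pyspark.mllib.util import MLUtils

# A sketch of the remainder of this example, which the diff truncates at
# the imports above (assumes the Spark 1.3-era API and the pyspark
# shell's `sc`; the regression variant uses DecisionTree.trainRegressor
# with impurity='variance'):
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Predict on the feature RDD as a whole (see the SPARK-5981 note above),
# then zip predictions with the true labels to compute the test error.
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() \
    / float(testData.count())
print('Test Error = ' + str(testErr))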
@@ -324,6 +334,7 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -350,6 +361,9 @@ val labelsAndPredictions = testData.map { point =>
val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression tree model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = DecisionTreeModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -414,10 +428,16 @@ Double testMSE =
}) / data.count();
System.out.println("Test Mean Squared Error: " + testMSE);
System.out.println("Learned regression tree model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+DecisionTreeModel sameModel = DecisionTreeModel.load("myModelPath");
{% endhighlight %}
</div>
<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
index 00040e6073..ddae84165f 100644
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -98,6 +98,7 @@ The test error is calculated to measure the algorithm accuracy.
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.mllib.tree.RandomForest
+import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -127,6 +128,9 @@ val labelAndPreds = testData.map { point =>
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification forest model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = RandomForestModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -188,10 +192,16 @@ Double testErr =
}).count() / testData.count();
System.out.println("Test Error: " + testErr);
System.out.println("Learned classification forest model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+RandomForestModel sameModel = RandomForestModel.load("myModelPath");
{% endhighlight %}
</div>
<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
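
# A sketch of the remainder of this example (truncated in the diff;
# assumes the Spark 1.3-era API and the pyspark shell's `sc`). The
# regression variant uses RandomForest.trainRegressor with
# impurity='variance'.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# featureSubsetStrategy="auto" lets the algorithm pick how many features
# to consider per split; use more than 3 trees in practice.
model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() \
    / float(testData.count())
print('Test Error = ' + str(testErr))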
@@ -235,6 +245,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
<div data-lang="scala">
{% highlight scala %}
import org.apache.spark.mllib.tree.RandomForest
+import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -264,6 +275,9 @@ val labelsAndPredictions = testData.map { point =>
val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression forest model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = RandomForestModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -328,10 +342,16 @@ Double testMSE =
}) / testData.count();
System.out.println("Test Mean Squared Error: " + testMSE);
System.out.println("Learned regression forest model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+RandomForestModel sameModel = RandomForestModel.load("myModelPath");
{% endhighlight %}
</div>
<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
@@ -441,8 +461,6 @@ iterations.
### Examples
-GBTs currently have APIs in Scala and Java. Examples in both languages are shown below.
-
#### Classification
The example below demonstrates how to load a
@@ -457,6 +475,7 @@ The test error is calculated to measure the algorithm accuracy.
{% highlight scala %}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -484,6 +503,9 @@ val labelAndPreds = testData.map { point =>
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification GBT model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = GradientBoostedTreesModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -545,6 +567,38 @@ Double testErr =
}).count() / testData.count();
System.out.println("Test Error: " + testErr);
System.out.println("Learned classification GBT model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+GradientBoostedTreesModel sameModel = GradientBoostedTreesModel.load("myModelPath");
+{% endhighlight %}
+</div>
+
+<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
+{% highlight python %}
+from pyspark.mllib.tree import GradientBoostedTrees
+from pyspark.mllib.util import MLUtils
+
+# Load and parse the data file.
+data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+# Split the data into training and test sets (30% held out for testing)
+(trainingData, testData) = data.randomSplit([0.7, 0.3])
+
+# Train a GradientBoostedTrees model.
+# Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
+# (b) Use more iterations in practice.
+model = GradientBoostedTrees.trainClassifier(trainingData,
+ categoricalFeaturesInfo={}, numIterations=3)
+
+# Evaluate model on test instances and compute test error
+predictions = model.predict(testData.map(lambda x: x.features))
+labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
+testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
+print('Test Error = ' + str(testErr))
+print('Learned classification GBT model:')
+print(model.toDebugString())
{% endhighlight %}
</div>
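One caveat on the Python example just added: `lambda (v, p): v != p` relies on tuple-parameter unpacking, which is Python 2-only syntax (removed in Python 3 by PEP 3113). A version-neutral way to write the error computation:

{% highlight python %}
# Index into the (label, prediction) pair instead of unpacking it in the
# lambda signature; this parses under both Python 2 and 3.
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() \
    / float(testData.count())
{% endhighlight %}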
@@ -565,6 +619,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
{% highlight scala %}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
@@ -591,6 +646,9 @@ val labelsAndPredictions = testData.map { point =>
val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression GBT model:\n" + model.toDebugString)
+
+model.save("myModelPath")
+val sameModel = GradientBoostedTreesModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -658,6 +716,38 @@ Double testMSE =
}) / data.count();
System.out.println("Test Mean Squared Error: " + testMSE);
System.out.println("Learned regression GBT model:\n" + model.toDebugString());
+
+model.save("myModelPath");
+GradientBoostedTreesModel sameModel = GradientBoostedTreesModel.load("myModelPath");
+{% endhighlight %}
+</div>
+
+<div data-lang="python">
+
+Note that the Python API does not yet support model save/load but will in the future.
+
+{% highlight python %}
+from pyspark.mllib.tree import GradientBoostedTrees
+from pyspark.mllib.util import MLUtils
+
+# Load and parse the data file.
+data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+# Split the data into training and test sets (30% held out for testing)
+(trainingData, testData) = data.randomSplit([0.7, 0.3])
+
+# Train a GradientBoostedTrees model.
+# Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
+# (b) Use more iterations in practice.
+model = GradientBoostedTrees.trainRegressor(trainingData,
+ categoricalFeaturesInfo={}, numIterations=3)
+
+# Evaluate model on test instances and compute test error
+predictions = model.predict(testData.map(lambda x: x.features))
+labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
+testMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum() / float(testData.count())
+print('Test Mean Squared Error = ' + str(testMSE))
+print('Learned regression GBT model:')
+print(model.toDebugString())
{% endhighlight %}
</div>
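The regression example above carries the same Python 2-only lambda syntax; in addition, the explicit sum()/count() division is just the RDD's built-in mean(). A version-neutral equivalent:

{% highlight python %}
# Squared error per (label, prediction) pair, averaged with mean().
testMSE = labelsAndPredictions.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
{% endhighlight %}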
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 44b7f67c57..d9fc63b37d 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -190,7 +190,7 @@ error.
{% highlight scala %}
import org.apache.spark.SparkContext
-import org.apache.spark.mllib.classification.SVMWithSGD
+import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
@@ -222,6 +222,9 @@ val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
+
+model.save("myModelPath")
+val sameModel = SVMModel.load("myModelPath")
{% endhighlight %}
The `SVMWithSGD.train()` method by default performs L2 regularization with the
@@ -304,6 +307,9 @@ public class SVMClassifier {
double auROC = metrics.areaUnderROC();
System.out.println("Area under ROC = " + auROC);
+
+ model.save("myModelPath");
+ SVMModel sameModel = SVMModel.load("myModelPath");
}
}
{% endhighlight %}
@@ -338,6 +344,8 @@ a dependency.
The following example shows how to load a sample dataset, build a Logistic Regression model,
and make predictions with the resulting model to compute the training error.
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
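
# A sketch of the remainder of this example (truncated in the diff;
# assumes the Spark 1.3-era API and the pyspark shell's `sc`):
def parsePoint(line):
    # Space-separated values: the label followed by the features.
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model with default SGD parameters.
model = LogisticRegressionWithSGD.train(parsedData)

# PySpark's linear models hold their weights locally, so per-point
# predict() inside a transformation is safe here (unlike tree models).
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() \
    / float(parsedData.count())
print("Training Error = " + str(trainErr))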
@@ -391,8 +399,9 @@ values. We compute the mean squared error at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
{% highlight scala %}
-import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.regression.LinearRegressionModel
+import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
@@ -413,6 +422,9 @@ val valuesAndPreds = parsedData.map { point =>
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
+
+model.save("myModelPath")
+val sameModel = LinearRegressionModel.load("myModelPath")
{% endhighlight %}
[`RidgeRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
@@ -483,6 +495,9 @@ public class LinearRegression {
}
).rdd()).mean();
System.out.println("training Mean Squared Error = " + MSE);
+
+ model.save("myModelPath");
+ LinearRegressionModel sameModel = LinearRegressionModel.load("myModelPath");
}
}
{% endhighlight %}
@@ -494,6 +509,8 @@ The example then uses LinearRegressionWithSGD to build a simple linear model to
values. We compute the mean squared error at the end to evaluate
[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
+Note that the Python API does not yet support model save/load but will in the future.
+
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from numpy import array
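
# A sketch of the remainder of this example (truncated in the diff;
# assumes the Spark 1.3-era API and the pyspark shell's `sc`):
def parsePoint(line):
    # "label,f1 f2 f3 ...": a comma separates the label from the
    # space-separated features.
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)

model = LinearRegressionWithSGD.train(parsedData)

# Evaluate on the training data; linear models predict locally, so the
# per-point call inside map() is safe.
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
print("training Mean Squared Error = " + str(MSE))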
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
index d5b044d94f..81173255b5 100644
--- a/docs/mllib-naive-bayes.md
+++ b/docs/mllib-naive-bayes.md
@@ -37,7 +37,7 @@ smoothing parameter `lambda` as input, and output a
can be used for evaluation and prediction.
{% highlight scala %}
-import org.apache.spark.mllib.classification.NaiveBayes
+import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
@@ -55,6 +55,9 @@ val model = NaiveBayes.train(training, lambda = 1.0)
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
+
+model.save("myModelPath")
+val sameModel = NaiveBayesModel.load("myModelPath")
{% endhighlight %}
</div>
@@ -93,6 +96,9 @@ double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>,
return pl._1().equals(pl._2());
}
}).count() / (double) test.count();
+
+model.save("myModelPath");
+NaiveBayesModel sameModel = NaiveBayesModel.load("myModelPath");
{% endhighlight %}
</div>
@@ -105,6 +111,8 @@ smoothing parameter `lambda` as input, and output a
[NaiveBayesModel](api/python/pyspark.mllib.classification.NaiveBayesModel-class.html), which can be
used for evaluation and prediction.
+Note that the Python API does not yet support model save/load but will in the future.
+
<!-- TODO: Make Python's example consistent with Scala's and Java's. -->
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
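from pyspark.mllib.classification import NaiveBayes

# A sketch of the remainder of this example (truncated in the diff;
# assumes the Spark 1.3-era API, the pyspark shell's `sc`, and the
# sample_naive_bayes_data.txt file shipped with Spark):
def parseLine(line):
    # "label,f1 f2 f3 ...": comma-separated label, space-separated features.
    parts = line.split(',')
    return LabeledPoint(float(parts[0]),
                        [float(x) for x in parts[1].split(' ')])

data = sc.textFile('data/mllib/sample_naive_bayes_data.txt').map(parseLine)

# Split into training (60%) and test (40%).
training, test = data.randomSplit([0.6, 0.4], seed=0)

# Train with smoothing parameter lambda = 1.0.
model = NaiveBayes.train(training, 1.0)

# NaiveBayesModel keeps its parameters locally, so per-point predict()
# inside a transformation is safe here.
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() \
    / test.count()
print('accuracy = ' + str(accuracy))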