-rw-r--r--  docs/mllib-decision-tree.md  241
-rw-r--r--  docs/mllib-ensembles.md  653
-rw-r--r--  docs/mllib-guide.md  29
-rw-r--r--  examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostedTreesRunner.java  2
-rw-r--r--  examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestExample.java  139
-rwxr-xr-x  examples/src/main/python/mllib/decision_tree_runner.py  17
-rwxr-xr-x  examples/src/main/python/mllib/random_forest_example.py  89
-rw-r--r--  examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala  2
-rw-r--r--  examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostedTreesRunner.scala  2
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala  4
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala  22
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala  20
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala  6
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala  26
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala  2
-rw-r--r--  mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala  46
-rw-r--r--  mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala  2
-rw-r--r--  mllib/src/test/scala/org/apache/spark/mllib/tree/RandomForestSuite.scala  14
-rw-r--r--  python/pyspark/mllib/tree.py  6
19 files changed, 1140 insertions, 182 deletions
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 12a6afbeea..fc8e732251 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -11,7 +11,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Decision Tree
and their ensembles are popular methods for the machine learning tasks of
classification and regression. Decision trees are widely used since they are easy to interpret,
handle categorical features, extend to the multiclass classification setting, do not require
-feature scaling and are able to capture nonlinearities and feature interactions. Tree ensemble
+feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble
algorithms such as random forests and boosting are among the top performers for classification and
regression tasks.
@@ -19,6 +19,8 @@ MLlib supports decision trees for binary and multiclass classification and for r
using both continuous and categorical features. The implementation partitions data by rows,
allowing distributed training with millions of instances.
+Ensembles of trees (Random Forests and Gradient-Boosted Trees) are described in the [Ensembles guide](mllib-ensembles.html).
+
## Basic algorithm
The decision tree is a greedy algorithm that performs a recursive binary partitioning of the feature
@@ -42,18 +44,18 @@ impurity measure for regression (variance).
<tr>
<td>Gini impurity</td>
<td>Classification</td>
- <td>$\sum_{i=1}^{M} f_i(1-f_i)$</td><td>$f_i$ is the frequency of label $i$ at a node and $M$ is the number of unique labels.</td>
+ <td>$\sum_{i=1}^{C} f_i(1-f_i)$</td><td>$f_i$ is the frequency of label $i$ at a node and $C$ is the number of unique labels.</td>
</tr>
<tr>
<td>Entropy</td>
<td>Classification</td>
- <td>$\sum_{i=1}^{M} -f_ilog(f_i)$</td><td>$f_i$ is the frequency of label $i$ at a node and $M$ is the number of unique labels.</td>
+      <td>$\sum_{i=1}^{C} -f_i \log(f_i)$</td><td>$f_i$ is the frequency of label $i$ at a node and $C$ is the number of unique labels.</td>
</tr>
<tr>
<td>Variance</td>
<td>Regression</td>
- <td>$\frac{1}{n} \sum_{i=1}^{N} (x_i - \mu)^2$</td><td>$y_i$ is label for an instance,
- $N$ is the number of instances and $\mu$ is the mean given by $\frac{1}{N} \sum_{i=1}^n x_i$.</td>
+      <td>$\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2$</td><td>$y_i$ is the label for an instance,
+      $N$ is the number of instances and $\mu$ is the mean given by $\frac{1}{N} \sum_{i=1}^N y_i$.</td>
</tr>
</tbody>
</table>
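+
+As a small worked example of the table above, consider a node containing 8 instances with label 0
+and 2 instances with label 1, so that $f_0 = 0.8$ and $f_1 = 0.2$. Its Gini impurity is
+$0.8 \cdot 0.2 + 0.2 \cdot 0.8 = 0.32$ and its entropy is
+$-0.8 \log(0.8) - 0.2 \log(0.2) \approx 0.50$, whereas a pure node (all instances with the same
+label) has impurity $0$ under either measure. Splits are chosen to reduce this impurity as much as possible.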
@@ -103,36 +105,73 @@ and the resulting `$M-1$` split candidates are considered.
### Stopping rule
-The recursive tree construction is stopped at a node when one of the two conditions is met:
+The recursive tree construction is stopped at a node when one of the following conditions is met:
1. The node depth is equal to the `maxDepth` training parameter.
-2. No split candidate leads to an information gain at the node.
+2. No split candidate leads to an information gain greater than `minInfoGain`.
+3. No split candidate produces child nodes which each have at least `minInstancesPerNode` training instances.
+
+## Usage tips
+
+We include a few guidelines for using decision trees by discussing the various parameters.
+The parameters are listed below roughly in order of descending importance. New users should mainly consider the "Problem specification parameters" section and the `maxDepth` parameter.
+
+### Problem specification parameters
+
+These parameters describe the problem you want to solve and your dataset.
+They should be specified and do not require tuning.
+
+* **`algo`**: `Classification` or `Regression`
+
+* **`numClasses`**: Number of classes (for `Classification` only)
+
+* **`categoricalFeaturesInfo`**: Specifies which features are categorical and how many categorical values each of those features can take. This is given as a map from feature indices to feature arity (number of categories). Any features not in this map are treated as continuous.
+ * E.g., `Map(0 -> 2, 4 -> 10)` specifies that feature `0` is binary (taking values `0` or `1`) and that feature `4` has 10 categories (values `{0, 1, ..., 9}`). Note that feature indices are 0-based: features `0` and `4` are the 1st and 5th elements of an instance's feature vector.
+ * Note that you do not have to specify `categoricalFeaturesInfo`. The algorithm will still run and may get reasonable results. However, performance should be better if categorical features are properly designated.
+
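+As a small sketch of how these problem specification parameters might be passed (the `trainingData`
+RDD and the feature indices/arities are made up for illustration; the call mirrors the
+classification example later in this guide):
+
+{% highlight scala %}
+import org.apache.spark.mllib.tree.DecisionTree
+
+// Hypothetical dataset: feature 0 is binary, feature 4 has 10 categories,
+// and all remaining features are continuous.
+val numClasses = 2
+val categoricalFeaturesInfo = Map(0 -> 2, 4 -> 10)
+// maxBins (32 here) must be at least 10, the largest categorical arity above.
+val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
+  "gini", 5, 32)
+{% endhighlight %}
+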
+### Stopping criteria
+
+These parameters determine when the tree stops building (adding new nodes).
+When tuning these parameters, be careful to validate on held-out test data to avoid overfitting.
+
+* **`maxDepth`**: Maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.
+
+* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) since random forests are often trained deeper than individual decision trees.
+
+* **`minInfoGain`**: For a node to be split further, the split must improve at least this much (in terms of information gain).
+
+### Tunable parameters
-## Implementation details
+These parameters may be tuned. Be careful to validate on held-out test data when tuning in order to avoid overfitting.
-### Max memory requirements
+* **`maxBins`**: Number of bins used when discretizing continuous features.
+ * Increasing `maxBins` allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication.
+ * Note that the `maxBins` parameter must be at least the maximum number of categories `$M$` for any categorical feature.
-For faster processing, the decision tree algorithm performs simultaneous histogram computations for
-all nodes at each level of the tree. This could lead to high memory requirements at deeper levels
-of the tree, potentially leading to memory overflow errors. To alleviate this problem, a `maxMemoryInMB`
-training parameter specifies the maximum amount of memory at the workers (twice as much at the
-master) to be allocated to the histogram computation. The default value is conservatively chosen to
-be 256 MB to allow the decision algorithm to work in most scenarios. Once the memory requirements
-for a level-wise computation cross the `maxMemoryInMB` threshold, the node training tasks at each
-subsequent level are split into smaller tasks.
+* **`maxMemoryInMB`**: Amount of memory to be used for collecting sufficient statistics.
+  * The default value is conservatively chosen to be 256 MB to allow the decision algorithm to work in most scenarios. Increasing `maxMemoryInMB` can lead to faster training (if the memory is available) by allowing fewer passes over the data. However, there may be diminishing returns as `maxMemoryInMB` grows since the amount of communication on each iteration can be proportional to `maxMemoryInMB`.
+  * *Implementation details*: For faster processing, the decision tree algorithm collects statistics about groups of nodes to split (rather than 1 node at a time). The number of nodes which can be handled in one group is determined by the memory requirements (which vary per feature). The `maxMemoryInMB` parameter specifies the memory limit in megabytes which each worker can use for these statistics.
-Note that, if you have a large amount of memory, increasing `maxMemoryInMB` can lead to faster
-training by requiring fewer passes over the data.
+* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
-### Binning feature values
+* **`impurity`**: Impurity measure (discussed above) used to choose between candidate splits. This measure must match the `algo` parameter.
-Increasing `maxBins` allows the algorithm to consider more split candidates and make fine-grained
-split decisions. However, it also increases computation and communication.
+### Caching and checkpointing
-Note that the `maxBins` parameter must be at least the maximum number of categories `$M$` for
-any categorical feature.
+MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) when `numTrees` is set to be large.
-### Scaling
+* **`useNodeIdCache`**: If this is set to true, the algorithm will avoid passing the current model (tree or trees) to executors on each iteration.
+ * This can be useful with deep trees (speeding up computation on workers) and for large Random Forests (reducing communication on each iteration).
+  * *Implementation details*: By default, the algorithm communicates the current model to executors so that executors can match training instances with tree nodes. When this setting is turned on, the algorithm will instead cache this information.
+
+Node ID caching generates a sequence of RDDs (1 per iteration). This long lineage can cause performance problems, but checkpointing intermediate RDDs can alleviate those problems.
+Note that checkpointing is only applicable when `useNodeIdCache` is set to true.
+
+* **`checkpointDir`**: Directory for checkpointing node ID cache RDDs.
+
+* **`checkpointInterval`**: Frequency for checkpointing node ID cache RDDs. Setting this too low will cause extra overhead from writing to HDFS; setting this too high can cause problems if executors fail and the RDD needs to be recomputed.
+
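+The parameters from the last three sections can also be set together on a single
+[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy) object passed
+to `DecisionTree.train`. The sketch below is only illustrative: it assumes the fields shown are
+mutable (as the setter-style usage in the ensemble examples suggests), and the values are made up.
+
+{% highlight scala %}
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
+import org.apache.spark.mllib.tree.impurity.Gini
+
+val strategy = new Strategy(algo = Algo.Classification, impurity = Gini, maxDepth = 10,
+  numClasses = 2)
+// Stopping criteria
+strategy.minInstancesPerNode = 5
+strategy.minInfoGain = 0.001
+// Tunable parameters
+strategy.maxMemoryInMB = 512
+// Caching and checkpointing (a checkpoint directory must also be configured,
+// as described above, for checkpointing to take effect)
+strategy.useNodeIdCache = true
+strategy.checkpointInterval = 10
+
+val model = DecisionTree.train(trainingData, strategy)
+{% endhighlight %}
+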
+## Scaling
Computation scales approximately linearly in the number of training instances,
in the number of features, and in the `maxBins` parameter.
@@ -148,7 +187,7 @@ The example below demonstrates how to load a
[LIBSVM data file](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/),
parse it as an RDD of `LabeledPoint` and then
perform classification using a decision tree with Gini impurity as an impurity measure and a
-maximum tree depth of 5. The training error is calculated to measure the algorithm accuracy.
+maximum tree depth of 5. The test error is calculated to measure the algorithm accuracy.
<div class="codetabs">
@@ -158,8 +197,10 @@ import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
-// Cache the data since we will use it again to compute training error.
-val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// Split the data into training and test sets (30% held out for testing)
+val splits = data.randomSplit(Array(0.7, 0.3))
+val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
@@ -169,17 +210,17 @@ val impurity = "gini"
val maxDepth = 5
val maxBins = 32
-val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity,
- maxDepth, maxBins)
+val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
+ impurity, maxDepth, maxBins)
-// Evaluate model on training instances and compute training error
-val labelAndPreds = data.map { point =>
+// Evaluate model on test instances and compute test error
+val labelAndPreds = testData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
-val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
-println("Training Error = " + trainErr)
-println("Learned classification tree model:\n" + model)
+val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
+println("Test Error = " + testErr)
+println("Learned classification tree model:\n" + model.toDebugString)
{% endhighlight %}
</div>
@@ -187,7 +228,6 @@ println("Learned classification tree model:\n" + model)
{% highlight java %}
import java.util.HashMap;
import scala.Tuple2;
-import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
@@ -203,37 +243,42 @@ SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// Load and parse the data file.
-// Cache the data since we will use it again to compute training error.
String datapath = "data/mllib/sample_libsvm_data.txt";
-JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD().cache();
+JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
+// Split the data into training and test sets (30% held out for testing)
+JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
+JavaRDD<LabeledPoint> trainingData = splits[0];
+JavaRDD<LabeledPoint> testData = splits[1];
// Set parameters.
// Empty categoricalFeaturesInfo indicates all features are continuous.
Integer numClasses = 2;
-HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
+Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
String impurity = "gini";
Integer maxDepth = 5;
Integer maxBins = 32;
// Train a DecisionTree model for classification.
-final DecisionTreeModel model = DecisionTree.trainClassifier(data, numClasses,
+final DecisionTreeModel model = DecisionTree.trainClassifier(trainingData, numClasses,
categoricalFeaturesInfo, impurity, maxDepth, maxBins);
-// Evaluate model on training instances and compute training error
+// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
- data.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
- @Override public Tuple2<Double, Double> call(LabeledPoint p) {
+ testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
+ @Override
+ public Tuple2<Double, Double> call(LabeledPoint p) {
return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
}
});
-Double trainErr =
+Double testErr =
1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
- @Override public Boolean call(Tuple2<Double, Double> pl) {
+ @Override
+ public Boolean call(Tuple2<Double, Double> pl) {
return !pl._1().equals(pl._2());
}
- }).count() / data.count();
-System.out.println("Training error: " + trainErr);
-System.out.println("Learned classification tree model:\n" + model);
+ }).count() / testData.count();
+System.out.println("Test Error: " + testErr);
+System.out.println("Learned classification tree model:\n" + model.toDebugString());
{% endhighlight %}
</div>
@@ -244,26 +289,23 @@ from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
-# Cache the data since we will use it again to compute training error.
-data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt').cache()
+data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
+# Split the data into training and test sets (30% held out for testing)
+(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
-model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
+model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
impurity='gini', maxDepth=5, maxBins=32)
-# Evaluate model on training instances and compute training error
-predictions = model.predict(data.map(lambda x: x.features))
-labelsAndPredictions = data.map(lambda lp: lp.label).zip(predictions)
-trainErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(data.count())
-print('Training Error = ' + str(trainErr))
+# Evaluate model on test instances and compute test error
+predictions = model.predict(testData.map(lambda x: x.features))
+labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
+testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
+print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
-print(model)
+print(model.toDebugString())
{% endhighlight %}
-
-Note: When making predictions for a dataset, it is more efficient to do batch prediction rather
-than separately calling `predict` on each data point. This is because the Python code makes calls
-to an underlying `DecisionTree` model in Scala.
</div>
</div>
@@ -285,8 +327,10 @@ import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
-// Cache the data since we will use it again to compute training error.
-val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").cache()
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// Split the data into training and test sets (30% held out for testing)
+val splits = data.randomSplit(Array(0.7, 0.3))
+val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
@@ -295,17 +339,17 @@ val impurity = "variance"
val maxDepth = 5
val maxBins = 32
-val model = DecisionTree.trainRegressor(data, categoricalFeaturesInfo, impurity,
+val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity,
maxDepth, maxBins)
-// Evaluate model on training instances and compute training error
-val labelsAndPredictions = data.map { point =>
+// Evaluate model on test instances and compute test error
+val labelsAndPredictions = testData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
-val trainMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
-println("Training Mean Squared Error = " + trainMSE)
-println("Learned regression tree model:\n" + model)
+val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
+println("Test Mean Squared Error = " + testMSE)
+println("Learned regression tree model:\n" + model.toDebugString)
{% endhighlight %}
</div>
@@ -325,45 +369,51 @@ import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.SparkConf;
-// Load and parse the data file.
-// Cache the data since we will use it again to compute training error.
-String datapath = "data/mllib/sample_libsvm_data.txt";
-JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD().cache();
-
SparkConf sparkConf = new SparkConf().setAppName("JavaDecisionTree");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
+// Load and parse the data file.
+String datapath = "data/mllib/sample_libsvm_data.txt";
+JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
+// Split the data into training and test sets (30% held out for testing)
+JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
+JavaRDD<LabeledPoint> trainingData = splits[0];
+JavaRDD<LabeledPoint> testData = splits[1];
+
// Set parameters.
// Empty categoricalFeaturesInfo indicates all features are continuous.
-HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
+Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
String impurity = "variance";
Integer maxDepth = 5;
Integer maxBins = 32;
// Train a DecisionTree model.
-final DecisionTreeModel model = DecisionTree.trainRegressor(data,
+final DecisionTreeModel model = DecisionTree.trainRegressor(trainingData,
categoricalFeaturesInfo, impurity, maxDepth, maxBins);
-// Evaluate model on training instances and compute training error
+// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
- data.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
- @Override public Tuple2<Double, Double> call(LabeledPoint p) {
+ testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
+ @Override
+ public Tuple2<Double, Double> call(LabeledPoint p) {
return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
}
});
-Double trainMSE =
+Double testMSE =
predictionAndLabel.map(new Function<Tuple2<Double, Double>, Double>() {
- @Override public Double call(Tuple2<Double, Double> pl) {
+ @Override
+ public Double call(Tuple2<Double, Double> pl) {
Double diff = pl._1() - pl._2();
return diff * diff;
}
}).reduce(new Function2<Double, Double, Double>() {
- @Override public Double call(Double a, Double b) {
+ @Override
+ public Double call(Double a, Double b) {
return a + b;
}
-  }) / data.count();
+  }) / testData.count();
-System.out.println("Training Mean Squared Error: " + trainMSE);
-System.out.println("Learned regression tree model:\n" + model);
+System.out.println("Test Mean Squared Error: " + testMSE);
+System.out.println("Learned regression tree model:\n" + model.toDebugString());
{% endhighlight %}
</div>
@@ -374,26 +424,23 @@ from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
-# Cache the data since we will use it again to compute training error.
-data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt').cache()
+data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
+# Split the data into training and test sets (30% held out for testing)
+(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
-model = DecisionTree.trainRegressor(data, categoricalFeaturesInfo={},
+model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo={},
impurity='variance', maxDepth=5, maxBins=32)
-# Evaluate model on training instances and compute training error
-predictions = model.predict(data.map(lambda x: x.features))
-labelsAndPredictions = data.map(lambda lp: lp.label).zip(predictions)
-trainMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum() / float(data.count())
-print('Training Mean Squared Error = ' + str(trainMSE))
+# Evaluate model on test instances and compute test error
+predictions = model.predict(testData.map(lambda x: x.features))
+labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
+testMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum() / float(testData.count())
+print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression tree model:')
-print(model)
+print(model.toDebugString())
{% endhighlight %}
-
-Note: When making predictions for a dataset, it is more efficient to do batch prediction rather
-than separately calling `predict` on each data point. This is because the Python code makes calls
-to an underlying `DecisionTree` model in Scala.
</div>
</div>
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
new file mode 100644
index 0000000000..23ede04b62
--- /dev/null
+++ b/docs/mllib-ensembles.md
@@ -0,0 +1,653 @@
+---
+layout: global
+title: Ensembles - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Ensembles
+---
+
+* Table of contents
+{:toc}
+
+An [ensemble method](http://en.wikipedia.org/wiki/Ensemble_learning)
+is a learning algorithm which creates a model composed of a set of other base models.
+MLlib supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest).
+Both use [decision trees](mllib-decision-tree.html) as their base models.
+
+## Gradient-Boosted Trees vs. Random Forests
+
+Both [Gradient-Boosted Trees (GBTs)](mllib-ensembles.html#gradient-boosted-trees-gbts) and [Random Forests](mllib-ensembles.html#random-forests) are algorithms for learning ensembles of trees, but the training processes are different. There are several practical trade-offs:
+
+ * GBTs train one tree at a time, so they can take longer to train than random forests. Random Forests can train multiple trees in parallel.
+ * On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with Random Forests, and training smaller trees takes less time.
+ * Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases the likelihood of overfitting. (In statistical language, Random Forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)
+ * Random Forests can be easier to tune since performance improves monotonically with the number of trees (whereas performance can start to decrease for GBTs if the number of trees grows too large).
+
+In short, both algorithms can be effective, and the choice should be based on the particular dataset.
+
+## Random Forests
+
+[Random forests](http://en.wikipedia.org/wiki/Random_forest)
+are ensembles of [decision trees](mllib-decision-tree.html).
+Random forests are one of the most successful machine learning models for classification and
+regression. They combine many decision trees in order to reduce the risk of overfitting.
+Like decision trees, random forests handle categorical features,
+extend to the multiclass classification setting, do not require
+feature scaling, and are able to capture non-linearities and feature interactions.
+
+MLlib supports random forests for binary and multiclass classification and for regression,
+using both continuous and categorical features.
+MLlib implements random forests using the existing [decision tree](mllib-decision-tree.html)
+implementation. Please see the decision tree guide for more information on trees.
+
+### Basic algorithm
+
+Random forests train a set of decision trees separately, so the training can be done in parallel.
+The algorithm injects randomness into the training process so that each decision tree is a bit
+different. Combining the predictions from each tree reduces the variance of the predictions,
+improving the performance on test data.
+
+#### Training
+
+The randomness injected into the training process includes:
+
+* Subsampling the original dataset on each iteration to get a different training set (a.k.a. bootstrapping).
+* Considering different random subsets of features to split on at each tree node.
+
+Apart from these randomizations, decision tree training is done in the same way as for individual decision trees.
+
+#### Prediction
+
+To make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. This aggregation is done differently for classification and regression.
+
+*Classification*: Majority vote. Each tree's prediction is counted as a vote for one class. The label is predicted to be the class which receives the most votes.
+
+*Regression*: Averaging. Each tree predicts a real value. The label is predicted to be the average of the tree predictions.
+
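+To make the aggregation concrete, the following is a schematic sketch (not MLlib's internal code)
+of the two combination rules, assuming the per-tree predictions have already been collected into a
+sequence of `Double`s:
+
+{% highlight scala %}
+// Classification: majority vote over the per-tree predicted labels.
+def majorityVote(treePredictions: Seq[Double]): Double =
+  treePredictions.groupBy(identity).maxBy(_._2.size)._1
+
+// Regression: mean of the per-tree predicted values.
+def average(treePredictions: Seq[Double]): Double =
+  treePredictions.sum / treePredictions.size
+{% endhighlight %}
+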
+### Usage tips
+
+We include a few guidelines for using random forests by discussing the various parameters.
+We omit some decision tree parameters since those are covered in the [decision tree guide](mllib-decision-tree.html).
+
+The first two parameters we mention are the most important, and tuning them can often improve performance:
+
+* **`numTrees`**: Number of trees in the forest.
+ * Increasing the number of trees will decrease the variance in predictions, improving the model's test-time accuracy.
+ * Training time increases roughly linearly in the number of trees.
+
+* **`maxDepth`**: Maximum depth of each tree in the forest.
+ * Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
+ * In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).
+
+The next two parameters generally do not require tuning. However, they can be tuned to speed up training.
+
+* **`subsamplingRate`**: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
+
+* **`featureSubsetStrategy`**: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
+
+### Examples
+
+#### Classification
+
+The example below demonstrates how to load a
+[LIBSVM data file](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/),
+parse it as an RDD of `LabeledPoint` and then
+perform classification using a Random Forest.
+The test error is calculated to measure the algorithm accuracy.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.mllib.tree.RandomForest
+import org.apache.spark.mllib.util.MLUtils
+
+// Load and parse the data file.
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// Split the data into training and test sets (30% held out for testing)
+val splits = data.randomSplit(Array(0.7, 0.3))
+val (trainingData, testData) = (splits(0), splits(1))
+
+// Train a RandomForest model.
+// Empty categoricalFeaturesInfo indicates all features are continuous.
+val numClasses = 2
+val categoricalFeaturesInfo = Map[Int, Int]()
+val numTrees = 3 // Use more in practice.
+val featureSubsetStrategy = "auto" // Let the algorithm choose.
+val impurity = "gini"
+val maxDepth = 4
+val maxBins = 32
+
+val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
+ numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
+
+// Evaluate model on test instances and compute test error
+val labelAndPreds = testData.map { point =>
+ val prediction = model.predict(point.features)
+ (point.label, prediction)
+}
+val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
+println("Test Error = " + testErr)
+println("Learned classification forest model:\n" + model.toDebugString)
+{% endhighlight %}
+</div>
+
+<div data-lang="java">
+{% highlight java %}
+import scala.Tuple2;
+import java.util.HashMap;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.tree.RandomForest;
+import org.apache.spark.mllib.tree.model.RandomForestModel;
+import org.apache.spark.mllib.util.MLUtils;
+
+SparkConf sparkConf = new SparkConf().setAppName("JavaRandomForestClassification");
+JavaSparkContext sc = new JavaSparkContext(sparkConf);
+
+// Load and parse the data file.
+String datapath = "data/mllib/sample_libsvm_data.txt";
+JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
+// Split the data into training and test sets (30% held out for testing)
+JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
+JavaRDD<LabeledPoint> trainingData = splits[0];
+JavaRDD<LabeledPoint> testData = splits[1];
+
+// Train a RandomForest model.
+// Empty categoricalFeaturesInfo indicates all features are continuous.
+Integer numClasses = 2;
+HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
+Integer numTrees = 3; // Use more in practice.
+String featureSubsetStrategy = "auto"; // Let the algorithm choose.
+String impurity = "gini";
+Integer maxDepth = 5;
+Integer maxBins = 32;
+Integer seed = 12345;
+
+final RandomForestModel model = RandomForest.trainClassifier(trainingData, numClasses,
+ categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins,
+ seed);
+
+// Evaluate model on test instances and compute test error
+JavaPairRDD<Double, Double> predictionAndLabel =
+ testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
+ @Override
+ public Tuple2<Double, Double> call(LabeledPoint p) {
+ return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
+ }
+ });
+Double testErr =
+ 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
+ @Override
+ public Boolean call(Tuple2<Double, Double> pl) {
+ return !pl._1().equals(pl._2());
+ }
+ }).count() / testData.count();
+System.out.println("Test Error: " + testErr);
+System.out.println("Learned classification forest model:\n" + model.toDebugString());
+{% endhighlight %}
+</div>
+
+<div data-lang="python">
+{% highlight python %}
+from pyspark.mllib.tree import RandomForest
+from pyspark.mllib.util import MLUtils
+
+# Load and parse the data file into an RDD of LabeledPoint.
+data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
+# Split the data into training and test sets (30% held out for testing)
+(trainingData, testData) = data.randomSplit([0.7, 0.3])
+
+# Train a RandomForest model.
+# Empty categoricalFeaturesInfo indicates all features are continuous.
+# Note: Use larger numTrees in practice.
+# Setting featureSubsetStrategy="auto" lets the algorithm choose.
+model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
+ numTrees=3, featureSubsetStrategy="auto",
+ impurity='gini', maxDepth=4, maxBins=32)
+
+# Evaluate model on test instances and compute test error
+predictions = model.predict(testData.map(lambda x: x.features))
+labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
+testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
+print('Test Error = ' + str(testErr))
+print('Learned classification forest model:')
+print(model.toDebugString())
+{% endhighlight %}
+</div>
+
+</div>
+
+#### Regression
+
+The example below demonstrates how to load a
+[LIBSVM data file](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/),
+parse it as an RDD of `LabeledPoint` and then
+perform regression using a Random Forest.
+The Mean Squared Error (MSE) is computed at the end to evaluate
+[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.mllib.tree.RandomForest
+import org.apache.spark.mllib.util.MLUtils
+
+// Load and parse the data file.
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// Split the data into training and test sets (30% held out for testing)
+val splits = data.randomSplit(Array(0.7, 0.3))
+val (trainingData, testData) = (splits(0), splits(1))
+
+// Train a RandomForest model.
+// Empty categoricalFeaturesInfo indicates all features are continuous.
+val numClasses = 2
+val categoricalFeaturesInfo = Map[Int, Int]()
+val numTrees = 3 // Use more in practice.
+val featureSubsetStrategy = "auto" // Let the algorithm choose.
+val impurity = "variance"
+val maxDepth = 4
+val maxBins = 32
+
+val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,
+ numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
+
+// Evaluate model on test instances and compute test error
+val labelsAndPredictions = testData.map { point =>
+ val prediction = model.predict(point.features)
+ (point.label, prediction)
+}
+val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
+println("Test Mean Squared Error = " + testMSE)
+println("Learned regression forest model:\n" + model.toDebugString)
+{% endhighlight %}
+</div>
+
+<div data-lang="java">
+{% highlight java %}
+import java.util.HashMap;
+import java.util.Map;
+import scala.Tuple2;
+import org.apache.spark.api.java.function.Function2;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.tree.RandomForest;
+import org.apache.spark.mllib.tree.model.RandomForestModel;
+import org.apache.spark.mllib.util.MLUtils;
+import org.apache.spark.SparkConf;
+
+SparkConf sparkConf = new SparkConf().setAppName("JavaRandomForest");
+JavaSparkContext sc = new JavaSparkContext(sparkConf);
+
+// Load and parse the data file.
+String datapath = "data/mllib/sample_libsvm_data.txt";
+JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
+// Split the data into training and test sets (30% held out for testing)
+JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
+JavaRDD<LabeledPoint> trainingData = splits[0];
+JavaRDD<LabeledPoint> testData = splits[1];
+
+// Set parameters.
+// Empty categoricalFeaturesInfo indicates all features are continuous.
+Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
+Integer numTrees = 3; // Use more in practice.
+String featureSubsetStrategy = "auto"; // Let the algorithm choose.
+String impurity = "variance";
+Integer maxDepth = 4;
+Integer maxBins = 32;
+Integer seed = 12345;
+
+// Train a RandomForest model.
+final RandomForestModel model = RandomForest.trainRegressor(trainingData,
+  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);
+
+// Evaluate model on test instances and compute test error
+JavaPairRDD<Double, Double> predictionAndLabel =
+ testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
+ @Override
+ public Tuple2<Double, Double> call(LabeledPoint p) {
+ return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
+ }
+ });
+Double testMSE =
+ predictionAndLabel.map(new Function<Tuple2<Double, Double>, Double>() {
+ @Override
+ public Double call(Tuple2<Double, Double> pl) {
+ Double diff = pl._1() - pl._2();
+ return diff * diff;
+ }
+ }).reduce(new Function2<Double, Double, Double>() {
+ @Override
+ public Double call(Double a, Double b) {
+ return a + b;
+ }
+ }) / testData.count();
+System.out.println("Test Mean Squared Error: " + testMSE);
+System.out.println("Learned regression forest model:\n" + model.toDebugString());
+{% endhighlight %}
+</div>
+
+<div data-lang="python">
+{% highlight python %}
+from pyspark.mllib.tree import RandomForest
+from pyspark.mllib.util import MLUtils
+
+# Load and parse the data file into an RDD of LabeledPoint.
+data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
+# Split the data into training and test sets (30% held out for testing)
+(trainingData, testData) = data.randomSplit([0.7, 0.3])
+
+# Train a RandomForest model.
+# Empty categoricalFeaturesInfo indicates all features are continuous.
+# Note: Use larger numTrees in practice.
+# Setting featureSubsetStrategy="auto" lets the algorithm choose.
+model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
+ numTrees=3, featureSubsetStrategy="auto",
+ impurity='variance', maxDepth=4, maxBins=32)
+
+# Evaluate model on test instances and compute test error
+predictions = model.predict(testData.map(lambda x: x.features))
+labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
+testMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum() / float(testData.count())
+print('Test Mean Squared Error = ' + str(testMSE))
+print('Learned regression forest model:')
+print(model.toDebugString())
+{% endhighlight %}
+</div>
+
+</div>
+
+## Gradient-Boosted Trees (GBTs)
+
+[Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting)
+are ensembles of [decision trees](mllib-decision-tree.html).
+GBTs iteratively train decision trees in order to minimize a loss function.
+Like decision trees, GBTs handle categorical features,
+extend to the multiclass classification setting, do not require
+feature scaling, and are able to capture non-linearities and feature interactions.
+
+MLlib supports GBTs for binary classification and for regression,
+using both continuous and categorical features.
+MLlib implements GBTs using the existing [decision tree](mllib-decision-tree.html) implementation. Please see the decision tree guide for more information on trees.
+
+*Note*: GBTs do not yet support multiclass classification. For multiclass problems, please use
+[decision trees](mllib-decision-tree.html) or [Random Forests](mllib-ensembles.html#random-forests).
+
+### Basic algorithm
+
+Gradient boosting iteratively trains a sequence of decision trees.
+On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label. The dataset is re-labeled to put more emphasis on training instances with poor predictions. Thus, in the next iteration, the decision tree will help correct for previous mistakes.
+
+The specific mechanism for re-labeling instances is defined by a loss function (discussed below). With each iteration, GBTs further reduce this loss function on the training data.
+
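+Abstractly, if $F_m$ denotes the ensemble after $m$ iterations, the next tree $h_{m+1}$ is chosen to
+reduce the loss of the updated ensemble $F_{m+1}(x) = F_m(x) + \nu \cdot h_{m+1}(x)$ on the training
+data, where $\nu$ is the `learningRate` parameter described below. This is the standard gradient
+boosting formulation and is intended only as a mental model of the iteration, not as a description
+of MLlib's internal implementation.
+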
+#### Losses
+
+The table below lists the losses currently supported by GBTs in MLlib.
+Note that each loss is applicable to one of classification or regression, not both.
+
+Notation: $N$ = number of instances. $y_i$ = label of instance $i$. $x_i$ = features of instance $i$. $F(x_i)$ = model's predicted label for instance $i$.
+
+<table class="table">
+ <thead>
+ <tr><th>Loss</th><th>Task</th><th>Formula</th><th>Description</th></tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Log Loss</td>
+ <td>Classification</td>
+ <td>$2 \sum_{i=1}^{N} \log(1+\exp(-2 y_i F(x_i)))$</td><td>Twice binomial negative log likelihood.</td>
+ </tr>
+ <tr>
+ <td>Squared Error</td>
+ <td>Regression</td>
+ <td>$\sum_{i=1}^{N} (y_i - F(x_i))^2$</td><td>Also called L2 loss. Default loss for regression tasks.</td>
+ </tr>
+ <tr>
+ <td>Absolute Error</td>
+ <td>Regression</td>
+ <td>$\sum_{i=1}^{N} |y_i - F(x_i)|$</td><td>Also called L1 loss. Can be more robust to outliers than Squared Error.</td>
+ </tr>
+ </tbody>
+</table>
+
+### Usage tips
+
+We include a few guidelines for using GBTs by discussing the various parameters.
+We omit some decision tree parameters since those are covered in the [decision tree guide](mllib-decision-tree.html).
+
+* **`loss`**: See the section above for information on losses and their applicability to tasks (classification vs. regression). Different losses can give significantly different results, depending on the dataset.
+
+* **`numIterations`**: This sets the number of trees in the ensemble. Each iteration produces one tree. Increasing this number makes the model more expressive, improving training data accuracy. However, test-time accuracy may suffer if this is too large.
+
+* **`learningRate`**: This parameter should not need to be tuned. If the algorithm behavior seems unstable, decreasing this value may improve stability.
+
+* **`algo`**: The algorithm or task (classification vs. regression) is set using the tree [`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy) parameter.
+
+
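+As a small sketch of how these parameters are typically set, the snippet below starts from the
+default regression parameters and overrides them. It assumes `loss` and `learningRate` are mutable
+fields of `BoostingStrategy` (like `numIterations` in the examples below) and that the
+`AbsoluteError` loss object lives in `org.apache.spark.mllib.tree.loss`; treat it as illustrative
+rather than definitive.
+
+{% highlight scala %}
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.mllib.tree.loss.AbsoluteError
+
+val boostingStrategy = BoostingStrategy.defaultParams("Regression")
+boostingStrategy.numIterations = 100   // Number of trees in the ensemble.
+boostingStrategy.learningRate = 0.1    // Usually left at its default.
+boostingStrategy.loss = AbsoluteError  // Switch from the default SquaredError to L1 loss.
+boostingStrategy.treeStrategy.maxDepth = 3
+{% endhighlight %}
+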
+### Examples
+
+GBTs currently have APIs in Scala and Java. Examples in both languages are shown below.
+
+#### Classification
+
+The example below demonstrates how to load a
+[LIBSVM data file](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/),
+parse it as an RDD of `LabeledPoint` and then
+perform classification using Gradient-Boosted Trees with log loss.
+The test error is calculated to measure the algorithm accuracy.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.mllib.tree.GradientBoostedTrees
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.mllib.util.MLUtils
+
+// Load and parse the data file.
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// Split the data into training and test sets (30% held out for testing)
+val splits = data.randomSplit(Array(0.7, 0.3))
+val (trainingData, testData) = (splits(0), splits(1))
+
+// Train a GradientBoostedTrees model.
+// The defaultParams for Classification use LogLoss by default.
+val boostingStrategy = BoostingStrategy.defaultParams("Classification")
+boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.
+boostingStrategy.treeStrategy.numClasses = 2
+boostingStrategy.treeStrategy.maxDepth = 5
+// Empty categoricalFeaturesInfo indicates all features are continuous.
+boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
+
+val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
+
+// Evaluate model on test instances and compute test error
+val labelAndPreds = testData.map { point =>
+ val prediction = model.predict(point.features)
+ (point.label, prediction)
+}
+val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
+println("Test Error = " + testErr)
+println("Learned classification GBT model:\n" + model.toDebugString)
+{% endhighlight %}
+</div>
+
+<div data-lang="java">
+{% highlight java %}
+import scala.Tuple2;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.tree.GradientBoostedTrees;
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
+import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;
+import org.apache.spark.mllib.util.MLUtils;
+
+SparkConf sparkConf = new SparkConf().setAppName("JavaGradientBoostedTrees");
+JavaSparkContext sc = new JavaSparkContext(sparkConf);
+
+// Load and parse the data file.
+String datapath = "data/mllib/sample_libsvm_data.txt";
+JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
+// Split the data into training and test sets (30% held out for testing)
+JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
+JavaRDD<LabeledPoint> trainingData = splits[0];
+JavaRDD<LabeledPoint> testData = splits[1];
+
+// Train a GradientBoostedTrees model.
+// The defaultParams for Classification use LogLoss by default.
+BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
+boostingStrategy.setNumIterations(3); // Note: Use more iterations in practice.
+boostingStrategy.getTreeStrategy().setNumClasses(2);
+boostingStrategy.getTreeStrategy().setMaxDepth(5);
+// Empty categoricalFeaturesInfo indicates all features are continuous.
+Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
+boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
+
+final GradientBoostedTreesModel model =
+ GradientBoostedTrees.train(trainingData, boostingStrategy);
+
+// Evaluate model on test instances and compute test error
+JavaPairRDD<Double, Double> predictionAndLabel =
+ testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
+ @Override
+ public Tuple2<Double, Double> call(LabeledPoint p) {
+ return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
+ }
+ });
+Double testErr =
+ 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
+ @Override
+ public Boolean call(Tuple2<Double, Double> pl) {
+ return !pl._1().equals(pl._2());
+ }
+ }).count() / testData.count();
+System.out.println("Test Error: " + testErr);
+System.out.println("Learned classification GBT model:\n" + model.toDebugString());
+{% endhighlight %}
+</div>
+
+</div>
+
+#### Regression
+
+The example below demonstrates how to load a
+[LIBSVM data file](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/),
+parse it as an RDD of `LabeledPoint` and then
+perform regression using Gradient-Boosted Trees with Squared Error as the loss.
+The Mean Squared Error (MSE) is computed at the end to evaluate
+[goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit).
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.mllib.tree.GradientBoostedTrees
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.mllib.util.MLUtils
+
+// Load and parse the data file.
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// Split the data into training and test sets (30% held out for testing)
+val splits = data.randomSplit(Array(0.7, 0.3))
+val (trainingData, testData) = (splits(0), splits(1))
+
+// Train a GradientBoostedTrees model.
+// The defaultParams for Regression use SquaredError by default.
+val boostingStrategy = BoostingStrategy.defaultParams("Regression")
+boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.
+boostingStrategy.treeStrategy.maxDepth = 5
+// Empty categoricalFeaturesInfo indicates all features are continuous.
+boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
+
+val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
+
+// Evaluate model on test instances and compute test error
+val labelsAndPredictions = testData.map { point =>
+ val prediction = model.predict(point.features)
+ (point.label, prediction)
+}
+val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
+println("Test Mean Squared Error = " + testMSE)
+println("Learned regression GBT model:\n" + model.toDebugString)
+{% endhighlight %}
+</div>
+
+<div data-lang="java">
+{% highlight java %}
+import scala.Tuple2;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.function.Function2;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.tree.GradientBoostedTrees;
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
+import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;
+import org.apache.spark.mllib.util.MLUtils;
+
+SparkConf sparkConf = new SparkConf().setAppName("JavaGradientBoostedTrees");
+JavaSparkContext sc = new JavaSparkContext(sparkConf);
+
+// Load and parse the data file.
+String datapath = "data/mllib/sample_libsvm_data.txt";
+JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
+// Split the data into training and test sets (30% held out for testing)
+JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
+JavaRDD<LabeledPoint> trainingData = splits[0];
+JavaRDD<LabeledPoint> testData = splits[1];
+
+// Train a GradientBoostedTrees model.
+// The defaultParams for Regression use SquaredError by default.
+BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Regression");
+boostingStrategy.setNumIterations(3); // Note: Use more iterations in practice.
+boostingStrategy.getTreeStrategy().setMaxDepth(5);
+// Empty categoricalFeaturesInfo indicates all features are continuous.
+Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
+boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
+
+final GradientBoostedTreesModel model =
+ GradientBoostedTrees.train(trainingData, boostingStrategy);
+
+// Evaluate model on test instances and compute test error
+JavaPairRDD<Double, Double> predictionAndLabel =
+ testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
+ @Override
+ public Tuple2<Double, Double> call(LabeledPoint p) {
+ return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
+ }
+ });
+Double testMSE =
+ predictionAndLabel.map(new Function<Tuple2<Double, Double>, Double>() {
+ @Override
+ public Double call(Tuple2<Double, Double> pl) {
+ Double diff = pl._1() - pl._2();
+ return diff * diff;
+ }
+ }).reduce(new Function2<Double, Double, Double>() {
+ @Override
+ public Double call(Double a, Double b) {
+ return a + b;
+ }
+  }) / testData.count();
+System.out.println("Test Mean Squared Error: " + testMSE);
+System.out.println("Learned regression GBT model:\n" + model.toDebugString());
+{% endhighlight %}
+</div>
+
+</div>
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 94fc98ce4f..dcb6819f46 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -16,8 +16,9 @@ filtering, dimensionality reduction, as well as underlying optimization primitiv
* random data generation
* [Classification and regression](mllib-classification-regression.html)
* [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html)
- * [decision trees](mllib-decision-tree.html)
* [naive Bayes](mllib-naive-bayes.html)
+ * [decision trees](mllib-decision-tree.html)
+ * [ensembles of trees](mllib-ensembles.html) (Random Forests and Gradient-Boosted Trees)
* [Collaborative filtering](mllib-collaborative-filtering.html)
* alternating least squares (ALS)
* [Clustering](mllib-clustering.html)
@@ -60,6 +61,32 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
# Migration Guide
+## From 1.1 to 1.2
+
+The only API changes in MLlib v1.2 are in
+[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+which continues to be an experimental API in MLlib 1.2:
+
+1. *(Breaking change)* The Scala API for classification takes a named argument specifying the number
+of classes. In MLlib v1.1, this argument was called `numClasses` in Python and
+`numClassesForClassification` in Scala. In MLlib v1.2, the names are both set to `numClasses`.
+This `numClasses` parameter is specified either via
+[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
+or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
+static `trainClassifier` and `trainRegressor` methods.
+
+2. *(Breaking change)* The API for
+[`Node`](api/scala/index.html#org.apache.spark.mllib.tree.model.Node) has changed.
+This should generally not affect user code, unless the user manually constructs decision trees
+(instead of using the `trainClassifier` or `trainRegressor` methods).
+The tree `Node` now includes more information, including the probability of the predicted label
+(for classification).
+
+3. The output of the printing methods has changed. The `toString` (Scala/Java) and `__repr__` (Python) methods used to print the full model; they now print a summary. For the full model, use `toDebugString`.
+
+Examples in the Spark distribution and examples in the
+[Decision Trees Guide](mllib-decision-tree.html#examples) have been updated accordingly.
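+
+For reference, here is a minimal sketch of the renamed parameter and the new printing behavior in
+the v1.2 Scala API. The data path and parameter values are only illustrative, and `sc` is assumed
+to be an existing `SparkContext` (e.g., from the Spark shell):
+
+{% highlight scala %}
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Classification
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Gini
+import org.apache.spark.mllib.util.MLUtils
+
+// Illustrative data path; any RDD[LabeledPoint] works here.
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+
+// MLlib v1.1 (Scala) used numClassesForClassification; in v1.2 the argument is numClasses.
+val model = DecisionTree.trainClassifier(data, numClasses = 2,
+  categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini", maxDepth = 5, maxBins = 32)
+
+// The same rename applies when a Strategy is constructed directly.
+val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 5, numClasses = 2)
+
+// toString now prints only a summary; toDebugString prints the full model.
+println(model)
+println(model.toDebugString)
+{% endhighlight %}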
+
## From 1.0 to 1.1
The only API changes in MLlib v1.1 are in
diff --git a/examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostedTreesRunner.java b/examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostedTreesRunner.java
index 4a5ac404ea..a1844d5d07 100644
--- a/examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostedTreesRunner.java
+++ b/examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostedTreesRunner.java
@@ -73,7 +73,7 @@ public final class JavaGradientBoostedTreesRunner {
return p.label();
}
}).countByValue().size();
- boostingStrategy.treeStrategy().setNumClassesForClassification(numClasses);
+ boostingStrategy.treeStrategy().setNumClasses(numClasses);
// Train a GradientBoosting model for classification.
final GradientBoostedTreesModel model = GradientBoostedTrees.train(data, boostingStrategy);
diff --git a/examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestExample.java b/examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestExample.java
new file mode 100644
index 0000000000..89a4e092a5
--- /dev/null
+++ b/examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestExample.java
@@ -0,0 +1,139 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib;
+
+import scala.Tuple2;
+
+import java.util.HashMap;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.Function2;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.tree.RandomForest;
+import org.apache.spark.mllib.tree.model.RandomForestModel;
+import org.apache.spark.mllib.util.MLUtils;
+
+public final class JavaRandomForestExample {
+
+ /**
+ * Note: This example illustrates binary classification.
+ * For information on multiclass classification, please refer to the JavaDecisionTree.java
+ * example.
+ */
+ private static void testClassification(JavaRDD<LabeledPoint> trainingData,
+ JavaRDD<LabeledPoint> testData) {
+ // Train a RandomForest model.
+ // Empty categoricalFeaturesInfo indicates all features are continuous.
+ Integer numClasses = 2;
+ HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
+ Integer numTrees = 3; // Use more in practice.
+ String featureSubsetStrategy = "auto"; // Let the algorithm choose.
+ String impurity = "gini";
+ Integer maxDepth = 4;
+ Integer maxBins = 32;
+ Integer seed = 12345;
+
+ final RandomForestModel model = RandomForest.trainClassifier(trainingData, numClasses,
+ categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins,
+ seed);
+
+ // Evaluate model on test instances and compute test error
+ JavaPairRDD<Double, Double> predictionAndLabel =
+ testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
+ @Override
+ public Tuple2<Double, Double> call(LabeledPoint p) {
+ return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
+ }
+ });
+ Double testErr =
+ 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
+ @Override
+ public Boolean call(Tuple2<Double, Double> pl) {
+ return !pl._1().equals(pl._2());
+ }
+ }).count() / testData.count();
+ System.out.println("Test Error: " + testErr);
+ System.out.println("Learned classification forest model:\n" + model.toDebugString());
+ }
+
+ private static void testRegression(JavaRDD<LabeledPoint> trainingData,
+ JavaRDD<LabeledPoint> testData) {
+ // Train a RandomForest model.
+ // Empty categoricalFeaturesInfo indicates all features are continuous.
+ HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
+ Integer numTrees = 3; // Use more in practice.
+ String featureSubsetStrategy = "auto"; // Let the algorithm choose.
+ String impurity = "variance";
+ Integer maxDepth = 4;
+ Integer maxBins = 32;
+ Integer seed = 12345;
+
+ final RandomForestModel model = RandomForest.trainRegressor(trainingData,
+ categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins,
+ seed);
+
+ // Evaluate model on test instances and compute test error
+ JavaPairRDD<Double, Double> predictionAndLabel =
+ testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
+ @Override
+ public Tuple2<Double, Double> call(LabeledPoint p) {
+ return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
+ }
+ });
+ Double testMSE =
+ predictionAndLabel.map(new Function<Tuple2<Double, Double>, Double>() {
+ @Override
+ public Double call(Tuple2<Double, Double> pl) {
+ Double diff = pl._1() - pl._2();
+ return diff * diff;
+ }
+ }).reduce(new Function2<Double, Double, Double>() {
+ @Override
+ public Double call(Double a, Double b) {
+ return a + b;
+ }
+ }) / testData.count();
+ System.out.println("Test Mean Squared Error: " + testMSE);
+ System.out.println("Learned regression forest model:\n" + model.toDebugString());
+ }
+
+ public static void main(String[] args) {
+ SparkConf sparkConf = new SparkConf().setAppName("JavaRandomForestExample");
+ JavaSparkContext sc = new JavaSparkContext(sparkConf);
+
+ // Load and parse the data file.
+ String datapath = "data/mllib/sample_libsvm_data.txt";
+ JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc.sc(), datapath).toJavaRDD();
+ // Split the data into training and test sets (30% held out for testing)
+ JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
+ JavaRDD<LabeledPoint> trainingData = splits[0];
+ JavaRDD<LabeledPoint> testData = splits[1];
+
+ System.out.println("\nRunning example of classification using RandomForest\n");
+ testClassification(trainingData, testData);
+
+ System.out.println("\nRunning example of regression using RandomForest\n");
+ testRegression(trainingData, testData);
+ sc.stop();
+ }
+}
diff --git a/examples/src/main/python/mllib/decision_tree_runner.py b/examples/src/main/python/mllib/decision_tree_runner.py
index 61ea4e06ec..fccabd841b 100755
--- a/examples/src/main/python/mllib/decision_tree_runner.py
+++ b/examples/src/main/python/mllib/decision_tree_runner.py
@@ -106,8 +106,7 @@ def reindexClassLabels(data):
def usage():
print >> sys.stderr, \
- "Usage: decision_tree_runner [libsvm format data filepath]\n" + \
- " Note: This only supports binary classification."
+ "Usage: decision_tree_runner [libsvm format data filepath]"
exit(1)
@@ -127,16 +126,20 @@ if __name__ == "__main__":
# Re-index class labels if needed.
(reindexedData, origToNewLabels) = reindexClassLabels(points)
+ numClasses = len(origToNewLabels)
# Train a classifier.
categoricalFeaturesInfo = {} # no categorical features
- model = DecisionTree.trainClassifier(reindexedData, numClasses=2,
+ model = DecisionTree.trainClassifier(reindexedData, numClasses=numClasses,
categoricalFeaturesInfo=categoricalFeaturesInfo)
# Print learned tree and stats.
print "Trained DecisionTree for classification:"
- print " Model numNodes: %d\n" % model.numNodes()
- print " Model depth: %d\n" % model.depth()
- print " Training accuracy: %g\n" % getAccuracy(model, reindexedData)
- print model
+ print " Model numNodes: %d" % model.numNodes()
+ print " Model depth: %d" % model.depth()
+ print " Training accuracy: %g" % getAccuracy(model, reindexedData)
+ if model.numNodes() < 20:
+ print model.toDebugString()
+ else:
+ print model
sc.stop()
diff --git a/examples/src/main/python/mllib/random_forest_example.py b/examples/src/main/python/mllib/random_forest_example.py
new file mode 100755
index 0000000000..d3c24f7664
--- /dev/null
+++ b/examples/src/main/python/mllib/random_forest_example.py
@@ -0,0 +1,89 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Random Forest classification and regression using MLlib.
+
+Note: This example illustrates binary classification.
+ For information on multiclass classification, please refer to the decision_tree_runner.py
+ example.
+"""
+
+import sys
+
+from pyspark.context import SparkContext
+from pyspark.mllib.tree import RandomForest
+from pyspark.mllib.util import MLUtils
+
+
+def testClassification(trainingData, testData):
+ # Train a RandomForest model.
+ # Empty categoricalFeaturesInfo indicates all features are continuous.
+ # Note: Use larger numTrees in practice.
+ # Setting featureSubsetStrategy="auto" lets the algorithm choose.
+ model = RandomForest.trainClassifier(trainingData, numClasses=2,
+ categoricalFeaturesInfo={},
+ numTrees=3, featureSubsetStrategy="auto",
+ impurity='gini', maxDepth=4, maxBins=32)
+
+ # Evaluate model on test instances and compute test error
+ predictions = model.predict(testData.map(lambda x: x.features))
+ labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
+ testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count()\
+ / float(testData.count())
+ print('Test Error = ' + str(testErr))
+ print('Learned classification forest model:')
+ print(model.toDebugString())
+
+
+def testRegression(trainingData, testData):
+ # Train a RandomForest model.
+ # Empty categoricalFeaturesInfo indicates all features are continuous.
+ # Note: Use larger numTrees in practice.
+ # Setting featureSubsetStrategy="auto" lets the algorithm choose.
+ model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
+ numTrees=3, featureSubsetStrategy="auto",
+ impurity='variance', maxDepth=4, maxBins=32)
+
+ # Evaluate model on test instances and compute test error
+ predictions = model.predict(testData.map(lambda x: x.features))
+ labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
+ testMSE = labelsAndPredictions.map(lambda (v, p): (v - p) * (v - p)).sum()\
+ / float(testData.count())
+ print('Test Mean Squared Error = ' + str(testMSE))
+ print('Learned regression forest model:')
+ print(model.toDebugString())
+
+
+if __name__ == "__main__":
+ if len(sys.argv) > 1:
+ print >> sys.stderr, "Usage: random_forest_example"
+ exit(1)
+ sc = SparkContext(appName="PythonRandomForestExample")
+
+ # Load and parse the data file into an RDD of LabeledPoint.
+ data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
+ # Split the data into training and test sets (30% held out for testing)
+ (trainingData, testData) = data.randomSplit([0.7, 0.3])
+
+ print('\nRunning example of classification using RandomForest\n')
+ testClassification(trainingData, testData)
+
+ print('\nRunning example of regression using RandomForest\n')
+ testRegression(trainingData, testData)
+
+ sc.stop()
diff --git a/examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala b/examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
index 54953adb5f..205d80dd02 100644
--- a/examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
@@ -278,7 +278,7 @@ object DecisionTreeRunner {
impurity = impurityCalculator,
maxDepth = params.maxDepth,
maxBins = params.maxBins,
- numClassesForClassification = numClasses,
+ numClasses = numClasses,
minInstancesPerNode = params.minInstancesPerNode,
minInfoGain = params.minInfoGain,
useNodeIdCache = params.useNodeIdCache,
diff --git a/examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostedTreesRunner.scala b/examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostedTreesRunner.scala
index 1def8b45a2..431ead8c0c 100644
--- a/examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostedTreesRunner.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostedTreesRunner.scala
@@ -103,7 +103,7 @@ object GradientBoostedTreesRunner {
params.dataFormat, params.testInput, Algo.withName(params.algo), params.fracTest)
val boostingStrategy = BoostingStrategy.defaultParams(params.algo)
- boostingStrategy.treeStrategy.numClassesForClassification = numClasses
+ boostingStrategy.treeStrategy.numClasses = numClasses
boostingStrategy.numIterations = params.numIterations
boostingStrategy.treeStrategy.maxDepth = params.maxDepth
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index 9f20cd5d00..c4e5fd8e46 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -477,7 +477,7 @@ class PythonMLLibAPI extends Serializable {
algo = algo,
impurity = impurity,
maxDepth = maxDepth,
- numClassesForClassification = numClasses,
+ numClasses = numClasses,
maxBins = maxBins,
categoricalFeaturesInfo = categoricalFeaturesInfo.asScala.toMap,
minInstancesPerNode = minInstancesPerNode,
@@ -513,7 +513,7 @@ class PythonMLLibAPI extends Serializable {
algo = algo,
impurity = impurity,
maxDepth = maxDepth,
- numClassesForClassification = numClasses,
+ numClasses = numClasses,
maxBins = maxBins,
categoricalFeaturesInfo = categoricalFeaturesInfo.asScala.toMap)
val cached = data.rdd.persist(StorageLevel.MEMORY_AND_DISK)
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
index 3d91867c89..73e7e32c6d 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
@@ -136,7 +136,7 @@ object DecisionTree extends Serializable with Logging {
* @param impurity impurity criterion used for information gain calculation
* @param maxDepth Maximum depth of the tree.
* E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
- * @param numClassesForClassification number of classes for classification. Default value of 2.
+ * @param numClasses number of classes for classification. Default value of 2.
* @return DecisionTreeModel that can be used for prediction
*/
def train(
@@ -144,8 +144,8 @@ object DecisionTree extends Serializable with Logging {
algo: Algo,
impurity: Impurity,
maxDepth: Int,
- numClassesForClassification: Int): DecisionTreeModel = {
- val strategy = new Strategy(algo, impurity, maxDepth, numClassesForClassification)
+ numClasses: Int): DecisionTreeModel = {
+ val strategy = new Strategy(algo, impurity, maxDepth, numClasses)
new DecisionTree(strategy).run(input)
}
@@ -164,7 +164,7 @@ object DecisionTree extends Serializable with Logging {
* @param impurity criterion used for information gain calculation
* @param maxDepth Maximum depth of the tree.
* E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
- * @param numClassesForClassification number of classes for classification. Default value of 2.
+ * @param numClasses number of classes for classification. Default value of 2.
* @param maxBins maximum number of bins used for splitting features
* @param quantileCalculationStrategy algorithm for calculating quantiles
* @param categoricalFeaturesInfo Map storing arity of categorical features.
@@ -177,11 +177,11 @@ object DecisionTree extends Serializable with Logging {
algo: Algo,
impurity: Impurity,
maxDepth: Int,
- numClassesForClassification: Int,
+ numClasses: Int,
maxBins: Int,
quantileCalculationStrategy: QuantileStrategy,
categoricalFeaturesInfo: Map[Int,Int]): DecisionTreeModel = {
- val strategy = new Strategy(algo, impurity, maxDepth, numClassesForClassification, maxBins,
+ val strategy = new Strategy(algo, impurity, maxDepth, numClasses, maxBins,
quantileCalculationStrategy, categoricalFeaturesInfo)
new DecisionTree(strategy).run(input)
}
@@ -191,7 +191,7 @@ object DecisionTree extends Serializable with Logging {
*
* @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
* Labels should take values {0, 1, ..., numClasses-1}.
- * @param numClassesForClassification number of classes for classification.
+ * @param numClasses number of classes for classification.
* @param categoricalFeaturesInfo Map storing arity of categorical features.
* E.g., an entry (n -> k) indicates that feature n is categorical
* with k categories indexed from 0: {0, 1, ..., k-1}.
@@ -206,13 +206,13 @@ object DecisionTree extends Serializable with Logging {
*/
def trainClassifier(
input: RDD[LabeledPoint],
- numClassesForClassification: Int,
+ numClasses: Int,
categoricalFeaturesInfo: Map[Int, Int],
impurity: String,
maxDepth: Int,
maxBins: Int): DecisionTreeModel = {
val impurityType = Impurities.fromString(impurity)
- train(input, Classification, impurityType, maxDepth, numClassesForClassification, maxBins, Sort,
+ train(input, Classification, impurityType, maxDepth, numClasses, maxBins, Sort,
categoricalFeaturesInfo)
}
@@ -221,12 +221,12 @@ object DecisionTree extends Serializable with Logging {
*/
def trainClassifier(
input: JavaRDD[LabeledPoint],
- numClassesForClassification: Int,
+ numClasses: Int,
categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer],
impurity: String,
maxDepth: Int,
maxBins: Int): DecisionTreeModel = {
- trainClassifier(input.rdd, numClassesForClassification,
+ trainClassifier(input.rdd, numClasses,
categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap,
impurity, maxDepth, maxBins)
}
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala
index 482d339551..e9304b5e5c 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala
@@ -56,7 +56,7 @@ import org.apache.spark.util.Utils
* etc.
* @param numTrees If 1, then no bootstrapping is used. If > 1, then bootstrapping is done.
* @param featureSubsetStrategy Number of features to consider for splits at each node.
- * Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
+ * Supported: "auto", "all", "sqrt", "log2", "onethird".
* If "auto" is set, this parameter is set based on numTrees:
* if numTrees == 1, set to "all";
* if numTrees > 1 (forest) set to "sqrt" for classification and
@@ -269,7 +269,7 @@ object RandomForest extends Serializable with Logging {
* @param strategy Parameters for training each tree in the forest.
* @param numTrees Number of trees in the random forest.
* @param featureSubsetStrategy Number of features to consider for splits at each node.
- * Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
+ * Supported: "auto", "all", "sqrt", "log2", "onethird".
* If "auto" is set, this parameter is set based on numTrees:
* if numTrees == 1, set to "all";
* if numTrees > 1 (forest) set to "sqrt".
@@ -293,13 +293,13 @@ object RandomForest extends Serializable with Logging {
*
* @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
* Labels should take values {0, 1, ..., numClasses-1}.
- * @param numClassesForClassification number of classes for classification.
+ * @param numClasses number of classes for classification.
* @param categoricalFeaturesInfo Map storing arity of categorical features.
* E.g., an entry (n -> k) indicates that feature n is categorical
* with k categories indexed from 0: {0, 1, ..., k-1}.
* @param numTrees Number of trees in the random forest.
* @param featureSubsetStrategy Number of features to consider for splits at each node.
- * Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
+ * Supported: "auto", "all", "sqrt", "log2", "onethird".
* If "auto" is set, this parameter is set based on numTrees:
* if numTrees == 1, set to "all";
* if numTrees > 1 (forest) set to "sqrt".
@@ -315,7 +315,7 @@ object RandomForest extends Serializable with Logging {
*/
def trainClassifier(
input: RDD[LabeledPoint],
- numClassesForClassification: Int,
+ numClasses: Int,
categoricalFeaturesInfo: Map[Int, Int],
numTrees: Int,
featureSubsetStrategy: String,
@@ -325,7 +325,7 @@ object RandomForest extends Serializable with Logging {
seed: Int = Utils.random.nextInt()): RandomForestModel = {
val impurityType = Impurities.fromString(impurity)
val strategy = new Strategy(Classification, impurityType, maxDepth,
- numClassesForClassification, maxBins, Sort, categoricalFeaturesInfo)
+ numClasses, maxBins, Sort, categoricalFeaturesInfo)
trainClassifier(input, strategy, numTrees, featureSubsetStrategy, seed)
}
@@ -334,7 +334,7 @@ object RandomForest extends Serializable with Logging {
*/
def trainClassifier(
input: JavaRDD[LabeledPoint],
- numClassesForClassification: Int,
+ numClasses: Int,
categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer],
numTrees: Int,
featureSubsetStrategy: String,
@@ -342,7 +342,7 @@ object RandomForest extends Serializable with Logging {
maxDepth: Int,
maxBins: Int,
seed: Int): RandomForestModel = {
- trainClassifier(input.rdd, numClassesForClassification,
+ trainClassifier(input.rdd, numClasses,
categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed)
}
@@ -355,7 +355,7 @@ object RandomForest extends Serializable with Logging {
* @param strategy Parameters for training each tree in the forest.
* @param numTrees Number of trees in the random forest.
* @param featureSubsetStrategy Number of features to consider for splits at each node.
- * Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
+ * Supported: "auto", "all", "sqrt", "log2", "onethird".
* If "auto" is set, this parameter is set based on numTrees:
* if numTrees == 1, set to "all";
* if numTrees > 1 (forest) set to "onethird".
@@ -384,7 +384,7 @@ object RandomForest extends Serializable with Logging {
* with k categories indexed from 0: {0, 1, ..., k-1}.
* @param numTrees Number of trees in the random forest.
* @param featureSubsetStrategy Number of features to consider for splits at each node.
- * Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
+ * Supported: "auto", "all", "sqrt", "log2", "onethird".
* If "auto" is set, this parameter is set based on numTrees:
* if numTrees == 1, set to "all";
* if numTrees > 1 (forest) set to "onethird".
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala
index e703adbdbf..cf51d041c6 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.scala
@@ -51,7 +51,7 @@ case class BoostingStrategy(
private[tree] def assertValid(): Unit = {
treeStrategy.algo match {
case Classification =>
- require(treeStrategy.numClassesForClassification == 2,
+ require(treeStrategy.numClasses == 2,
"Only binary classification is supported for boosting.")
case Regression =>
// nothing
@@ -80,12 +80,12 @@ object BoostingStrategy {
treeStrategy.maxDepth = 3
algo match {
case "Classification" =>
- treeStrategy.numClassesForClassification = 2
+ treeStrategy.numClasses = 2
new BoostingStrategy(treeStrategy, LogLoss)
case "Regression" =>
new BoostingStrategy(treeStrategy, SquaredError)
case _ =>
- throw new IllegalArgumentException(s"$algo is not supported by the boosting.")
+ throw new IllegalArgumentException(s"$algo is not supported by boosting.")
}
}
}
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
index d75f38433c..d5cd89ab94 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
@@ -37,7 +37,7 @@ import org.apache.spark.mllib.tree.configuration.QuantileStrategy._
* Supported for Regression: [[org.apache.spark.mllib.tree.impurity.Variance]].
* @param maxDepth Maximum depth of the tree.
* E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
- * @param numClassesForClassification Number of classes for classification.
+ * @param numClasses Number of classes for classification.
* (Ignored for regression.)
* Default value is 2 (binary classification).
* @param maxBins Maximum number of bins used for discretizing continuous features and
@@ -73,7 +73,7 @@ class Strategy (
@BeanProperty var algo: Algo,
@BeanProperty var impurity: Impurity,
@BeanProperty var maxDepth: Int,
- @BeanProperty var numClassesForClassification: Int = 2,
+ @BeanProperty var numClasses: Int = 2,
@BeanProperty var maxBins: Int = 32,
@BeanProperty var quantileCalculationStrategy: QuantileStrategy = Sort,
@BeanProperty var categoricalFeaturesInfo: Map[Int, Int] = Map[Int, Int](),
@@ -86,7 +86,7 @@ class Strategy (
@BeanProperty var checkpointInterval: Int = 10) extends Serializable {
def isMulticlassClassification =
- algo == Classification && numClassesForClassification > 2
+ algo == Classification && numClasses > 2
def isMulticlassWithCategoricalFeatures
= isMulticlassClassification && (categoricalFeaturesInfo.size > 0)
@@ -97,10 +97,10 @@ class Strategy (
algo: Algo,
impurity: Impurity,
maxDepth: Int,
- numClassesForClassification: Int,
+ numClasses: Int,
maxBins: Int,
categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer]) {
- this(algo, impurity, maxDepth, numClassesForClassification, maxBins, Sort,
+ this(algo, impurity, maxDepth, numClasses, maxBins, Sort,
categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap)
}
@@ -117,8 +117,8 @@ class Strategy (
*/
def setCategoricalFeaturesInfo(
categoricalFeaturesInfo: java.util.Map[java.lang.Integer, java.lang.Integer]): Unit = {
- setCategoricalFeaturesInfo(
- categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap)
+ this.categoricalFeaturesInfo =
+ categoricalFeaturesInfo.asInstanceOf[java.util.Map[Int, Int]].asScala.toMap
}
/**
@@ -128,9 +128,9 @@ class Strategy (
private[tree] def assertValid(): Unit = {
algo match {
case Classification =>
- require(numClassesForClassification >= 2,
- s"DecisionTree Strategy for Classification must have numClassesForClassification >= 2," +
- s" but numClassesForClassification = $numClassesForClassification.")
+ require(numClasses >= 2,
+ s"DecisionTree Strategy for Classification must have numClasses >= 2," +
+ s" but numClasses = $numClasses.")
require(Set(Gini, Entropy).contains(impurity),
s"DecisionTree Strategy given invalid impurity for Classification: $impurity." +
s" Valid settings: Gini, Entropy")
@@ -160,7 +160,7 @@ class Strategy (
/** Returns a shallow copy of this instance. */
def copy: Strategy = {
- new Strategy(algo, impurity, maxDepth, numClassesForClassification, maxBins,
+ new Strategy(algo, impurity, maxDepth, numClasses, maxBins,
quantileCalculationStrategy, categoricalFeaturesInfo, minInstancesPerNode, minInfoGain,
maxMemoryInMB, subsamplingRate, useNodeIdCache, checkpointDir, checkpointInterval)
}
@@ -176,9 +176,9 @@ object Strategy {
def defaultStrategy(algo: String): Strategy = algo match {
case "Classification" =>
new Strategy(algo = Classification, impurity = Gini, maxDepth = 10,
- numClassesForClassification = 2)
+ numClasses = 2)
case "Regression" =>
new Strategy(algo = Regression, impurity = Variance, maxDepth = 10,
- numClassesForClassification = 0)
+ numClasses = 0)
}
}
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
index 5bc0f2635c..951733fada 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
@@ -110,7 +110,7 @@ private[tree] object DecisionTreeMetadata extends Logging {
val numFeatures = input.take(1)(0).features.size
val numExamples = input.count()
val numClasses = strategy.algo match {
- case Classification => strategy.numClassesForClassification
+ case Classification => strategy.numClasses
case Regression => 0
}
diff --git a/mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
index 972c905ec9..9347eaf922 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
@@ -57,7 +57,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
Classification,
Gini,
maxDepth = 2,
- numClassesForClassification = 2,
+ numClasses = 2,
maxBins = 100,
categoricalFeaturesInfo = Map(0 -> 2, 1-> 2))
@@ -81,7 +81,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
Classification,
Gini,
maxDepth = 2,
- numClassesForClassification = 2,
+ numClasses = 2,
maxBins = 100,
categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
@@ -177,7 +177,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
Classification,
Gini,
maxDepth = 2,
- numClassesForClassification = 100,
+ numClasses = 100,
maxBins = 100,
categoricalFeaturesInfo = Map(0 -> 3, 1-> 3))
@@ -271,7 +271,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
Classification,
Gini,
maxDepth = 2,
- numClassesForClassification = 100,
+ numClasses = 100,
maxBins = 100,
categoricalFeaturesInfo = Map(0 -> 10, 1-> 10))
// 2^(10-1) - 1 > 100, so categorical features will be ordered
@@ -295,7 +295,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val strategy = new Strategy(
Classification,
Gini,
- numClassesForClassification = 2,
+ numClasses = 2,
maxDepth = 2,
maxBins = 100,
categoricalFeaturesInfo = Map(0 -> 3, 1-> 3))
@@ -377,7 +377,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
assert(arr.length === 1000)
val rdd = sc.parallelize(arr)
val strategy = new Strategy(Classification, Gini, maxDepth = 3,
- numClassesForClassification = 2, maxBins = 100)
+ numClasses = 2, maxBins = 100)
val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
assert(!metadata.isUnordered(featureIndex = 0))
assert(!metadata.isUnordered(featureIndex = 1))
@@ -401,7 +401,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
assert(arr.length === 1000)
val rdd = sc.parallelize(arr)
val strategy = new Strategy(Classification, Gini, maxDepth = 3,
- numClassesForClassification = 2, maxBins = 100)
+ numClasses = 2, maxBins = 100)
val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
assert(!metadata.isUnordered(featureIndex = 0))
assert(!metadata.isUnordered(featureIndex = 1))
@@ -426,7 +426,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
assert(arr.length === 1000)
val rdd = sc.parallelize(arr)
val strategy = new Strategy(Classification, Entropy, maxDepth = 3,
- numClassesForClassification = 2, maxBins = 100)
+ numClasses = 2, maxBins = 100)
val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
assert(!metadata.isUnordered(featureIndex = 0))
assert(!metadata.isUnordered(featureIndex = 1))
@@ -451,7 +451,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
assert(arr.length === 1000)
val rdd = sc.parallelize(arr)
val strategy = new Strategy(Classification, Entropy, maxDepth = 3,
- numClassesForClassification = 2, maxBins = 100)
+ numClasses = 2, maxBins = 100)
val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
assert(!metadata.isUnordered(featureIndex = 0))
assert(!metadata.isUnordered(featureIndex = 1))
@@ -485,7 +485,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
// Train a 1-node model
val strategyOneNode = new Strategy(Classification, Entropy, maxDepth = 1,
- numClassesForClassification = 2, maxBins = 100)
+ numClasses = 2, maxBins = 100)
val modelOneNode = DecisionTree.train(rdd, strategyOneNode)
val rootNode1 = modelOneNode.topNode.deepCopy()
val rootNode2 = modelOneNode.topNode.deepCopy()
@@ -545,7 +545,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val arr = DecisionTreeSuite.generateCategoricalDataPointsForMulticlass()
val rdd = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 4,
- numClassesForClassification = 3, categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
+ numClasses = 3, categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
assert(strategy.isMulticlassClassification)
assert(metadata.isUnordered(featureIndex = 0))
@@ -568,7 +568,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
arr(3) = new LabeledPoint(1.0, Vectors.dense(3.0))
val rdd = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 4,
- numClassesForClassification = 2)
+ numClasses = 2)
val model = DecisionTree.train(rdd, strategy)
DecisionTreeSuite.validateClassifier(model, arr, 1.0)
@@ -585,7 +585,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val rdd = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 4,
- numClassesForClassification = 2)
+ numClasses = 2)
val model = DecisionTree.train(rdd, strategy)
DecisionTreeSuite.validateClassifier(model, arr, 1.0)
@@ -600,7 +600,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val arr = DecisionTreeSuite.generateCategoricalDataPointsForMulticlass()
val rdd = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 4,
- numClassesForClassification = 3, maxBins = maxBins,
+ numClasses = 3, maxBins = maxBins,
categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
assert(strategy.isMulticlassClassification)
val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
@@ -629,7 +629,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val arr = DecisionTreeSuite.generateContinuousDataPointsForMulticlass()
val rdd = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 4,
- numClassesForClassification = 3, maxBins = 100)
+ numClasses = 3, maxBins = 100)
assert(strategy.isMulticlassClassification)
val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
@@ -650,7 +650,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val arr = DecisionTreeSuite.generateContinuousDataPointsForMulticlass()
val rdd = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 4,
- numClassesForClassification = 3, maxBins = 100, categoricalFeaturesInfo = Map(0 -> 3))
+ numClasses = 3, maxBins = 100, categoricalFeaturesInfo = Map(0 -> 3))
assert(strategy.isMulticlassClassification)
val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
assert(metadata.isUnordered(featureIndex = 0))
@@ -671,7 +671,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val arr = DecisionTreeSuite.generateCategoricalDataPointsForMulticlassForOrderedFeatures()
val rdd = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 4,
- numClassesForClassification = 3, maxBins = 100,
+ numClasses = 3, maxBins = 100,
categoricalFeaturesInfo = Map(0 -> 10, 1 -> 10))
assert(strategy.isMulticlassClassification)
val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
@@ -692,7 +692,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val arr = DecisionTreeSuite.generateCategoricalDataPointsForMulticlassForOrderedFeatures()
val rdd = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 4,
- numClassesForClassification = 3, maxBins = 10,
+ numClasses = 3, maxBins = 10,
categoricalFeaturesInfo = Map(0 -> 10, 1 -> 10))
assert(strategy.isMulticlassClassification)
@@ -708,7 +708,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val rdd = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini,
- maxDepth = 2, numClassesForClassification = 2, minInstancesPerNode = 2)
+ maxDepth = 2, numClasses = 2, minInstancesPerNode = 2)
val model = DecisionTree.train(rdd, strategy)
assert(model.topNode.isLeaf)
@@ -737,7 +737,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val rdd = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini,
maxBins = 2, maxDepth = 2, categoricalFeaturesInfo = Map(0 -> 2, 1-> 2),
- numClassesForClassification = 2, minInstancesPerNode = 2)
+ numClasses = 2, minInstancesPerNode = 2)
val rootNode = DecisionTree.train(rdd, strategy).topNode
@@ -755,7 +755,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val input = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 2,
- numClassesForClassification = 2, minInfoGain = 1.0)
+ numClasses = 2, minInfoGain = 1.0)
val model = DecisionTree.train(input, strategy)
assert(model.topNode.isLeaf)
@@ -781,7 +781,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val input = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 1,
- numClassesForClassification = 2, categoricalFeaturesInfo = Map(0 -> 3))
+ numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
val (splits, bins) = DecisionTree.findSplitsBins(input, metadata)
@@ -824,7 +824,7 @@ class DecisionTreeSuite extends FunSuite with MLlibTestSparkContext {
val input = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 5,
- numClassesForClassification = 2, categoricalFeaturesInfo = Map(0 -> 3))
+ numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
val (splits, bins) = DecisionTree.findSplitsBins(input, metadata)
diff --git a/mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala
index d4d54cf4c9..3aa97e5446 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala
@@ -100,7 +100,7 @@ class GradientBoostedTreesSuite extends FunSuite with MLlibTestSparkContext {
val rdd = sc.parallelize(GradientBoostedTreesSuite.data, 2)
val treeStrategy = new Strategy(algo = Classification, impurity = Variance, maxDepth = 2,
- numClassesForClassification = 2, categoricalFeaturesInfo = Map.empty,
+ numClasses = 2, categoricalFeaturesInfo = Map.empty,
subsamplingRate = subsamplingRate)
val boostingStrategy =
new BoostingStrategy(treeStrategy, LogLoss, numIterations, learningRate)
diff --git a/mllib/src/test/scala/org/apache/spark/mllib/tree/RandomForestSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/tree/RandomForestSuite.scala
index 90a8c2dfda..f7f0f20c6c 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/tree/RandomForestSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/tree/RandomForestSuite.scala
@@ -57,7 +57,7 @@ class RandomForestSuite extends FunSuite with MLlibTestSparkContext {
" comparing DecisionTree vs. RandomForest(numTrees = 1)") {
val categoricalFeaturesInfo = Map.empty[Int, Int]
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 2,
- numClassesForClassification = 2, categoricalFeaturesInfo = categoricalFeaturesInfo)
+ numClasses = 2, categoricalFeaturesInfo = categoricalFeaturesInfo)
binaryClassificationTestWithContinuousFeatures(strategy)
}
@@ -65,7 +65,7 @@ class RandomForestSuite extends FunSuite with MLlibTestSparkContext {
" comparing DecisionTree vs. RandomForest(numTrees = 1)") {
val categoricalFeaturesInfo = Map.empty[Int, Int]
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 2,
- numClassesForClassification = 2, categoricalFeaturesInfo = categoricalFeaturesInfo,
+ numClasses = 2, categoricalFeaturesInfo = categoricalFeaturesInfo,
useNodeIdCache = true)
binaryClassificationTestWithContinuousFeatures(strategy)
}
@@ -93,7 +93,7 @@ class RandomForestSuite extends FunSuite with MLlibTestSparkContext {
" comparing DecisionTree vs. RandomForest(numTrees = 1)") {
val categoricalFeaturesInfo = Map.empty[Int, Int]
val strategy = new Strategy(algo = Regression, impurity = Variance,
- maxDepth = 2, maxBins = 10, numClassesForClassification = 2,
+ maxDepth = 2, maxBins = 10, numClasses = 2,
categoricalFeaturesInfo = categoricalFeaturesInfo)
regressionTestWithContinuousFeatures(strategy)
}
@@ -102,7 +102,7 @@ class RandomForestSuite extends FunSuite with MLlibTestSparkContext {
" comparing DecisionTree vs. RandomForest(numTrees = 1)") {
val categoricalFeaturesInfo = Map.empty[Int, Int]
val strategy = new Strategy(algo = Regression, impurity = Variance,
- maxDepth = 2, maxBins = 10, numClassesForClassification = 2,
+ maxDepth = 2, maxBins = 10, numClasses = 2,
categoricalFeaturesInfo = categoricalFeaturesInfo, useNodeIdCache = true)
regressionTestWithContinuousFeatures(strategy)
}
@@ -169,14 +169,14 @@ class RandomForestSuite extends FunSuite with MLlibTestSparkContext {
test("Binary classification with continuous features: subsampling features") {
val categoricalFeaturesInfo = Map.empty[Int, Int]
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 2,
- numClassesForClassification = 2, categoricalFeaturesInfo = categoricalFeaturesInfo)
+ numClasses = 2, categoricalFeaturesInfo = categoricalFeaturesInfo)
binaryClassificationTestWithContinuousFeaturesAndSubsampledFeatures(strategy)
}
test("Binary classification with continuous features and node Id cache: subsampling features") {
val categoricalFeaturesInfo = Map.empty[Int, Int]
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 2,
- numClassesForClassification = 2, categoricalFeaturesInfo = categoricalFeaturesInfo,
+ numClasses = 2, categoricalFeaturesInfo = categoricalFeaturesInfo,
useNodeIdCache = true)
binaryClassificationTestWithContinuousFeaturesAndSubsampledFeatures(strategy)
}
@@ -191,7 +191,7 @@ class RandomForestSuite extends FunSuite with MLlibTestSparkContext {
val input = sc.parallelize(arr)
val strategy = new Strategy(algo = Classification, impurity = Gini, maxDepth = 5,
- numClassesForClassification = 3, categoricalFeaturesInfo = categoricalFeaturesInfo)
+ numClasses = 3, categoricalFeaturesInfo = categoricalFeaturesInfo)
val model = RandomForest.trainClassifier(input, strategy, numTrees = 2,
featureSubsetStrategy = "sqrt", seed = 12345)
EnsembleTestHelper.validateClassifier(model, arr, 1.0)
diff --git a/python/pyspark/mllib/tree.py b/python/pyspark/mllib/tree.py
index 46e253991a..6670247847 100644
--- a/python/pyspark/mllib/tree.py
+++ b/python/pyspark/mllib/tree.py
@@ -250,7 +250,7 @@ class RandomForest(object):
return RandomForestModel(model)
@classmethod
- def trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees,
+ def trainClassifier(cls, data, numClasses, categoricalFeaturesInfo, numTrees,
featureSubsetStrategy="auto", impurity="gini", maxDepth=4, maxBins=32,
seed=None):
"""
@@ -259,7 +259,7 @@ class RandomForest(object):
:param data: Training dataset: RDD of LabeledPoint. Labels should take
values {0, 1, ..., numClasses-1}.
- :param numClassesForClassification: number of classes for classification.
+ :param numClasses: number of classes for classification.
:param categoricalFeaturesInfo: Map storing arity of categorical features.
E.g., an entry (n -> k) indicates that feature n is categorical
with k categories indexed from 0: {0, 1, ..., k-1}.
@@ -320,7 +320,7 @@ class RandomForest(object):
>>> model.predict(rdd).collect()
[1.0, 0.0]
"""
- return cls._train(data, "classification", numClassesForClassification,
+ return cls._train(data, "classification", numClasses,
categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
maxDepth, maxBins, seed)