[SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix

Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)

Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
  * Use train/test split, and compute test error instead of training error.
* Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming

I have run all examples and relevant unit tests.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples

(cherry picked from commit 657a88835d8bf22488b53d50f75281d7dc32442e)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
author: Joseph K. Bradley <joseph@databricks.com> 2014-12-04 09:57:50 +0800
committer: Xiangrui Meng <meng@databricks.com> 2014-12-04 09:58:43 +0800
commit: 9880bb481943b45cb5ad981809cf5cbd7b0639bb (patch)
tree: 08b51e2b119040c0ab7593f4255f4112ab9a734f /python
parent: 4259ca8dd1217e135a1b2656307c33f2d48f6f50 (diff)
download: spark-9880bb481943b45cb5ad981809cf5cbd7b0639bb.tar.gz
spark-9880bb481943b45cb5ad981809cf5cbd7b0639bb.tar.bz2
spark-9880bb481943b45cb5ad981809cf5cbd7b0639bb.zip
1 files changed, 3 insertions, 3 deletions
diff --git a/python/pyspark/mllib/tree.py b/python/pyspark/mllib/tree.py
index 46e253991a..6670247847 100644
--- a/python/pyspark/mllib/tree.py
+++ b/python/pyspark/mllib/tree.py
@@ -250,7 +250,7 @@ class RandomForest(object):
         return RandomForestModel(model)
 
     @classmethod
-    def trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees,
+    def trainClassifier(cls, data, numClasses, categoricalFeaturesInfo, numTrees,
                         featureSubsetStrategy="auto", impurity="gini", maxDepth=4, maxBins=32,
                         seed=None):
         """
@@ -259,7 +259,7 @@ class RandomForest(object):
 
         :param data: Training dataset: RDD of LabeledPoint. Labels should take
                values {0, 1, ..., numClasses-1}.
-        :param numClassesForClassification: number of classes for classification.
+        :param numClasses: number of classes for classification.
         :param categoricalFeaturesInfo: Map storing arity of categorical features.
                E.g., an entry (n -> k) indicates that feature n is categorical
                with k categories indexed from 0: {0, 1, ..., k-1}.
@@ -320,7 +320,7 @@ class RandomForest(object):
         >>> model.predict(rdd).collect()
         [1.0, 0.0]
         """
-        return cls._train(data, "classification", numClassesForClassification,
+        return cls._train(data, "classification", numClasses,
                           categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
                           maxDepth, maxBins, seed)
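
Because the rename above is a breaking change, any caller that passed the old name as a keyword argument (`numClassesForClassification=...`) will fail with a `TypeError` after this commit. The following is a minimal sketch of one way calling code could bridge the two names during a migration. It is not part of this patch or of PySpark: `accept_legacy_kwarg` is a hypothetical helper, and `train_classifier` is a stand-in that only echoes its arguments so the remapping can be observed without a Spark installation.

```python
import functools
import warnings


def accept_legacy_kwarg(old_name, new_name):
    """Hypothetical helper: remap a renamed keyword argument to its new name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                if new_name in kwargs:
                    raise TypeError(
                        "pass either %r or %r, not both" % (old_name, new_name))
                warnings.warn(
                    "%r is deprecated; use %r" % (old_name, new_name),
                    DeprecationWarning, stacklevel=2)
                # Move the value over to the new keyword before calling through.
                kwargs[new_name] = kwargs.pop(old_name)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@accept_legacy_kwarg("numClassesForClassification", "numClasses")
def train_classifier(data, numClasses, categoricalFeaturesInfo, numTrees):
    # Stand-in (assumption) mirroring the leading parameters of
    # RandomForest.trainClassifier from the diff; it trains nothing.
    return {"numClasses": numClasses, "numTrees": numTrees}


# New-style call, matching the signature introduced by this commit.
new_style = train_classifier([], numClasses=2,
                             categoricalFeaturesInfo={}, numTrees=3)

# Old-style keyword call; the wrapper remaps it and emits a DeprecationWarning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    old_style = train_classifier([], numClassesForClassification=2,
                                 categoricalFeaturesInfo={}, numTrees=3)
```

Callers that pass the number of classes positionally are unaffected by the rename, so a shim like this would only matter for keyword-style call sites.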