[SPARK-3934] [SPARK-3918] [mllib] Bug fixes for RandomForest, DecisionTree

SPARK-3934: When run with a mix of unordered categorical and continuous features, on multiclass classification, RandomForest fails. The bug is in the sanity checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the wrong indices for checking whether features are unordered.
Fix: Remove the sanity checks since they are not really needed, and since they would require DTStatsAggregator to keep track of an extra set of indices (for the feature subset).
Added test to RandomForestSuite which failed with old version but now works.

SPARK-3918: Added baggedInput.unpersist at end of training.

Also:
* I removed DTStatsAggregator.isUnordered since it is no longer used.
* DecisionTreeMetadata: Added logWarning when maxBins is automatically reduced.
* Updated DecisionTreeRunner to explicitly fix the test data to have the same number of features as the training data. This is a temporary fix which should eventually be replaced by pre-indexing both datasets.
* RandomForestModel: Updated toString to print total number of nodes in forest.
* Changed Predict class to be public DeveloperApi. This was necessary to allow users to create their own trees by hand (for testing).

CC: mengxr manishamde chouqin codedeft Just notifying you of these small bug fixes.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2785 from jkbradley/dtrunner-update and squashes the following commits:

9132321 [Joseph K. Bradley] merged with master, fixed imports
9dbd000 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
e116473 [Joseph K. Bradley] Changed Predict class to be public DeveloperApi.
f502e65 [Joseph K. Bradley] bug fix for SPARK-3934
7f3d60f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
ba567ab [Joseph K. Bradley] Changed DTRunner to load test data using same number of features as in training data.
4e88c1f [Joseph K. Bradley] changed RF toString to print total number of nodes
author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> 2014-10-17 15:02:57 -0700
committer: Xiangrui Meng <meng@databricks.com> 2014-10-17 15:02:57 -0700
commit: 477c6481cca94b15c9c8b43e674f220a1cda1dd1 (patch)
tree: c58fdab5e5fd89a64838fa79d889bfbf7e4cbbd5 /examples/src
parent: 23f6171d633d4347ca4aa8ec7cb7bd57342b21b5 (diff)
download: spark-477c6481cca94b15c9c8b43e674f220a1cda1dd1.tar.gz
spark-477c6481cca94b15c9c8b43e674f220a1cda1dd1.tar.bz2
spark-477c6481cca94b15c9c8b43e674f220a1cda1dd1.zip
1 files changed, 2 insertions, 1 deletions
diff --git a/examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala b/examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
index 837d059147..0890e6263e 100644
--- a/examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
@@ -189,9 +189,10 @@ object DecisionTreeRunner {
     // Create training, test sets.
     val splits = if (params.testInput != "") {
       // Load testInput.
+      val numFeatures = examples.take(1)(0).features.size
       val origTestExamples = params.dataFormat match {
         case "dense" => MLUtils.loadLabeledPoints(sc, params.testInput)
-        case "libsvm" => MLUtils.loadLibSVMFile(sc, params.testInput)
+        case "libsvm" => MLUtils.loadLibSVMFile(sc, params.testInput, numFeatures)
       }
       params.algo match {
         case Classification => {
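
The hunk above passes the training set's feature count to the LIBSVM loader for the test set. A likely reason, sketched below without any Spark dependency, is that LIBSVM is a sparse format: if the dimension is inferred from the largest feature index present in a file, a test file that never mentions the highest-indexed features will be read as having fewer features than the training data. The object `LibSvmDimensionSketch` and its helpers `parseLine` and `inferDim` are hypothetical names made up for this illustration; they are not MLUtils code and only approximate the inference behavior assumed here.

```scala
// Minimal sketch, assuming the feature dimension is inferred as (largest index + 1)
// when no explicit numFeatures is supplied. All names are hypothetical.
object LibSvmDimensionSketch {

  // Parse one LIBSVM line "label idx:val idx:val ..." (1-based indices)
  // into (label, zero-based indices, values).
  def parseLine(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { t =>
      val Array(i, v) = t.split(":")
      (i.toInt - 1, v.toDouble)
    }
    (label, pairs.map(_._1), pairs.map(_._2))
  }

  // Return the explicit numFeatures when positive; otherwise infer the
  // dimension from the largest feature index seen in the data.
  def inferDim(lines: Seq[String], numFeatures: Int = -1): Int =
    if (numFeatures > 0) numFeatures
    else lines.map { l =>
      val (_, indices, _) = parseLine(l)
      if (indices.isEmpty) 0 else indices.max + 1
    }.max

  def main(args: Array[String]): Unit = {
    val train = Seq("1 1:0.5 4:2.0", "0 2:1.0")
    val test = Seq("1 1:0.3 2:0.7") // never mentions features 3 or 4

    val trainDim = inferDim(train)           // inferred from training data
    val testInferred = inferDim(test)        // inferred from test data alone
    val testFixed = inferDim(test, trainDim) // pinned to the training dimension

    println(s"train=$trainDim testInferred=$testInferred testFixed=$testFixed")
    // prints "train=4 testInferred=2 testFixed=4"
  }
}
```

With the dimension pinned, test vectors have the same length as the training vectors, so a tree that splits on a high-indexed feature can still evaluate them. As the commit message notes, this is a temporary fix; pre-indexing both datasets together would be the more robust approach.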