[SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160)

Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala.

In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression is set in line 792 as follows:

    rightNodeAgg(featureIndex)(2 * (numBins - 2)) = binData(shift + (2 * (numBins - 1)))

Then there is a loop that sets the rest of the values, as in line 809:

    rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
      binData(shift + (2 *(numBins - 2 - splitIndex))) +
      rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))

But since splitIndex starts at 1, the first loop iteration reads the bin at numBins - 3 rather than numBins - 2, so one set of binData values is skipped. The changes here address this issue, for both the Regression and Classification cases.

Author: johnnywalleye <jsondag@gmail.com>

Closes #1316 from johnnywalleye/master and squashes the following commits:

73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations

(cherry picked from commit 1114207cc8e4ef94cb97bbd5a2ef3ae4d51f73fa)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
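
To make the off-by-one described above concrete, here is a minimal sketch, not the MLlib code itself. It assumes a single statistic per bin (stride 1 instead of the 2 or 3 used in DecisionTree.scala), drops the `shift` and `featureIndex` indexing, and assumes the loop runs while `splitIndex < numBins - 1`. The names `RightAggSketch`, `rightAggregates`, and `fixed` are hypothetical and exist only for this illustration.

```scala
object RightAggSketch {
  // Right-node aggregate for split s should be the sum of bins s+1 .. numBins-1.
  // `fixed = false` uses the pre-patch bin offset, `fixed = true` the patched one.
  // Assumes bins.length >= 2 (at least one split).
  def rightAggregates(bins: Array[Double], fixed: Boolean): Array[Double] = {
    val numBins = bins.length
    val right = new Array[Double](numBins - 1) // one entry per split
    // Highest split: only the last bin lies to its right.
    right(numBins - 2) = bins(numBins - 1)
    var splitIndex = 1
    while (splitIndex < numBins - 1) {
      // The patch changes only which bin is added at each step.
      val binOffset =
        if (fixed) numBins - 1 - splitIndex else numBins - 2 - splitIndex
      right(numBins - 2 - splitIndex) =
        bins(binOffset) + right(numBins - 1 - splitIndex)
      splitIndex += 1
    }
    right
  }

  def main(args: Array[String]): Unit = {
    val bins = Array(1.0, 2.0, 4.0, 8.0)
    // Pre-patch offset: bin 2 (value 4.0) is never read.
    println(rightAggregates(bins, fixed = false).mkString(", ")) // 11.0, 10.0, 8.0
    // Patched offset: each entry is the sum of all bins right of the split.
    println(rightAggregates(bins, fixed = true).mkString(", "))  // 14.0, 12.0, 8.0
  }
}
```

With four bins, the pre-patch offset starts the loop at bin 1 instead of bin 2, so the value 4.0 never enters any right-node aggregate and bin 0, which lies left of every split, is added to the aggregate for split 0.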
author: johnnywalleye <jsondag@gmail.com> 2014-07-08 19:17:26 -0700
committer: Xiangrui Meng <meng@databricks.com> 2014-07-08 19:17:43 -0700
commit: d569838bc067f2b64f6c10e54ba8e5973f8fc93a (patch)
tree: c6b8244617ecd5e59d7fdbc2ffc6fcb7fd2d2f2b
parent: 885489112c82eb909df7efbf0515fd7abfae41a4 (diff)
download: spark-d569838bc067f2b64f6c10e54ba8e5973f8fc93a.tar.gz
spark-d569838bc067f2b64f6c10e54ba8e5973f8fc93a.tar.bz2
spark-d569838bc067f2b64f6c10e54ba8e5973f8fc93a.zip
1 files changed, 5 insertions, 5 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
index 3b13e52a7b..74d5d7ba10 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
@@ -807,10 +807,10 @@ object DecisionTree extends Serializable with Logging {
               // calculating right node aggregate for a split as a sum of right node aggregate of a
               // higher split and the right bin aggregate of a bin where the split is a low split
               rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
-                binData(shift + (2 *(numBins - 2 - splitIndex))) +
+                binData(shift + (2 *(numBins - 1 - splitIndex))) +
                 rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))
               rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex) + 1) =
-                binData(shift + (2* (numBins - 2 - splitIndex) + 1)) +
+                binData(shift + (2* (numBins - 1 - splitIndex) + 1)) +
                   rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex) + 1)
 
               splitIndex += 1
@@ -855,13 +855,13 @@ object DecisionTree extends Serializable with Logging {
               // calculating right node aggregate for a split as a sum of right node aggregate of a
               // higher split and the right bin aggregate of a bin where the split is a low split
               rightNodeAgg(featureIndex)(3 * (numBins - 2 - splitIndex)) =
-                binData(shift + (3 * (numBins - 2 - splitIndex))) +
+                binData(shift + (3 * (numBins - 1 - splitIndex))) +
                   rightNodeAgg(featureIndex)(3 * (numBins - 1 - splitIndex))
               rightNodeAgg(featureIndex)(3 * (numBins - 2 - splitIndex) + 1) =
-                binData(shift + (3 * (numBins - 2 - splitIndex) + 1)) +
+                binData(shift + (3 * (numBins - 1 - splitIndex) + 1)) +
                   rightNodeAgg(featureIndex)(3 * (numBins - 1 - splitIndex) + 1)
               rightNodeAgg(featureIndex)(3 * (numBins - 2 - splitIndex) + 2) =
-                binData(shift + (3 * (numBins - 2 - splitIndex) + 2)) +
+                binData(shift + (3 * (numBins - 1 - splitIndex) + 2)) +
                   rightNodeAgg(featureIndex)(3 * (numBins - 1 - splitIndex) + 2)
 
               splitIndex += 1