[SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree more adaptively

DecisionTree splits on continuous features by choosing an array of values from a subsample of the data. Currently, it does not check for identical values in the subsample, so it can end up with multiple copies of the same split.

In this PR, we choose splits for a continuous feature in 3 steps:

1. Sort the sample values for this feature.
2. Count the number of occurrences of each distinct value.
3. Iterate over the value-count array computed in step 2 to choose splits.

After the splits are found, `numSplits` and `numBins` in the metadata are updated.

CC: mengxr manishamde jkbradley, please help me review this, thanks.

Author: Qiping Li <liqiping1991@gmail.com>
Author: chouqin <liqiping1991@gmail.com>
Author: liqi <liqiping1991@gmail.com>
Author: qiping.lqp <qiping.lqp@alibaba-inc.com>

Closes #2780 from chouqin/dt-findsplits and squashes the following commits:

18d0301 [Qiping Li] check explicitly findsplits return distinct splits
8dc28ab [chouqin] remove blank lines
ffc920f [chouqin] adjust code based on comments and add more test cases
9857039 [chouqin] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
d353596 [qiping.lqp] fix pyspark doc test
9e64699 [Qiping Li] fix random forest unit test
3c72913 [Qiping Li] fix random forest unit test
092efcb [Qiping Li] fix bug
f69f47f [Qiping Li] fix bug
ab303a4 [Qiping Li] fix bug
af6dc97 [Qiping Li] fix bug
2a8267a [Qiping Li] fix bug
c339a61 [Qiping Li] fix bug
369f812 [Qiping Li] fix style
8f46af6 [Qiping Li] add comments and unit test
9e7138e [Qiping Li] Merge branch 'dt-findsplits' of https://github.com/chouqin/spark into dt-findsplits
1b25a35 [Qiping Li] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
0cd744a [liqi] fix bug
3652823 [Qiping Li] fix bug
af7cb79 [Qiping Li] Choose splits for continuous features in DecisionTree more adaptively
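
The three steps in the description can be sketched in plain Python as follows. This is an illustration only, not the Scala code from the PR: the function name `find_splits_for_continuous_feature`, its parameters, and the stride-based rule used in step 3 are assumptions made for the example.

```python
from collections import Counter


def find_splits_for_continuous_feature(samples, num_splits):
    """Hypothetical helper: pick at most num_splits distinct thresholds."""
    if not samples or num_splits <= 0:
        return []

    # Steps 1 and 2: sort the distinct values and count their occurrences.
    value_counts = sorted(Counter(samples).items())

    # With no more distinct values than requested splits, use each value once.
    if len(value_counts) <= num_splits:
        return [value for value, _ in value_counts]

    # Step 3 (assumed rule): walk the cumulative counts and emit a split
    # whenever the previous cumulative count is closer to the current
    # target than the new one. Targets are spaced evenly by `stride`.
    stride = len(samples) / float(num_splits + 1)
    splits = []
    current_count = value_counts[0][1]
    target_count = stride
    for index in range(1, len(value_counts)):
        previous_count = current_count
        current_count += value_counts[index][1]
        previous_gap = abs(previous_count - target_count)
        current_gap = abs(current_count - target_count)
        if previous_gap < current_gap:
            splits.append(value_counts[index - 1][0])
            target_count += stride
    return splits


# Two distinct values, three requested splits: the values themselves are used.
print(find_splits_for_continuous_feature([0.0, 0.0, 1.0, 1.0], 3))

# Heavily repeated values still yield distinct thresholds.
print(find_splits_for_continuous_feature(
    [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0], 2))
```

Because every candidate threshold is taken from the sorted list of distinct values, a split cannot appear twice. In the first call the thresholds are the sample values themselves, which may be why the doctest threshold in `tree.py` moves from 0.5 to 0.0.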
author: Qiping Li <liqiping1991@gmail.com> 2014-10-20 13:12:26 -0700
committer: Xiangrui Meng <meng@databricks.com> 2014-10-20 13:12:26 -0700
commit: eadc4c590ee43572528da55d84ed65f09153e857 (patch)
tree: 7a04ace345620ed17a79984c9b3a9241579b6e96 /python/pyspark
parent: 4afe9a4852ebeb4cc77322a14225cd3dec165f3f (diff)
download: spark-eadc4c590ee43572528da55d84ed65f09153e857.tar.gz
spark-eadc4c590ee43572528da55d84ed65f09153e857.tar.bz2
spark-eadc4c590ee43572528da55d84ed65f09153e857.zip
1 files changed, 2 insertions, 2 deletions
diff --git a/python/pyspark/mllib/tree.py b/python/pyspark/mllib/tree.py
index 0938eebd3a..64ee79d83e 100644
--- a/python/pyspark/mllib/tree.py
+++ b/python/pyspark/mllib/tree.py
@@ -153,9 +153,9 @@ class DecisionTree(object):
         DecisionTreeModel classifier of depth 1 with 3 nodes
         >>> print model.toDebugString(),  # it already has newline
         DecisionTreeModel classifier of depth 1 with 3 nodes
-          If (feature 0 <= 0.5)
+          If (feature 0 <= 0.0)
            Predict: 0.0
-          Else (feature 0 > 0.5)
+          Else (feature 0 > 0.0)
            Predict: 1.0
         >>> model.predict(array([1.0])) > 0
         True