author | Joseph K. Bradley <joseph.kurata.bradley@gmail.com> | 2014-08-15 14:50:10 -0700 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2014-08-15 14:50:10 -0700 |
commit | c7032290a3f0f5545aa4f0a9a144c62571344dc8 (patch) | |
tree | 4e9da3e875eda32ef0e430f4928f3ab6e2d31e3c /core/src/main/scala/org/apache/spark/serializer/Serializer.scala | |
parent | 0afe5cb65a195d2f14e8dfcefdbec5dac023651f (diff) | |
[SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered feature bug fix
DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
(3) Timing for DecisionTree internals
Details:
(1) TreePoint representation to avoid binning multiple times
[https://issues.apache.org/jira/browse/SPARK-3022]
Added private[tree] TreePoint class for representing binned feature values.
The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached. This avoids the previous problem of re-computing bins multiple times.
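The idea of binning once up front can be sketched as follows. This is a minimal illustration in plain Python, not the MLlib implementation: `BinnedPoint`, `find_bin`, and `to_binned_points` are hypothetical names, an ordinary list stands in for the cached RDD, and only continuous features with sorted thresholds are assumed.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class BinnedPoint:
    """Hypothetical stand-in for a TreePoint: a label plus one bin index per feature."""
    label: float
    binned_features: Tuple[int, ...]


def find_bin(value: float, thresholds: Sequence[float]) -> int:
    # thresholds must be sorted ascending. The bin index is the number of
    # thresholds strictly below the value, so a value equal to a threshold
    # lands in the lower bin.
    return bisect_left(thresholds, value)


def to_binned_points(
    points: Iterable[Tuple[float, Sequence[float]]],
    thresholds_per_feature: Sequence[Sequence[float]],
) -> List[BinnedPoint]:
    # Compute every bin index exactly once; later passes over the data
    # (one per tree level) can reuse the stored indices.
    return [
        BinnedPoint(
            label,
            tuple(find_bin(v, t) for v, t in zip(features, thresholds_per_feature)),
        )
        for label, features in points
    ]
```

Each per-level pass would then read `binned_features` directly instead of searching the thresholds again.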
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
[https://issues.apache.org/jira/browse/SPARK-3041]
isSampleValid used to treat unordered categorical features incorrectly: it treated the bins as if they were indexed by feature values, rather than by subsets of values/categories.
* The bug appeared for unordered features (multi-class classification with categorical features of low arity).
* Fix: Index bins correctly for unordered categorical features.
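To illustrate the difference between indexing by feature value and indexing by subset, here is a small sketch in plain Python. The bitmask encoding and the function names are assumptions made for this example; the encoding actually used in MLlib may differ.

```python
from typing import Set


def num_unordered_splits(arity: int) -> int:
    # A feature with `arity` categories has 2^(arity - 1) - 1 distinct ways
    # to divide its categories into two non-empty groups.
    return 2 ** (arity - 1) - 1


def categories_in_split(arity: int, split_index: int) -> Set[int]:
    # Assumed encoding: split i corresponds to bitmask i + 1, and category c
    # is in the subset when bit c of the mask is set.
    if not 0 <= split_index < num_unordered_splits(arity):
        raise ValueError("split_index out of range for this arity")
    mask = split_index + 1
    return {c for c in range(arity) if (mask >> c) & 1}


def goes_left(category: int, arity: int, split_index: int) -> bool:
    # Correct test: membership of the category in the subset the index names.
    # Comparing `category` with `split_index` directly would be the kind of
    # mistake described above, since the index names a subset, not a value.
    return category in categories_in_split(arity, split_index)
```

For a feature with three categories, index 2 names the subset {0, 1} under this encoding, so category 1 goes left for that split even though 1 != 2.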
(3) Timing for DecisionTree internals
Added the TimeTracker class (tree/impl/TimeTracker.scala), which is private[tree] for now, for timing key parts of the DecisionTree code.
Prints timing info via logDebug.
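A timer of this kind might look roughly like the sketch below. It is written in plain Python for illustration only: `SimpleTimeTracker` and its methods are hypothetical and are not the API of TimeTracker.scala, and Python's `logging.debug` stands in for logDebug.

```python
import logging
import time
from typing import Dict


class SimpleTimeTracker:
    """Hypothetical labelled timer that accumulates elapsed time per label."""

    def __init__(self) -> None:
        self._starts: Dict[str, float] = {}
        self._totals: Dict[str, float] = {}

    def start(self, label: str) -> None:
        if label in self._starts:
            raise RuntimeError(f"timer '{label}' is already running")
        self._starts[label] = time.perf_counter()

    def stop(self, label: str) -> float:
        if label not in self._starts:
            raise RuntimeError(f"timer '{label}' was never started")
        elapsed = time.perf_counter() - self._starts.pop(label)
        # Accumulate so a label timed once per tree level reports a total.
        self._totals[label] = self._totals.get(label, 0.0) + elapsed
        return elapsed

    def report(self) -> str:
        return "\n".join(
            f"{label}: {total:.6f} s" for label, total in sorted(self._totals.items())
        )


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    timer = SimpleTimeTracker()
    timer.start("binning")  # "binning" is an illustrative label
    timer.stop("binning")
    logging.debug("Timing info:\n%s", timer.report())
```

Emitting the report at debug level keeps the timing output out of normal runs unless debug logging is enabled.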
CC: mengxr manishamde chouqin
Very similar update, with one bug fix. Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes #1950 from jkbradley/dt-opt1 and squashes the following commits:
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
430d782 [Joseph K. Bradley] Added more debug info on binning error. Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up. Removed debugging println calls from DecisionTree. Made TreePoint extend Serializable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
Diffstat (limited to 'core/src/main/scala/org/apache/spark/serializer/Serializer.scala')
0 files changed, 0 insertions, 0 deletions