[SPARK-10064] [ML] Parallelize decision tree bin split calculations - spark


author	Nathan Howell <nhowell@godaddy.com>	2015-10-07 17:46:16 -0700
committer	Joseph K. Bradley <joseph@databricks.com>	2015-10-07 17:46:16 -0700
commit	1bc435ae3afb7a007b8a8ff00dcad4738a9ff055 (patch)
tree	9e8ff1046739f58d90941d027faf4bf91f76eac6 /python/pyspark/ml/regression.py
parent	075a0b658289608c8732e07e26e14d736e673ce9 (diff)

[SPARK-10064] [ML] Parallelize decision tree bin split calculations

Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation. With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.

Author: Nathan Howell <nhowell@godaddy.com>

Closes #8246 from NathanHowell/SPARK-10064.
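The idea in the message above (key each continuous feature's values by feature index, then compute split thresholds for each feature independently) can be sketched in plain Python. This is only an illustration and not the code from this commit: the function names `find_splits_for_feature` and `find_splits`, the thread pool standing in for an `RDD`, and the midpoint-based threshold choice are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def find_splits_for_feature(values, num_splits):
    """Pick up to num_splits thresholds for one continuous feature.

    Hypothetical helper: candidates are midpoints between adjacent
    distinct values, thinned out evenly when there are too many.
    """
    distinct = sorted(set(values))
    if len(distinct) <= 1:
        return []  # a constant feature cannot be split
    midpoints = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    if len(midpoints) <= num_splits:
        return midpoints
    stride = len(midpoints) / (num_splits + 1)
    return [midpoints[int(stride * (i + 1))] for i in range(num_splits)]


def find_splits(samples, continuous_features, num_splits, max_workers=4):
    """Compute splits per continuous feature, one task per feature.

    Categorical features are never touched, which mirrors the point
    that only continuous features need to be distributed.
    """
    # Analogue of a flatMap to (feature_index, value) pairs followed by
    # a group-by-key: collect the values of each continuous feature.
    by_feature = defaultdict(list)
    for row in samples:
        for idx in continuous_features:
            by_feature[idx].append(row[idx])

    # Analogue of a per-key map: features are independent of each
    # other, so each one can be processed as a separate task.
    def task(item):
        idx, values = item
        return idx, find_splits_for_feature(values, num_splits)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, by_feature.items()))


if __name__ == "__main__":
    # Column 1 is treated as categorical and is therefore skipped.
    samples = [
        [1.0, 0, 10.0],
        [2.0, 1, 20.0],
        [3.0, 0, 30.0],
        [4.0, 1, 40.0],
    ]
    print(find_splits(samples, continuous_features=[0, 2], num_splits=1))
```

Because each feature's thresholds depend only on that feature's values, the per-feature work needs no coordination, which is what makes this step easy to run in parallel.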

Diffstat (limited to 'python/pyspark/ml/regression.py')

0 files changed, 0 insertions, 0 deletions

