[SPARK-16719][ML] Random Forests should communicate fewer trees on each iteration - spark

author	Joseph K. Bradley <joseph@databricks.com>	2016-09-22 22:27:28 -0700
committer	Joseph K. Bradley <joseph@databricks.com>	2016-09-22 22:27:28 -0700
commit	947b8c6e3acd671d501f0ed6c077aac8e51ccede (patch)
tree	fd6dc81c5d9ae8b4725ab9b90a372ccad4d69a87 /sql/core
parent	a4aeb7677bc07d0b83f82de62dcffd7867d19d9b (diff)
download	spark-947b8c6e3acd671d501f0ed6c077aac8e51ccede.tar.gz spark-947b8c6e3acd671d501f0ed6c077aac8e51ccede.tar.bz2 spark-947b8c6e3acd671d501f0ed6c077aac8e51ccede.zip

[SPARK-16719][ML] Random Forests should communicate fewer trees on each iteration

## What changes were proposed in this pull request?

RandomForest currently sends the entire forest to each worker on each iteration. This is because (a) the node queue is FIFO and (b) the closure references the entire array of trees (topNodes). (a) causes RFs to handle splits in many trees, especially early on in learning. (b) sends all trees explicitly.

This PR:
(a) Change the RF node queue to be FILO (a stack), so that RFs tend to focus on 1 or a few trees before focusing on others.
(b) Change topNodes to pass only the trees required on that iteration.

## How was this patch tested?

Unit tests:
* Existing tests for correctness of tree learning
* Manually modifying code and running tests to verify that a small number of trees are communicated on each iteration
  * This last item is hard to test via unit tests given the current APIs.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14359 from jkbradley/rfs-fewer-trees.
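
The effect of the queue change described in the commit message can be illustrated with a small toy model. This is a sketch only, not Spark's RandomForest code: the names `simulate`, `active_span`, and `group_size` are hypothetical, and it assumes every tree is a full binary tree of fixed depth and that each iteration splits a fixed number of nodes. For each iteration it records which trees have nodes in the group, which are the only trees that would need to be communicated under change (b).

```python
from collections import deque

def simulate(num_trees, max_depth, group_size, lifo):
    """Toy model: return, per iteration, the sorted tree indices whose nodes are split."""
    # Queue entries are (tree_index, depth); every tree's root starts in the queue.
    queue = deque((tree, 0) for tree in range(num_trees))
    schedule = []
    while queue:
        # Pull up to group_size nodes: newest first for a stack, oldest first for FIFO.
        group = []
        while queue and len(group) < group_size:
            group.append(queue.pop() if lifo else queue.popleft())
        # Only the trees that own a node in this group are needed in this iteration.
        schedule.append(sorted({tree for tree, _ in group}))
        # Splitting a node above the depth limit enqueues its two children.
        for tree, depth in group:
            if depth + 1 < max_depth:
                queue.extend([(tree, depth + 1)] * 2)
    return schedule

def active_span(schedule, tree):
    """Number of iterations from a tree's first split to its last, inclusive."""
    hits = [i for i, trees in enumerate(schedule) if tree in trees]
    return hits[-1] - hits[0] + 1

fifo = simulate(num_trees=8, max_depth=3, group_size=4, lifo=False)
stack = simulate(num_trees=8, max_depth=3, group_size=4, lifo=True)

print("FIFO :", fifo)
print("stack:", stack)
print("longest FIFO span :", max(active_span(fifo, t) for t in range(8)))
print("longest stack span:", max(active_span(stack, t) for t in range(8)))
```

With these parameters, the stack returns to the trees it has just split and finishes each of them within a few iterations, while the FIFO queue cycles through all trees level by level and keeps every tree open until the final pass.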

Diffstat (limited to 'sql/core')

0 files changed, 0 insertions, 0 deletions

