[SPARK-11316] coalesce doesn't handle UnionRDD with partial locality properly - spark


author	Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>	2016-05-03 13:43:20 -0700
committer	Davies Liu <davies.liu@gmail.com>	2016-05-03 13:43:20 -0700
commit	83ee92f60345f016a390d61a82f1d924f64ddf90 (patch)
tree	72c7760110fe63ca948366c0bccade9ae724754a /repl/scala-2.10/src
parent	a4aed71719b4fc728de93afc623aef05d27bc89a (diff)
download	spark-83ee92f60345f016a390d61a82f1d924f64ddf90.tar.gz spark-83ee92f60345f016a390d61a82f1d924f64ddf90.tar.bz2 spark-83ee92f60345f016a390d61a82f1d924f64ddf90.zip

[SPARK-11316] coalesce doesn't handle UnionRDD with partial locality properly

## What changes were proposed in this pull request?

coalesce doesn't handle UnionRDD with partial locality properly. I had a user with a UnionRDD made up of a mapPartitionRDD without preferred locations and a checkpointedRDD with preferred locations (read from HDFS). It took the driver over 20 minutes to set up the groups and put the partitions into those groups before it even started any tasks. Perhaps even worse, the job didn't end up with the number of partitions he asked for, because coalesce didn't put a partition in each of the groups properly.

The changes in this patch:

- get rid of an n^2 while loop that was causing the 20 minutes;
- properly distribute the partitions so that every group has at least one;
- replace the rotation iterator, which fetched the preferred locations many times, with a single up-front fetch of all the preferred locations.

Note that the n^2 while loop I removed in setupGroups took so long because all of the partitions with preferred locations were already assigned to a group, so it looped through every single one without ever being able to assign it. At the time I had 960 partitions with preferred locations and 1020 without, and the outer while loop ran 319 times because that was the number of groups left to create. Each pass through the inner while loop goes off to HDFS to get the block locations, so this is extremely inefficient.

## How was this patch tested?

Added unit tests for this case and ran the existing ones that applied to make sure there were no regressions. Also tested manually on the user's production job to make sure it fixed their issue: it created the proper number of partitions, and it now takes about 6 seconds rather than 20 minutes. I also ran some basic manual tests with spark-shell, coalescing to a smaller number, the same number, and then a greater number with shuffle.

Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>

Closes #11327 from tgravescs/SPARK-11316.
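The two ideas described in the commit message (fetch each partition's preferred locations once, and guarantee every group gets at least one partition) can be illustrated with a minimal Python sketch. This is not the code from the patch, which changes Spark's Scala coalesce logic. The function `assign_to_groups`, the callback `get_locations`, and the round-robin placement of leftover partitions are all hypothetical names and simplifications chosen for illustration.

```python
def assign_to_groups(partitions, get_locations, num_groups):
    """Illustrative sketch only; not Spark's actual coalesce implementation.

    partitions:    list of hashable partition ids (hypothetical)
    get_locations: callable returning the preferred hosts of a partition;
                   stands in for an expensive lookup such as an HDFS
                   block-location query (hypothetical)
    num_groups:    requested number of output groups
    """
    # Never create more groups than there are partitions to fill them.
    num_groups = min(num_groups, len(partitions))
    if num_groups == 0:
        return []

    # Fetch preferred locations exactly once per partition, up front,
    # instead of re-querying them inside nested loops.
    locations = {p: tuple(get_locations(p)) for p in partitions}

    # Stable sort: partitions that have preferred locations come first,
    # so they are used as group seeds when available.
    ordered = sorted(partitions, key=lambda p: not locations[p])

    # Seed every group with one partition so that no group is left empty.
    groups = [[p] for p in ordered[:num_groups]]

    # Spread the remaining partitions round-robin in a single pass
    # (a simplification; a real coalescer would also weigh locality).
    for i, p in enumerate(ordered[num_groups:]):
        groups[i % num_groups].append(p)

    return groups
```

Because the lookup results are cached in a dictionary before any grouping happens, the number of expensive location queries equals the number of partitions, regardless of how many groups are requested.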

Diffstat (limited to 'repl/scala-2.10/src')

0 files changed, 0 insertions, 0 deletions

