author | Patrick Wendell <pwendell@gmail.com> | 2014-05-11 17:11:55 -0700 |
---|---|---|
committer | Patrick Wendell <pwendell@gmail.com> | 2014-05-11 17:11:55 -0700 |
commit | 7d9cc9214bd06495f6838e355331dd2b5f1f7407 (patch) | |
tree | 64ff0aab184553bf21b7aa2789c8fc5a425787d6 /bin/spark-submit | |
parent | 6bee01dd04ef73c6b829110ebcdd622d521ea8ff (diff) | |
SPARK-1770: Load balance elements when repartitioning.
This patch improves balancing when repartitioning an RDD. Previously,
the elements in the RDD were hash partitioned, so if the RDD was
skewed, certain partitions could end up being very large.
This commit adds load balancing of elements across the repartitioned
RDD splits. The load balancing is not perfect: a given output partition
can have up to N more elements than the average if there are N input
partitions. However, some randomization is used to minimize the
probability that this happens.
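
The difference between the two strategies can be illustrated with a small
simulation. This is a minimal sketch in plain Python, not the code from the
patch: it assumes the balancing works by dealing each input partition's
elements round-robin across the output partitions, starting at a random
offset seeded by the input partition index. The function names
`hash_repartition` and `balanced_repartition` are hypothetical and exist
only for this illustration.

```python
import random
from collections import Counter


def hash_repartition(partitions, num_out):
    """Count elements per output partition when each element is placed
    by hash(element) % num_out (the skew-prone strategy)."""
    counts = Counter()
    for part in partitions:
        for x in part:
            counts[hash(x) % num_out] += 1
    return [counts[i] for i in range(num_out)]


def balanced_repartition(partitions, num_out):
    """Count elements per output partition when each input partition
    deals its elements round-robin from a random starting offset.
    ASSUMPTION: seeding by input partition index is an illustrative
    choice, not something stated in the commit message."""
    counts = Counter()
    for index, part in enumerate(partitions):
        position = random.Random(index).randrange(num_out)
        for _ in part:
            position = (position + 1) % num_out
            counts[position] += 1
    return [counts[i] for i in range(num_out)]


# A heavily skewed input: three input partitions, almost all values equal.
skewed = [[7] * 1000, [7] * 10, [3] * 5]
hashed = hash_repartition(skewed, 4)
balanced = balanced_repartition(skewed, 4)
print("hash partitioned:", hashed)
print("load balanced:   ", balanced)
```

Under this scheme each input partition gives every output partition either
the floor or the ceiling of its share, so an output partition can exceed
the average by at most one element per input partition, which matches the
bound of N stated above. The random starting offset keeps the extra
elements from always landing on the same output partitions.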
Author: Patrick Wendell <pwendell@gmail.com>
Closes #727 from pwendell/load-balance and squashes the following commits:
f9da752 [Patrick Wendell] Response to Matei's feedback
acfa46a [Patrick Wendell] SPARK-1770: Load balance elements when repartitioning.