author: Patrick Wendell <pwendell@gmail.com> 2014-05-11 17:11:55 -0700
committer: Patrick Wendell <pwendell@gmail.com> 2014-05-11 17:11:55 -0700
commit: 7d9cc9214bd06495f6838e355331dd2b5f1f7407
tree: 64ff0aab184553bf21b7aa2789c8fc5a425787d6
parent: 6bee01dd04ef73c6b829110ebcdd622d521ea8ff
SPARK-1770: Load balance elements when repartitioning.

This patch adds better balancing when repartitioning an RDD. Previously the elements of the RDD were hash partitioned, so if the RDD was skewed, certain partitions would end up very large. This commit adds load balancing of elements across the repartitioned RDD splits.

The load balancing is not perfect: a given output partition can have up to N more elements than the average if there are N input partitions. However, some randomization is used to minimize the probability that this happens.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #727 from pwendell/load-balance and squashes the following commits:

f9da752 [Patrick Wendell] Response to Matei's feedback
acfa46a [Patrick Wendell] SPARK-1770: Load balance elements when repartitioning.
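
To illustrate the balancing behaviour described in the commit message, here is a minimal sketch in plain Python. It does not use Spark and is not taken from the patch. Partitions are modelled as lists, and the names `balanced_repartition` and `hash_repartition` are hypothetical. The sketch assumes that each input partition deals its elements out round-robin, starting from a pseudo-random offset seeded by the partition index. The commit message only says that "some randomization" is used, so the seeding choice is an assumption.

```python
import random


def hash_repartition(input_partitions, num_output):
    """Baseline: place each element by its hash (stand-in for the old behaviour)."""
    outputs = [[] for _ in range(num_output)]
    for items in input_partitions:
        for item in items:
            outputs[hash(item) % num_output].append(item)
    return outputs


def balanced_repartition(input_partitions, num_output):
    """Deal each input partition's elements round-robin over the outputs.

    Assumption: the starting offset is drawn from a generator seeded with
    the input partition index, so different inputs start at different outputs.
    """
    outputs = [[] for _ in range(num_output)]
    for index, items in enumerate(input_partitions):
        position = random.Random(index).randrange(num_output)
        for item in items:
            position += 1
            outputs[position % num_output].append(item)
    return outputs


# A skewed data set: almost every element has the same value.
skewed = [[7] * 1000, [7] * 10, [3] * 5]

hashed_sizes = [len(p) for p in hash_repartition(skewed, 4)]
balanced_sizes = [len(p) for p in balanced_repartition(skewed, 4)]
print("hash partitioned:", hashed_sizes)
print("load balanced:   ", balanced_sizes)
```

Each input partition contributes either the floor or the ceiling of its share to every output, so an output can exceed the average by at most one element per input partition. That matches the "up to N more elements than the average" bound stated above. The random starting offset makes it less likely that the extra elements from several inputs all land on the same output.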
Diffstat (limited to 'bin/spark-submit')
0 files changed, 0 insertions, 0 deletions