path: root/R/pkg/inst/tests/testthat/test_mllib.R
author	Liang-Chi Hsieh <viirya@gmail.com>	2016-10-11 11:43:24 -0700
committer	Felix Cheung <felixcheung@apache.org>	2016-10-11 11:43:24 -0700
commit	07508bd01d16f3331be167ff92770d19c8b1f46a (patch)
tree	c9c4e44b1908c44a0f9235d2c60ced1587201c06 /R/pkg/inst/tests/testthat/test_mllib.R
parent	75b9e351413dca0930e8545e6283874db09d8482 (diff)
download	spark-07508bd01d16f3331be167ff92770d19c8b1f46a.tar.gz
	spark-07508bd01d16f3331be167ff92770d19c8b1f46a.tar.bz2
	spark-07508bd01d16f3331be167ff92770d19c8b1f46a.zip
[SPARK-17817][PYSPARK] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
## What changes were proposed in this pull request?

Quoted from the JIRA description:

Calling repartition on a PySpark RDD to increase the number of partitions results in highly skewed partition sizes, with most having 0 rows. The repartition method should spread the rows evenly across the partitions, and this behavior is correctly seen on the Scala side. The following code gives a reproducible example of the issue:

    num_partitions = 20000
    a = sc.parallelize(range(int(1e6)), 2)   # start with 2 even partitions
    l = a.repartition(num_partitions).glom().map(len).collect()  # get length of each partition
    min(l), max(l), sum(l)/len(l), len(l)    # skewed!

In Scala's `repartition` code, elements are distributed evenly across the output partitions. However, an RDD coming from Python is serialized as a single block of binary data, so the distribution fails. We need to convert the Python RDD to Java objects before repartitioning.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #15389 from viirya/pyspark-rdd-repartition.
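The description above pins the root cause on serialization: each Python partition reaches the JVM as one pickled blob, so Scala's element-level distribution has nothing to spread. The sketch below is not part of this patch; it reproduces the skew on an unpatched build and shows a user-level workaround that hash-partitions individual elements with `partitionBy`. The partition counts, the app name, and the helper `partition_sizes` are illustrative assumptions, not code from the change.

    # Minimal sketch (assumptions noted above): reproduce the skew and a workaround.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "spark-17817-repro")

    def partition_sizes(rdd):
        """Return the number of rows in each partition of `rdd`."""
        return rdd.glom().map(len).collect()

    num_partitions = 200
    a = sc.parallelize(range(100000), 2)   # start with 2 even partitions

    # On an unpatched build, repartition() may leave most output partitions empty,
    # because each input partition is shuffled as a single serialized blob.
    skewed = partition_sizes(a.repartition(num_partitions))
    print(min(skewed), max(skewed))

    # Workaround: hash-partition individual elements via partitionBy, then drop the keys.
    even = partition_sizes(
        a.map(lambda x: (x, None)).partitionBy(num_partitions).keys()
    )
    print(min(even), max(even))

    sc.stop()

The workaround helps because `partitionBy` assigns a target partition to each element on the Python side before the data is shipped to the JVM, which is roughly the even-distribution behavior this patch restores for `repartition` itself.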
Diffstat (limited to 'R/pkg/inst/tests/testthat/test_mllib.R')
0 files changed, 0 insertions, 0 deletions