[SPARK-2568] RangePartitioner should run only one job if data is balanced - spark

author	Xiangrui Meng <meng@databricks.com>	2014-07-29 22:16:20 -0700
committer	Reynold Xin <rxin@apache.org>	2014-07-29 22:16:20 -0700
commit	2e6efcacea19bddbdae1d655ef54186f2e52747f (patch)
tree	f6d1a766cd65c7aec636dacffbc4c62c37298315 /sql
parent	84467468d466dadf4708a7d6a808471305149713 (diff)
download	spark-2e6efcacea19bddbdae1d655ef54186f2e52747f.tar.gz spark-2e6efcacea19bddbdae1d655ef54186f2e52747f.tar.bz2 spark-2e6efcacea19bddbdae1d655ef54186f2e52747f.zip

[SPARK-2568] RangePartitioner should run only one job if data is balanced

As of Spark 1.0, `RangePartitioner` goes through the data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through the data 3 times (once to count, once to sample, and once to sort).

`RangePartitioner` should go through the data only once, collecting samples from input partitions as well as counting. If the data is balanced, this should give us a good sketch. If we see big partitions, we re-sample from them in order to collect enough items.

The downside is that we need to collect more from each partition in the first pass. An alternative solution is caching the intermediate result and deciding afterwards whether to fetch the data.

Author: Xiangrui Meng <meng@databricks.com>
Author: Reynold Xin <rxin@apache.org>

Closes #1562 from mengxr/range-partitioner and squashes the following commits:

6cc2551 [Xiangrui Meng] change foreach to for
eb39b08 [Xiangrui Meng] Merge branch 'master' into range-partitioner
eb95dd8 [Xiangrui Meng] separate sketching and determining bounds impl
c436d30 [Xiangrui Meng] fix binary metrics unit tests
db58a55 [Xiangrui Meng] add unit tests
a6e35d6 [Xiangrui Meng] minor update
60be09e [Xiangrui Meng] remove importance sampler
9ee9992 [Xiangrui Meng] update range partitioner to run only one job on roughly balanced data
cc12f47 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
06ac2ec [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
17bcbf3 [Reynold Xin] Added seed.
badf20d [Reynold Xin] Renamed the method.
6940010 [Reynold Xin] Reservoir sampling implementation.
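
The one-pass idea in the commit message above (sample and count each partition together, then re-sample only the partitions that turn out to be large) can be illustrated with a small standalone sketch. This is not the Spark code from this patch: it is plain Python over in-memory lists, and the function names (`reservoir_sample_and_count`, `sketch`, `find_imbalanced`), the seeding scheme, and the imbalance test are assumptions made for illustration only.

```python
import random


def reservoir_sample_and_count(items, k, seed):
    """Single pass over items: keep a uniform sample of up to k items
    (reservoir sampling, Algorithm R) and count the items seen."""
    rng = random.Random(seed)
    reservoir = []
    count = 0
    for item in items:
        count += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / count.
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = item
    return reservoir, count


def sketch(partitions, sample_size_per_partition, seed=0):
    """Sample and count every partition in one pass each.
    Returns (total_count, [(sample, count), ...])."""
    sketched = [
        reservoir_sample_and_count(part, sample_size_per_partition, seed + i)
        for i, part in enumerate(partitions)
    ]
    total = sum(n for _, n in sketched)
    return total, sketched


def find_imbalanced(total, sketched, target_sample_size):
    """Hypothetical imbalance test: a partition needs re-sampling when its
    fair share of the target sample exceeds what the first pass collected."""
    fraction = min(target_sample_size / max(total, 1), 1.0)
    return [
        i for i, (sample, n) in enumerate(sketched)
        if fraction * n > len(sample)
    ]


if __name__ == "__main__":
    parts = [list(range(10)), list(range(10, 20)), list(range(1000))]
    total, sketched = sketch(parts, sample_size_per_partition=5)
    print(total, find_imbalanced(total, sketched, target_sample_size=15))
```

With roughly balanced partitions the imbalance list is empty and no second pass is needed; only partitions flagged by `find_imbalanced` would be sampled again, which corresponds to the "re-sample from them" step described above.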

Diffstat (limited to 'sql')

0 files changed, 0 insertions, 0 deletions

