author | Xiangrui Meng <meng@databricks.com> | 2014-07-29 22:16:20 -0700 |
---|---|---|
committer | Reynold Xin <rxin@apache.org> | 2014-07-29 22:16:20 -0700 |
commit | 2e6efcacea19bddbdae1d655ef54186f2e52747f | |
tree | f6d1a766cd65c7aec636dacffbc4c62c37298315 /sql | |
parent | 84467468d466dadf4708a7d6a808471305149713 | |
[SPARK-2568] RangePartitioner should run only one job if data is balanced
As of Spark 1.0, `RangePartitioner` goes through the data twice: once to compute the count and once to sample. As a result, `sortByKey` goes through the data three times (once to count, once to sample, and once to sort).
`RangePartitioner` should go through data only once, collecting samples from input partitions as well as counting. If the data is balanced, this should give us a good sketch. If we see big partitions, we re-sample from them in order to collect enough items.
The downside is that we need to collect more items from each partition in the first pass. An alternative solution is to cache the intermediate result and decide afterwards whether to fetch the data.
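The one-pass idea described above can be sketched as follows. This is an illustrative Python sketch, not the code in this patch: the function names (`reservoir_sample_and_count`, `sketch`, `partitions_to_resample`), the per-partition seeding, and the exact imbalance rule (`fraction * n > sample_size_per_partition`) are assumptions made for the example. Each partition is scanned once, producing both an item count and a fixed-size reservoir sample; partitions that turn out to be much larger than their sample can represent are flagged for a second, targeted sampling pass.

```python
import random
from typing import Iterable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample_and_count(
    items: Iterable[T], k: int, rng: random.Random
) -> Tuple[List[T], int]:
    """Return up to k uniformly sampled items and the total count, in one pass."""
    reservoir: List[T] = []
    count = 0
    for item in items:
        count += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / count,
            # evicting a uniformly chosen slot.
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = item
    return reservoir, count


def sketch(
    partitions: Sequence[Iterable[T]], sample_size_per_partition: int, seed: int = 0
) -> Tuple[int, List[Tuple[int, int, List[T]]]]:
    """Scan every partition once; return (total count, [(index, count, sample)])."""
    sketched = []
    for idx, part in enumerate(partitions):
        # Hypothetical seeding scheme: one generator per partition.
        rng = random.Random(seed + idx)
        sample, n = reservoir_sample_and_count(part, sample_size_per_partition, rng)
        sketched.append((idx, n, sample))
    total = sum(n for _, n, _ in sketched)
    return total, sketched


def partitions_to_resample(
    total: int,
    sketched: List[Tuple[int, int, List[T]]],
    sample_size: int,
    sample_size_per_partition: int,
) -> List[int]:
    """Flag partitions that would contribute more than the per-partition cap
    under a uniform sampling fraction (assumed rule, for illustration only)."""
    fraction = min(sample_size / max(total, 1), 1.0)
    return [
        idx
        for idx, n, _ in sketched
        if fraction * n > sample_size_per_partition
    ]


if __name__ == "__main__":
    parts = [range(100), range(100, 200), range(200, 10200)]
    total, sketched = sketch(parts, sample_size_per_partition=30)
    print(total, partitions_to_resample(total, sketched, 60, 30))
```

With roughly balanced partitions the flagged list is empty and no further job is needed; with one oversized partition, as in the usage example, only that partition would be re-sampled. The trade-off named in the message shows up as `sample_size_per_partition`: it has to be larger than a strictly proportional share so that the first pass is usually sufficient.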
Author: Xiangrui Meng <meng@databricks.com>
Author: Reynold Xin <rxin@apache.org>
Closes #1562 from mengxr/range-partitioner and squashes the following commits:
6cc2551 [Xiangrui Meng] change foreach to for
eb39b08 [Xiangrui Meng] Merge branch 'master' into range-partitioner
eb95dd8 [Xiangrui Meng] separate sketching and determining bounds impl
c436d30 [Xiangrui Meng] fix binary metrics unit tests
db58a55 [Xiangrui Meng] add unit tests
a6e35d6 [Xiangrui Meng] minor update
60be09e [Xiangrui Meng] remove importance sampler
9ee9992 [Xiangrui Meng] update range partitioner to run only one job on roughly balanced data
cc12f47 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
06ac2ec [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
17bcbf3 [Reynold Xin] Added seed.
badf20d [Reynold Xin] Renamed the method.
6940010 [Reynold Xin] Reservoir sampling implementation.
Diffstat (limited to 'sql')
0 files changed, 0 insertions, 0 deletions