aboutsummaryrefslogtreecommitdiff
path: root/sql
diff options
context:
space:
mode:
authorXiangrui Meng <meng@databricks.com>2014-07-29 22:16:20 -0700
committerReynold Xin <rxin@apache.org>2014-07-29 22:16:20 -0700
commit2e6efcacea19bddbdae1d655ef54186f2e52747f (patch)
treef6d1a766cd65c7aec636dacffbc4c62c37298315 /sql
parent84467468d466dadf4708a7d6a808471305149713 (diff)
downloadspark-2e6efcacea19bddbdae1d655ef54186f2e52747f.tar.gz
spark-2e6efcacea19bddbdae1d655ef54186f2e52747f.tar.bz2
spark-2e6efcacea19bddbdae1d655ef54186f2e52747f.zip
[SPARK-2568] RangePartitioner should run only one job if data is balanced
As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort). `RangePartitioner` should go through data only once, collecting samples from input partitions as well as counting. If the data is balanced, this should give us a good sketch. If we see big partitions, we re-sample from them in order to collect enough items. The downside is that we need to collect more from each partition in the first pass. An alternative solution is caching the intermediate result and decide whether to fetch the data after. Author: Xiangrui Meng <meng@databricks.com> Author: Reynold Xin <rxin@apache.org> Closes #1562 from mengxr/range-partitioner and squashes the following commits: 6cc2551 [Xiangrui Meng] change foreach to for eb39b08 [Xiangrui Meng] Merge branch 'master' into range-partitioner eb95dd8 [Xiangrui Meng] separate sketching and determining bounds impl c436d30 [Xiangrui Meng] fix binary metrics unit tests db58a55 [Xiangrui Meng] add unit tests a6e35d6 [Xiangrui Meng] minor update 60be09e [Xiangrui Meng] remove importance sampler 9ee9992 [Xiangrui Meng] update range partitioner to run only one job on roughly balanced data cc12f47 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part 06ac2ec [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part 17bcbf3 [Reynold Xin] Added seed. badf20d [Reynold Xin] Renamed the method. 6940010 [Reynold Xin] Reservoir sampling implementation.
Diffstat (limited to 'sql')
0 files changed, 0 insertions, 0 deletions