[SPARK-8964] [SQL] Use Exchange to perform shuffle in Limit - spark

diff options

author	Josh Rosen <joshrosen@databricks.com>	2016-02-08 11:38:21 -0800
committer	Davies Liu <davies.liu@gmail.com>	2016-02-08 11:38:21 -0800
commit	06f0df6df204c4722ff8a6bf909abaa32a715c41 (patch)
tree	2e2772173ea0dc38a555c3edf08094aac2cd4ec8 /docs/mllib-decision-tree.md
parent	edf4a0e62e6fdb849cca4f23a7060da5ec782b07 (diff)
download	spark-06f0df6df204c4722ff8a6bf909abaa32a715c41.tar.gz spark-06f0df6df204c4722ff8a6bf909abaa32a715c41.tar.bz2 spark-06f0df6df204c4722ff8a6bf909abaa32a715c41.zip

[SPARK-8964] [SQL] Use Exchange to perform shuffle in Limit

This patch changes the implementation of the physical `Limit` operator so that it relies on the `Exchange` operator to perform data movement rather than directly using `ShuffledRDD`. In addition to improving efficiency, this lays the necessary groundwork for further optimization of limit, such as limit pushdown or whole-stage codegen. At a high-level, this replaces the old physical `Limit` operator with two new operators, `LocalLimit` and `GlobalLimit`. `LocalLimit` performs per-partition limits, while `GlobalLimit` applies the final limit to a single partition; `GlobalLimit`'s declares that its `requiredInputDistribution` is `SinglePartition`, which will cause the planner to use an `Exchange` to perform the appropriate shuffles. Thus, a logical `Limit` appearing in the middle of a query plan will be expanded into `LocalLimit -> Exchange to one partition -> GlobalLimit`. In the old code, calling `someDataFrame.limit(100).collect()` or `someDataFrame.take(100)` would actually skip the shuffle and use a fast-path which used `executeTake()` in order to avoid computing all partitions in case only a small number of rows were requested. This patch preserves this optimization by treating logical `Limit` operators specially when they appear as the terminal operator in a query plan: if a `Limit` is the final operator, then we will plan a special `CollectLimit` physical operator which implements the old `take()`-based logic. In order to be able to match on operators only at the root of the query plan, this patch introduces a special `ReturnAnswer` logical operator which functions similar to `BroadcastHint`: this dummy operator is inserted at the root of the optimized logical plan before invoking the physical planner, allowing the planner to pattern-match on it. Author: Josh Rosen <joshrosen@databricks.com> Closes #7334 from JoshRosen/remove-copy-in-limit.

Diffstat (limited to 'docs/mllib-decision-tree.md')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: