author    Josh Rosen <joshrosen@databricks.com>  2016-02-14 17:32:21 -0800
committer Reynold Xin <rxin@databricks.com>      2016-02-14 17:32:21 -0800
commit    a8bbc4f50ef3faacf4b7fe865a29144ea87f0796 (patch)
tree      feb2dafda734c39c96b60009d4ec2920753b9e9d /graphx
parent    7cb4d74c98c2f1765b48a549f62e47b53ed29b38 (diff)
[SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN
This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:

- If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
- If a limit is on top of an `OUTER JOIN`, then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` joins, we will only push limits when at most one of the inputs is already limited: if one input is limited, we will push a smaller limit on top of it, and if neither input is limited, we will limit the input that is estimated to be larger.

These optimizations were proposed previously by gatorsmile in #10451 and #10454, but those earlier PRs were closed and deferred because, at the time, Spark's physical `Limit` operator triggered a full shuffle to perform global limits, so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In #7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local limiting.

When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. To handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute; the pushdown rules then only push limits to children whose `maxRows` exceeds the limit's `maxRows`. This idea is carried over from #10451; see that patch for additional discussion.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11121 from JoshRosen/limit-pushdown-2.
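To make the pushdown and the `maxRows` guard concrete, here is a minimal, self-contained Scala sketch of the rule on a toy plan tree. The names `Plan`, `Scan`, `UnionAll`, `Join`, and `maybeLimit` are illustrative stand-ins, not Spark's actual Catalyst classes; only the `LocalLimit`/`maxRows` idea mirrors the patch, and the `FULL OUTER` case (which needs size estimates) is omitted.

    // Toy logical plan. Real Catalyst plans carry much more; this only models
    // what the pushdown rule needs: tree shape plus a maxRows upper bound.
    sealed trait Plan {
      // Maximum number of rows this plan can produce, if statically known.
      def maxRows: Option[Long] = None
    }
    case class Scan(table: String) extends Plan
    case class LocalLimit(limit: Long, child: Plan) extends Plan {
      override def maxRows: Option[Long] =
        Some(child.maxRows.fold(limit)(_ min limit))
    }
    case class UnionAll(children: Seq[Plan]) extends Plan {
      // UNION ALL can produce at most the sum of its children's bounds.
      override def maxRows: Option[Long] =
        children.foldLeft(Option(0L)) {
          case (Some(acc), c) => c.maxRows.map(acc + _)
          case _              => None
        }
    }
    sealed trait JoinType
    case object LeftOuter  extends JoinType
    case object RightOuter extends JoinType
    case class Join(left: Plan, right: Plan, joinType: JoinType) extends Plan

    object LimitPushdown {
      // Wrap a child in a local limit only if it might produce more rows than
      // the limit allows. Once pushed down, the child reports maxRows <= n,
      // so the rule cannot keep rewriting the same plan forever.
      private def maybeLimit(n: Long, child: Plan): Plan =
        if (child.maxRows.exists(_ <= n)) child else LocalLimit(n, child)

      // One rewrite step; note the enclosing limit is kept in place.
      def apply(plan: Plan): Plan = plan match {
        case LocalLimit(n, UnionAll(children)) =>
          LocalLimit(n, UnionAll(children.map(maybeLimit(n, _))))
        case LocalLimit(n, Join(l, r, LeftOuter)) =>
          // A left outer join emits at least one row per left row, so n left
          // rows suffice to produce n output rows; the outer limit trims extras.
          LocalLimit(n, Join(maybeLimit(n, l), r, LeftOuter))
        case LocalLimit(n, Join(l, r, RightOuter)) =>
          LocalLimit(n, Join(l, maybeLimit(n, r), RightOuter))
        case other => other
      }
    }

For example, `LimitPushdown(LocalLimit(10, UnionAll(Seq(Scan("a"), Scan("b")))))` yields `LocalLimit(10, UnionAll(Seq(LocalLimit(10, Scan("a")), LocalLimit(10, Scan("b")))))`; applying the rule again returns an equal plan, because each child now reports `maxRows = Some(10)`, so a fixed-point optimizer would stop rather than stack limits indefinitely.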
Diffstat (limited to 'graphx')
0 files changed, 0 insertions, 0 deletions