author | Yin Huai <huai@cse.ohio-state.edu> | 2014-08-02 13:16:41 -0700 |
---|---|---|
committer | Michael Armbrust <michael@databricks.com> | 2014-08-02 13:16:41 -0700 |
commit | 67bd8e3c217a80c3117a6e3853aa60fe13d08c91 (patch) | |
tree | 9cb43882eba9311c918b87d7def3abd93999fe95 /conf | |
parent | 3f67382e7c9c3f6a8f6ce124ab3fcb1a9c1a264f (diff) | |
[SQL] Set outputPartitioning of BroadcastHashJoin correctly.
I don't think we currently generate a plan that triggers this bug, but let me explain how it could occur.
Right now, we are using `left.outputPartitioning` as the `outputPartitioning` of a `BroadcastHashJoin`. This can give us a wrong physical plan for queries like:
```sql
SELECT l.key, count(*)
FROM (SELECT key, count(*) AS cnt
      FROM src
      GROUP BY key) l  -- This is the buildPlan
JOIN r                 -- This is the streamedPlan
ON (l.cnt = r.value)
GROUP BY l.key
```
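Let's say we have a `BroadcastHashJoin` on `l` and `r`. For this case, we will pick `l`'s `outputPartitioning` as the `outputPartitioning` of the `BroadcastHashJoin`. Because the last `GROUP BY` uses `l.key` as its key, the planner then sees the required distribution as already satisfied and does not introduce an `Exchange` for this aggregation. However, `l` is the broadcast (build) side, so the joined rows are actually laid out according to `r`, the streamed side. `r`'s `outputPartitioning` may not match the required distribution of the last `GROUP BY`, in which case rows with the same `l.key` end up in different partitions and we fail to group the data correctly.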
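The failure mode can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not Spark's implementation: the data, the two-partition layout of `r`, and the helper names `broadcast_hash_join`, `aggregate_without_exchange`, and `aggregate_with_exchange` are all hypothetical. It shows that the join output keeps the streamed side's partition layout, so aggregating per partition without a shuffle emits the same key more than once.

```python
from collections import Counter

# Hypothetical toy data. l is the result of the inner GROUP BY: (key, cnt).
l_rows = [("a", 1), ("b", 2)]
# r holds only the `value` column, split over two partitions that have
# no relationship to l.key.
r_partitions = [[1, 2], [1, 2, 2]]


def broadcast_hash_join(build_rows, streamed_partitions):
    """Join every streamed partition against a full copy of the build side."""
    keys_by_cnt = {}
    for key, cnt in build_rows:
        keys_by_cnt.setdefault(cnt, []).append(key)
    # The output keeps the streamed side's partition layout; only l.key is
    # kept per joined row because that is all the outer query needs.
    return [
        [key for value in part for key in keys_by_cnt.get(value, [])]
        for part in streamed_partitions
    ]


def aggregate_without_exchange(partitions):
    """GROUP BY l.key, count(*) run independently inside each partition."""
    return [pair for part in partitions for pair in sorted(Counter(part).items())]


def aggregate_with_exchange(partitions, num_partitions=2):
    """Shuffle rows by key first, then aggregate inside each partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key in part:
            shuffled[hash(key) % num_partitions].append(key)
    return aggregate_without_exchange(shuffled)


joined = broadcast_hash_join(l_rows, r_partitions)
wrong = aggregate_without_exchange(joined)       # "a" and "b" each appear twice
right = sorted(aggregate_with_exchange(joined))  # one row per key
```

Here `wrong` contains two partial counts for each key, which is what the final aggregation would return if it trusted `l`'s partitioning, while `right` contains a single total per key.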
JIRA is being reindexed. I will create a JIRA ticket once it is back online.
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1735 from yhuai/BroadcastHashJoin and squashes the following commits:
96d9cb3 [Yin Huai] Set outputPartitioning correctly.
Diffstat (limited to 'conf')
0 files changed, 0 insertions, 0 deletions