author | gatorsmile <gatorsmile@gmail.com> | 2016-04-29 15:30:36 +0800 |
---|---|---|
committer | Wenchen Fan <wenchen@databricks.com> | 2016-04-29 15:30:36 +0800 |
commit | 222dcf79377df33007d7a9780dafa2c740dbe6a3 (patch) | |
tree | e251b64b68f42d99d2de4ed96b95ca0b0ff1419c /sql/hive-thriftserver/src | |
parent | e249e6f8b551614c82cd62e827c3647166e918e3 (diff) | |
[SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join
#### What changes were proposed in this pull request?
Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
```SQL
SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```
Note:
1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
2. This rule has to be applied after de-duplicating the attributes; otherwise, the generated
join conditions will be incorrect.
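The rewrite above can be modelled without Spark. The following is a minimal sketch in plain Scala, not the optimizer rule itself: the object `ExceptAsAntiJoin`, the `Row` alias, and the sample tables are hypothetical names introduced for illustration, and `Option` is assumed as a stand-in for SQL `NULL`.

```scala
// A plain-Scala model of the rewrite (illustrative only; no Spark dependency).
object ExceptAsAntiJoin {
  // A row with two nullable integer columns; None models SQL NULL.
  type Row = (Option[Int], Option[Int])

  // Null-safe equality, like SQL's <=>: NULL <=> NULL is true.
  // Option's == already behaves this way, since None == None is true.
  def nullSafeEq(x: Option[Int], y: Option[Int]): Boolean = x == y

  // LEFT ANTI JOIN: keep each left row that has no matching right row.
  def leftAntiJoin(left: Seq[Row], right: Seq[Row]): Seq[Row] =
    left.filterNot { l =>
      right.exists(r => nullSafeEq(l._1, r._1) && nullSafeEq(l._2, r._2))
    }

  // EXCEPT DISTINCT expressed as DISTINCT over a left anti join.
  def exceptDistinct(left: Seq[Row], right: Seq[Row]): Seq[Row] =
    leftAntiJoin(left, right).distinct

  def main(args: Array[String]): Unit = {
    val tab1: Seq[Row] =
      Seq((Some(1), Some(1)), (Some(2), None), (Some(2), None), (None, None))
    val tab2: Seq[Row] =
      Seq((Some(1), Some(1)), (None, None))
    println(exceptDistinct(tab1, tab2))
  }
}
```

The model also suggests why note 1 holds: an anti join drops every copy of a left row as soon as one matching right row exists, so it cannot express the multiset subtraction that `EXCEPT ALL` requires.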
This PR also corrects the existing behavior in Spark. Before this PR, the behavior was as follows:
```scala
test("except") {
  val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
  val df_right = Seq(1, 3).toDF("id")
  checkAnswer(
    df_left.except(df_right),
    Row(2) :: Row(2) :: Row(4) :: Nil
  )
}
```
After this PR, the result is corrected: the duplicate `Row(2)` is no longer returned. We strictly follow the SQL semantics of `EXCEPT DISTINCT`.
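The difference between the two behaviors can be reproduced on the same input with ordinary Scala collections. This is a rough sketch, not Spark code: `ExceptBehavior` and both helper functions are hypothetical names, and the "before" function is only assumed to mirror the result shown in the test above.

```scala
// Plain-Scala comparison of the old and corrected results (illustrative only).
object ExceptBehavior {
  val left: Seq[Int] = Seq(1, 2, 2, 3, 3, 4)
  val right: Seq[Int] = Seq(1, 3)

  // Models the result before the PR: matching rows are removed,
  // but duplicates among the remaining rows are kept.
  def exceptKeepingDuplicates(l: Seq[Int], r: Seq[Int]): Seq[Int] =
    l.filterNot(x => r.contains(x))

  // Models EXCEPT DISTINCT: the same filter, followed by de-duplication.
  def exceptDistinct(l: Seq[Int], r: Seq[Int]): Seq[Int] =
    l.filterNot(x => r.contains(x)).distinct

  def main(args: Array[String]): Unit = {
    println(exceptKeepingDuplicates(left, right)) // List(2, 2, 4)
    println(exceptDistinct(left, right))          // List(2, 4)
  }
}
```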
#### How was this patch tested?
Modified and added a few test cases to verify the optimization rule and the results of operators.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #12736 from gatorsmile/exceptByAntiJoin.
Diffstat (limited to 'sql/hive-thriftserver/src')
0 files changed, 0 insertions, 0 deletions