author | gatorsmile <gatorsmile@gmail.com> | 2016-04-29 15:30:36 +0800 |
---|---|---|
committer | Wenchen Fan <wenchen@databricks.com> | 2016-04-29 15:30:36 +0800 |
commit | 222dcf79377df33007d7a9780dafa2c740dbe6a3 (patch) | |
tree | e251b64b68f42d99d2de4ed96b95ca0b0ff1419c /sql/core/src/test/java | |
parent | e249e6f8b551614c82cd62e827c3647166e918e3 (diff) | |
[SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join
#### What changes were proposed in this pull request?
Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
```SQL
SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```
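The rewrite above can be sketched outside Spark with plain Java collections. This is only an illustration of the semantics, not the optimizer rule itself: the class `ExceptDistinctSketch`, its methods, and the sample rows are hypothetical. `Objects.equals` stands in for the null-safe `<=>` comparison, so a `NULL` on the left matches a `NULL` on the right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch, not Spark code: rows are plain lists of column values.
public class ExceptDistinctSketch {

    // Null-safe row equality, analogous to a1 <=> b1 AND a2 <=> b2:
    // Objects.equals treats two nulls as equal.
    static boolean nullSafeEquals(List<Object> left, List<Object> right) {
        if (left.size() != right.size()) {
            return false;
        }
        for (int i = 0; i < left.size(); i++) {
            if (!Objects.equals(left.get(i), right.get(i))) {
                return false;
            }
        }
        return true;
    }

    // LEFT ANTI JOIN: keep each left row that has no null-safe match on the right.
    static List<List<Object>> leftAntiJoin(List<List<Object>> left,
                                           List<List<Object>> right) {
        List<List<Object>> kept = new ArrayList<>();
        for (List<Object> l : left) {
            boolean matched = false;
            for (List<Object> r : right) {
                if (nullSafeEquals(l, r)) {
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                kept.add(l);
            }
        }
        return kept;
    }

    // EXCEPT DISTINCT = DISTINCT applied to the anti-join output.
    static List<List<Object>> exceptDistinct(List<List<Object>> left,
                                             List<List<Object>> right) {
        return new ArrayList<>(new LinkedHashSet<>(leftAntiJoin(left, right)));
    }

    public static void main(String[] args) {
        List<List<Object>> tab1 = Arrays.asList(
            Arrays.<Object>asList(1, "x"),
            Arrays.<Object>asList(1, "x"),   // duplicate, removed by DISTINCT
            Arrays.<Object>asList(2, null),  // matches the right row via null-safe equality
            Arrays.<Object>asList(3, "z"));
        List<List<Object>> tab2 = Arrays.asList(
            Arrays.<Object>asList(2, null));
        System.out.println(exceptDistinct(tab1, tab2)); // prints [[1, x], [3, z]]
    }
}
```

With ordinary `=` instead of `<=>`, the row `(2, NULL)` would never match anything and would wrongly survive, which is why the generated join condition uses null-safe equality.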
Note:
1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
2. This rule has to be done after de-duplicating the attributes; otherwise, the generated
join conditions will be incorrect.
This PR also corrects the existing behavior in Spark. Before this PR, `except` kept duplicate rows from the left side, as in this test:
```scala
test("except") {
  val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
  val df_right = Seq(1, 3).toDF("id")
  checkAnswer(
    df_left.except(df_right),
    Row(2) :: Row(2) :: Row(4) :: Nil
  )
}
```
After this PR, the result is corrected: duplicates are removed, so the answer for the data above is `Row(2) :: Row(4) :: Nil`. This strictly follows the SQL semantics of `EXCEPT DISTINCT`.
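To make the difference concrete, here is a minimal Java sketch on the same data as the test (`1, 2, 2, 3, 3, 4` minus `1, 3`). The class and method names are hypothetical and this is not how Spark implements any of these operators. A plain anti-filter without de-duplication reproduces the pre-PR answer, `EXCEPT DISTINCT` adds de-duplication, and `EXCEPT ALL` is a multiset difference, which is why the rewrite rule must not be applied to it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch, not Spark code: single-column rows modeled as Integers.
public class ExceptSemanticsSketch {

    // Anti-filter only: drop left values present on the right, keep duplicates.
    static List<Integer> antiFilter(List<Integer> left, List<Integer> right) {
        List<Integer> out = new ArrayList<>();
        for (Integer v : left) {
            if (!right.contains(v)) {
                out.add(v);
            }
        }
        return out;
    }

    // EXCEPT DISTINCT: anti-filter followed by de-duplication.
    static List<Integer> exceptDistinct(List<Integer> left, List<Integer> right) {
        return new ArrayList<>(new LinkedHashSet<>(antiFilter(left, right)));
    }

    // EXCEPT ALL: multiset difference; each right row cancels at most one
    // equal left row.
    static List<Integer> exceptAll(List<Integer> left, List<Integer> right) {
        List<Integer> remaining = new ArrayList<>(right);
        List<Integer> out = new ArrayList<>();
        for (Integer v : left) {
            // The cast selects remove(Object), not remove(int index).
            if (!remaining.remove((Object) v)) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 2, 3, 3, 4);
        List<Integer> right = Arrays.asList(1, 3);
        System.out.println(antiFilter(left, right));     // prints [2, 2, 4]
        System.out.println(exceptDistinct(left, right)); // prints [2, 4]
        System.out.println(exceptAll(left, right));      // prints [2, 2, 3, 4]
    }
}
```

Note that the second `3` survives under `EXCEPT ALL` but not under an anti join, so an anti join followed by `DISTINCT` cannot express `EXCEPT ALL`.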
#### How was this patch tested?
Modified and added a few test cases to verify the optimization rule and the results of operators.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #12736 from gatorsmile/exceptByAntiJoin.
Diffstat (limited to 'sql/core/src/test/java')
-rw-r--r-- | sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
index 5abd62cbc2..f1b1c22e4a 100644
--- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
+++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
@@ -291,7 +291,7 @@ public class JavaDatasetSuite implements Serializable {
       unioned.collectAsList());
 
     Dataset<String> subtracted = ds.except(ds2);
-    Assert.assertEquals(Arrays.asList("abc", "abc"), subtracted.collectAsList());
+    Assert.assertEquals(Arrays.asList("abc"), subtracted.collectAsList());
   }
 
   private static <T> Set<T> toSet(List<T> records) {