[SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join

#### What changes were proposed in this pull request?

Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).

```SQL
SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```

Note:
1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
2. This rule has to be applied after de-duplicating the attributes; otherwise, the generated join conditions will be incorrect.

This PR also corrects the existing behavior in Spark. Before this PR, `except` kept duplicate rows from the left side, as this test shows:

```scala
test("except") {
  val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
  val df_right = Seq(1, 3).toDF("id")

  checkAnswer(
    df_left.except(df_right),
    Row(2) :: Row(2) :: Row(4) :: Nil
  )
}
```

After this PR, the result is corrected to `Row(2) :: Row(4) :: Nil`, strictly following the SQL semantics of `Except Distinct`.

#### How was this patch tested?

Modified and added a few test cases to verify the optimization rule and the results of operators.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12736 from gatorsmile/exceptByAntiJoin.
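The rewrite described above can be illustrated without Spark. The following is a minimal plain-Java sketch, not the optimizer rule itself: it models each row as a single value, uses `Objects.equals` as a stand-in for the null-safe `<=>` comparison, and the class and method names (`ExceptSketch`, `leftAntiJoin`, `exceptDistinct`) are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

public class ExceptSketch {

    // Left anti join: keep each left row that has no match on the right.
    // Objects.equals treats null == null as a match, like SQL's <=> operator.
    static <T> List<T> leftAntiJoin(List<T> left, List<T> right) {
        List<T> out = new ArrayList<>();
        for (T l : left) {
            boolean matched = false;
            for (T r : right) {
                if (Objects.equals(l, r)) {
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.add(l);
            }
        }
        return out;
    }

    // EXCEPT DISTINCT modeled as DISTINCT over the left anti join.
    // LinkedHashSet removes duplicates while preserving first-seen order.
    static <T> List<T> exceptDistinct(List<T> left, List<T> right) {
        return new ArrayList<>(new LinkedHashSet<>(leftAntiJoin(left, right)));
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 2, 3, 3, 4);
        List<Integer> right = Arrays.asList(1, 3);

        // Anti join alone keeps the duplicate, matching the old behavior.
        System.out.println(leftAntiJoin(left, right));   // [2, 2, 4]
        // Adding DISTINCT gives the EXCEPT DISTINCT result.
        System.out.println(exceptDistinct(left, right)); // [2, 4]
    }
}
```

The anti join on its own reproduces the pre-PR result with the duplicate `2`; the `DISTINCT` step is what makes the output match `EXCEPT DISTINCT`. Because the comparison is null-safe, a null row on the left is removed when the right side also contains a null.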
author: gatorsmile <gatorsmile@gmail.com> 2016-04-29 15:30:36 +0800
committer: Wenchen Fan <wenchen@databricks.com> 2016-04-29 15:30:36 +0800
commit: 222dcf79377df33007d7a9780dafa2c740dbe6a3 (patch)
tree: e251b64b68f42d99d2de4ed96b95ca0b0ff1419c /sql/core/src/test/java
parent: e249e6f8b551614c82cd62e827c3647166e918e3 (diff)
download: spark-222dcf79377df33007d7a9780dafa2c740dbe6a3.tar.gz
spark-222dcf79377df33007d7a9780dafa2c740dbe6a3.tar.bz2
spark-222dcf79377df33007d7a9780dafa2c740dbe6a3.zip
1 files changed, 1 insertions, 1 deletions
diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
index 5abd62cbc2..f1b1c22e4a 100644
--- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
+++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java
@@ -291,7 +291,7 @@ public class JavaDatasetSuite implements Serializable {
       unioned.collectAsList());
 
     Dataset<String> subtracted = ds.except(ds2);
-    Assert.assertEquals(Arrays.asList("abc", "abc"), subtracted.collectAsList());
+    Assert.assertEquals(Arrays.asList("abc"), subtracted.collectAsList());
   }
 
   private static <T> Set<T> toSet(List<T> records) {