[SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join - spark


author	gatorsmile <gatorsmile@gmail.com>	2016-04-29 15:30:36 +0800
committer	Wenchen Fan <wenchen@databricks.com>	2016-04-29 15:30:36 +0800
commit	222dcf79377df33007d7a9780dafa2c740dbe6a3 (patch)
tree	e251b64b68f42d99d2de4ed96b95ca0b0ff1419c /sql/hive-thriftserver/src
parent	e249e6f8b551614c82cd62e827c3647166e918e3 (diff)

[SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join

#### What changes were proposed in this pull request?

Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).

```SQL
SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2
==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
```

Note:
1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL.
2. This rule has to be applied after de-duplicating the attributes; otherwise, the generated join conditions will be incorrect.

This PR also corrects the existing behavior in Spark. Before this PR, `except` kept duplicate rows from the left side, as the following test shows:

```scala
test("except") {
  val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
  val df_right = Seq(1, 3).toDF("id")

  checkAnswer(
    df_left.except(df_right),
    Row(2) :: Row(2) :: Row(4) :: Nil
  )
}
```

After this PR, the result is corrected so that it strictly follows the SQL semantics of `Except Distinct`.

#### How was this patch tested?

Modified and added a few test cases to verify the optimization rule and the results of operators.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12736 from gatorsmile/exceptByAntiJoin.
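The rewrite described in the commit message above can be illustrated without Spark. The following is a minimal sketch on plain Scala collections, not the optimizer rule itself: the object `ExceptByAntiJoinSketch` and its methods `leftAntiJoin` and `exceptDistinct` are hypothetical names introduced here for illustration. It assumes that ordinary Scala equality can stand in for the null-safe `<=>` comparison, with `None` modelling SQL `NULL` (since `None == None` is true).

```scala
// Illustrative sketch only: models EXCEPT DISTINCT as DISTINCT over a
// left anti join, using plain Scala collections instead of Spark plans.
object ExceptByAntiJoinSketch {

  // Left anti join: keep every left row that has no equal row on the right.
  // Duplicates on the left side are preserved at this stage.
  def leftAntiJoin[A](left: Seq[A], right: Seq[A]): Seq[A] = {
    val rightSet = right.toSet
    left.filterNot(rightSet.contains)
  }

  // EXCEPT DISTINCT: de-duplicate the output of the anti join.
  def exceptDistinct[A](left: Seq[A], right: Seq[A]): Seq[A] =
    leftAntiJoin(left, right).distinct

  def main(args: Array[String]): Unit = {
    val left = Seq(1, 2, 2, 3, 3, 4)
    val right = Seq(1, 3)

    // Anti join alone keeps the duplicate 2.
    println(leftAntiJoin(left, right))   // List(2, 2, 4)

    // With DISTINCT on top, each remaining value appears once.
    println(exceptDistinct(left, right)) // List(2, 4)

    // None stands in for NULL: it matches another None, mirroring <=>.
    val withNulls = Seq[Option[Int]](Some(1), None, None, Some(2))
    println(exceptDistinct(withNulls, Seq[Option[Int]](None))) // List(Some(1), Some(2))
  }
}
```

With the inputs from the test in the commit message, the anti join alone reproduces the old `2, 2, 4` output, while adding the distinct step gives `2, 4`, which is why the rule applies only to EXCEPT DISTINCT and not to EXCEPT ALL.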

Diffstat (limited to 'sql/hive-thriftserver/src')

0 files changed, 0 insertions, 0 deletions

