[SPARK-13977] [SQL] Brings back Shuffled hash join

## What changes were proposed in this pull request?

ShuffledHashJoin (including its outer-join variant) was removed in 1.6 in favor of SortMergeJoin, which is more robust and also fast.

ShuffledHashJoin is still useful in this case:

1) one table is much smaller than the other, so the cost of building a hash table on the smaller table is lower than the cost of sorting the larger table;
2) any partition of the small table fits in memory.

This PR brings back ShuffledHashJoin, basically reverting #9645 and fixing the conflicts. It also merges outer join and left-semi join into the same class.

This PR does not implement full outer join, because that would not be efficient (it requires building a hash table on both sides).

A simple benchmark (one table 5x smaller than the other) shows that ShuffledHashJoin can be 2x faster than SortMergeJoin.

## How was this patch tested?

Added new unit tests for ShuffledHashJoin.

Author: Davies Liu <davies@databricks.com>

Closes #11788 from davies/shuffle_join.
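To illustrate the trade-off described in the commit message, here is a minimal, Spark-free sketch of a per-partition hash join. It is not the code from this PR: `HashJoinSketch` and `hashJoin` are hypothetical names, and plain Scala collections stand in for Spark's row iterators and hash relations. The point is only that the small side is materialized once into a hash table, while the large side is streamed in a single pass with no sort.

```scala
// Hypothetical sketch, not Spark's ShuffledHashJoin implementation.
object HashJoinSketch {

  /** Inner-joins two co-partitioned inputs on a key.
    * `build` is the smaller side and must fit in memory;
    * `probe` is the larger side and is only iterated once.
    */
  def hashJoin[K, A, B](build: Seq[(K, A)], probe: Iterator[(K, B)]): Iterator[(K, A, B)] = {
    // Build phase: hash the smaller side by join key.
    val table: Map[K, Seq[A]] =
      build.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }

    // Probe phase: stream the larger side and look up matches; no sorting needed.
    probe.flatMap { case (k, b) =>
      table.getOrElse(k, Seq.empty).iterator.map(a => (k, a, b))
    }
  }

  def main(args: Array[String]): Unit = {
    val small = Seq(1 -> "a", 2 -> "b", 2 -> "c")
    val large = Iterator(2 -> 20.0, 3 -> 30.0, 1 -> 10.0, 2 -> 21.0)
    hashJoin(small, large).foreach(println)
  }
}
```

A sort-merge join would instead sort both inputs by key before merging, which is why the hash-based approach can win when the build side is much smaller than the probe side.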
author: Davies Liu <davies@databricks.com> 2016-03-18 10:32:53 -0700
committer: Davies Liu <davies.liu@gmail.com> 2016-03-18 10:32:53 -0700
commit: 9c23c818ca0175c8f2a4a66eac261ec251d27c97 (patch)
tree: 33d8ebbcf0821b332358658c0be44e1b83ee8eda /sql/hive
parent: 14c7236dc63fe362f052175886e9ad700419bc63 (diff)
download: spark-9c23c818ca0175c8f2a4a66eac261ec251d27c97.tar.gz
spark-9c23c818ca0175c8f2a4a66eac261ec251d27c97.tar.bz2
spark-9c23c818ca0175c8f2a4a66eac261ec251d27c97.zip
1 files changed, 1 insertions, 1 deletions
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala
index 1468be4670..151aacbdd1 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala
@@ -230,7 +230,7 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton {
       assert(bhj.isEmpty, "BroadcastHashJoin still planned even though it is switched off")
 
       val shj = df.queryExecution.sparkPlan.collect {
-        case j: LeftSemiJoinHash => j
+        case j: ShuffledHashJoin => j
       }
       assert(shj.size === 1,
         "LeftSemiJoinHash should be planned when BroadcastHashJoin is turned off")