aboutsummaryrefslogtreecommitdiff
path: root/project
diff options
context:
space:
mode:
authorDavies Liu <davies@databricks.com>2016-02-26 09:58:05 -0800
committerDavies Liu <davies.liu@gmail.com>2016-02-26 09:58:05 -0800
commit6df1e55a6594ae4bc7882f44af8d230aad9489b4 (patch)
tree296c8e72e5b383f0647d9bd7b31853bdb5adc82d /project
parent727e78014fd4957e477d62adc977fa4da3e3455d (diff)
downloadspark-6df1e55a6594ae4bc7882f44af8d230aad9489b4.tar.gz
spark-6df1e55a6594ae4bc7882f44af8d230aad9489b4.tar.bz2
spark-6df1e55a6594ae4bc7882f44af8d230aad9489b4.zip
[SPARK-12313] [SQL] improve performance of BroadcastNestedLoopJoin
## What changes were proposed in this pull request? Currently, BroadcastNestedLoopJoin is implemented for worst case, it's too slow, very easy to hang forever. This PR will create fast path for some joinType and buildSide, also improve the worst case (will use much less memory than before). Before this PR, one task requires O(N*K) + O(K) in worst cases, N is number of rows from one partition of streamed table, it could hang the job (because of GC). In order to workaround this for InnerJoin, we have to disable auto-broadcast, switch to CartesianProduct: This could be workaround for InnerJoin, see https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html In this PR, we will have fast path for these joins : InnerJoin with BuildLeft or BuildRight LeftOuterJoin with BuildRight RightOuterJoin with BuildLeft LeftSemi with BuildRight These fast paths are all stream based (take one pass on streamed table), required O(1) memory. All other join types and build types will take two pass on streamed table, one pass to find the matched rows that includes streamed part, which require O(1) memory, another pass to find the rows from build table that does not have a matched row from streamed table, which required O(K) memory, K is the number rows from build side, one bit per row, should be much smaller than the memory for broadcast. The following join types work in this way: LeftOuterJoin with BuildLeft RightOuterJoin with BuildRight FullOuterJoin with BuildLeft or BuildRight LeftSemi with BuildLeft This PR also added tests for all the join types for BroadcastNestedLoopJoin. After this PR, for InnerJoin with one small table, BroadcastNestedLoopJoin should be faster than CartesianProduct, we don't need that workaround anymore. ## How was the this patch tested? Added unit tests. Author: Davies Liu <davies@databricks.com> Closes #11328 from davies/nested_loop.
Diffstat (limited to 'project')
0 files changed, 0 insertions, 0 deletions