author    tianyi <tianyi@asiainfo-linkage.com>          2014-12-16 15:22:29 -0800
committer Michael Armbrust <michael@databricks.com>     2014-12-16 15:22:29 -0800
commit    30f6b85c816d1ef611a7be071af0053d64b6fe9e (patch)
tree      31407834a106ccfe88bf6be7e2fdb6b2836b15e9 /LICENSE
parent    ea1315e3e26507c8e1cab877cec5fe69c2899ae8 (diff)
[SPARK-4483][SQL] Reduce memory costs during HashOuterJoin
In `HashOuterJoin.scala`, Spark reads and materializes data from both sides of the join before zipping them together, which wastes memory. This patch reads data from only one side, puts it into a hash map, and then generates each `JoinedRow` by streaming the rows from the other side one by one.
Currently we can only apply this optimization to `left outer join` and `right outer join`. `full outer join` will be handled in a separate issue.
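The build-and-stream pattern described above can be sketched in plain Scala. This is a minimal illustration, not Spark's actual implementation: the object and method names (`HashJoinSketch`, `leftOuterJoin`), the `Row` alias, and the `rightWidth` parameter are all hypothetical, and standard collections stand in for Spark's internal row types.

```scala
object HashJoinSketch {
  type Row = Seq[Any]  // stand-in for Spark's internal row type

  // Streams `left` against a hash map built from `right`, so only the
  // (ideally smaller) right side is ever held in memory.
  def leftOuterJoin(
      left: Iterator[Row],
      right: Iterator[Row],
      leftKey: Row => Any,
      rightKey: Row => Any,
      rightWidth: Int): Iterator[Row] = {
    // Build phase: materialize only the right side into a hash map.
    val hashed: Map[Any, Seq[Row]] = right.toSeq.groupBy(rightKey)
    val nullRow: Row = Seq.fill(rightWidth)(null)
    // Stream phase: join each left row as it arrives; the left side
    // is never buffered.
    left.flatMap { l =>
      val k = leftKey(l)
      // SQL semantics: a null key matches nothing, but the left row
      // must still be emitted (cf. commit d2f94d7 in this PR).
      if (k == null) Seq(l ++ nullRow)
      else hashed.get(k) match {
        case Some(rows) => rows.map(r => l ++ r)
        case None       => Seq(l ++ nullRow)
      }
    }
  }
}
```

The memory saving comes from the asymmetry: the build side is bounded by the hash map, while the probe side flows through as an iterator, which matches the roughly 4x drop in memory usage reported below.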
Benchmark setup:
table test_csv contains 1 million records
table dim_csv contains 10 thousand records
SQL:
`select * from test_csv a left outer join dim_csv b on a.key = b.key`
The results are:
master:
```
CSV: 12671 ms
CSV: 9021 ms
CSV: 9200 ms
Current Mem Usage:787788984
```
after patch:
```
CSV: 10382 ms
CSV: 7543 ms
CSV: 7469 ms
Current Mem Usage:208145728
```
Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes #3375 from tianyi/SPARK-4483 and squashes the following commits:
72a8aec [tianyi] avoid having mutable state stored inside of the task
99c5c97 [tianyi] performance optimization
d2f94d7 [tianyi] fix bug: missing output when the join-key is null.
2be45d1 [tianyi] fix spell bug
1f2c6f1 [tianyi] remove commented codes
a676de6 [tianyi] optimize some codes
9e7d5b5 [tianyi] remove commented old codes
838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin
Diffstat (limited to 'LICENSE')
0 files changed, 0 insertions, 0 deletions