author    tianyi <tianyi@asiainfo-linkage.com>          2014-12-16 15:22:29 -0800
committer Michael Armbrust <michael@databricks.com>     2014-12-16 15:22:29 -0800
commit    30f6b85c816d1ef611a7be071af0053d64b6fe9e (patch)
tree      31407834a106ccfe88bf6be7e2fdb6b2836b15e9 /LICENSE
parent    ea1315e3e26507c8e1cab877cec5fe69c2899ae8 (diff)
[SPARK-4483][SQL] Reduce memory costs during HashOuterJoin
In `HashOuterJoin.scala`, Spark reads and materializes data from both sides of the join before zipping them together, which wastes memory. This patch reads data from only one side, puts it into a hash map, and then generates each `JoinedRow` by streaming the rows from the other side one by one.
Currently we can only apply this optimization to `left outer join` and `right outer join`. `full outer join` will be handled in a separate issue.
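The build-and-stream pattern described above can be sketched in plain Scala. This is a minimal illustration, not Spark's actual implementation: the object and method names (`HashJoinSketch`, `leftOuterJoin`), the `Row` alias, and the `rightWidth` parameter are all hypothetical, and standard collections stand in for Spark's internal row types.

```scala
object HashJoinSketch {
  type Row = Seq[Any]  // stand-in for Spark's internal row type

  // Streams `left` against a hash map built from `right`, so only the
  // (ideally smaller) right side is ever held in memory.
  def leftOuterJoin(
      left: Iterator[Row],
      right: Iterator[Row],
      leftKey: Row => Any,
      rightKey: Row => Any,
      rightWidth: Int): Iterator[Row] = {
    // Build phase: materialize only the right side into a hash map.
    val hashed: Map[Any, Seq[Row]] = right.toSeq.groupBy(rightKey)
    val nullRow: Row = Seq.fill(rightWidth)(null)
    // Stream phase: join each left row as it arrives; the left side
    // is never buffered.
    left.flatMap { l =>
      val k = leftKey(l)
      // SQL semantics: a null key matches nothing, but the left row
      // must still be emitted (cf. commit d2f94d7 in this PR).
      if (k == null) Seq(l ++ nullRow)
      else hashed.get(k) match {
        case Some(rows) => rows.map(r => l ++ r)
        case None       => Seq(l ++ nullRow)
      }
    }
  }
}
```

The memory saving comes from the asymmetry: the build side is bounded by the hash map, while the probe side flows through as an iterator, which matches the roughly 4x drop in memory usage reported below.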
Benchmark setup:
table test_csv contains 1 million records
table dim_csv contains 10 thousand records
SQL:
`select * from test_csv a left outer join dim_csv b on a.key = b.key`
The results are:
master:
```
CSV: 12671 ms
CSV: 9021 ms
CSV: 9200 ms
Current Mem Usage:787788984
```
after patch:
```
CSV: 10382 ms
CSV: 7543 ms
CSV: 7469 ms
Current Mem Usage:208145728
```
Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes #3375 from tianyi/SPARK-4483 and squashes the following commits:
72a8aec [tianyi] avoid having mutable state stored inside of the task
99c5c97 [tianyi] performance optimization
d2f94d7 [tianyi] fix bug: missing output when the join-key is null.
2be45d1 [tianyi] fix spell bug
1f2c6f1 [tianyi] remove commented codes
a676de6 [tianyi] optimize some codes
9e7d5b5 [tianyi] remove commented old codes
838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin
Diffstat (limited to 'LICENSE')
0 files changed, 0 insertions, 0 deletions