diff options
author | Josh Rosen <joshrosen@databricks.com> | 2015-07-19 23:41:28 -0700 |
---|---|---|
committer | Reynold Xin <rxin@databricks.com> | 2015-07-19 23:41:28 -0700 |
commit | 79ec07290d0b4d16f1643af83824d926304c8f46 (patch) | |
tree | 34a606dca8ff1f603adc4f96f74daaa224c150d8 /data/mllib/sample_svm_data.txt | |
parent | 972d8900a1e2430d172968b11fdea14b289d7d4d (diff) | |
download | spark-79ec07290d0b4d16f1643af83824d926304c8f46.tar.gz spark-79ec07290d0b4d16f1643af83824d926304c8f46.tar.bz2 spark-79ec07290d0b4d16f1643af83824d926304c8f46.zip |
[SPARK-9023] [SQL] Efficiency improvements for UnsafeRows in Exchange
This pull request aims to improve the performance of SQL's Exchange operator when shuffling UnsafeRows. It also makes several general efficiency improvements to Exchange.
Key changes:
- When performing hash partitioning, the old Exchange projected the partitioning columns into a new row then passed a `(partitioningColumRow: InternalRow, row: InternalRow)` pair into the shuffle. This is very inefficient because it ends up redundantly serializing the partitioning columns only to immediately discard them after the shuffle. After this patch's changes, Exchange now shuffles `(partitionId: Int, row: InternalRow)` pairs. This still isn't optimal, since we're still shuffling extra data that we don't need, but it's significantly more efficient than the old implementation; in the future, we may be able to further optimize this once we implement a new shuffle write interface that accepts non-key-value-pair inputs.
- Exchange's `compute()` method has been significantly simplified; the new code has less duplication and thus is easier to understand.
- When the Exchange's input operator produces UnsafeRows, Exchange will use a specialized `UnsafeRowSerializer` to serialize these rows. This serializer is significantly more efficient since it simply copies the UnsafeRow's underlying bytes. Note that this approach does not work for UnsafeRows that use the ObjectPool mechanism; I did not add support for this because we are planning to remove ObjectPool in the next few weeks.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #7456 from JoshRosen/unsafe-exchange and squashes the following commits:
7e75259 [Josh Rosen] Fix cast in SparkSqlSerializer2Suite
0082515 [Josh Rosen] Some additional comments + small cleanup to remove an unused parameter
a27cfc1 [Josh Rosen] Add missing newline
741973c [Josh Rosen] Add simple test of UnsafeRow shuffling in Exchange.
359c6a4 [Josh Rosen] Remove println() and add comments
93904e7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-exchange
8dd3ff2 [Josh Rosen] Exchange outputs UnsafeRows when its child outputs them
dd9c66d [Josh Rosen] Fix for copying logic
035af21 [Josh Rosen] Add logic for choosing when to use UnsafeRowSerializer
7876f31 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-shuffle
cbea80b [Josh Rosen] Add UnsafeRowSerializer
0f2ac86 [Josh Rosen] Import ordering
3ca8515 [Josh Rosen] Big code simplification in Exchange
3526868 [Josh Rosen] Iniitial cut at removing shuffle on KV pairs
Diffstat (limited to 'data/mllib/sample_svm_data.txt')
0 files changed, 0 insertions, 0 deletions