author | Cheng Lian <lian.cs.zju@gmail.com> | 2014-04-02 12:47:22 -0700 |
---|---|---|
committer | Patrick Wendell <pwendell@gmail.com> | 2014-04-02 12:47:22 -0700 |
commit | 1faa57971192226837bea32eb29eae5bfb425a7e (patch) | |
tree | bfbe41e2007801ebd6f62f7b6d51d8a07d51ecd1 /mllib/src/test/java/org | |
parent | 78236334e4ca7518b6d7d9b38464dbbda854a777 (diff) | |
[SPARK-1371][WIP] Compression support for Spark SQL in-memory columnar storage
JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)
(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)
This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:
* `CompressionScheme`
Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:
* `RunLengthEncoding`
* `DictionaryEncoding`
Algorithms to be implemented include:
* `BooleanBitSet`
* `IntDelta`
* `LongDelta`
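To make the two implemented schemes concrete, here is a minimal Python sketch of the ideas behind run-length encoding and dictionary encoding. It is an illustration only: the function names (`rle_encode`, `rle_decode`, `dict_encode`, `dict_decode`) are hypothetical and are not the `Encoder`/`Decoder` API from this PR, and it works on Python lists rather than byte buffers.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def rle_encode(values: Sequence[Hashable]) -> List[Tuple[Hashable, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Tuple[Hashable, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            # Same value as the current run: extend it.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            # A different value starts a new run.
            runs.append((v, 1))
    return runs


def rle_decode(runs: Sequence[Tuple[Hashable, int]]) -> List[Hashable]:
    """Expand (value, run_length) pairs back into the original sequence."""
    out: List[Hashable] = []
    for value, count in runs:
        out.extend([value] * count)
    return out


def dict_encode(values: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Replace each value with its index in a dictionary of distinct values."""
    dictionary: List[Hashable] = []
    index: Dict[Hashable, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def dict_decode(dictionary: Sequence[Hashable], codes: Sequence[int]) -> List[Hashable]:
    """Look each code up in the dictionary to recover the original values."""
    return [dictionary[c] for c in codes]


column = ["a", "a", "a", "b", "b", "a"]
runs = rle_encode(column)
dictionary, codes = dict_encode(column)
```

Run-length encoding tends to pay off on columns with long runs of repeated values, while dictionary encoding tends to pay off on columns with few distinct values; both decoders reproduce the input exactly.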
* `CompressibleColumnBuilder`
A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. For each column, the `CompressionScheme` with the lowest compression ratio (compressed size relative to uncompressed size) is chosen, based on statistics gathered while elements are appended to the `ColumnBuilder`. However, if no `CompressionScheme` achieves a compression ratio below 80%, the column is left uncompressed to save CPU time.
Memory layout of the final byte buffer is shown below:
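The selection rule can be sketched as follows. This is a rough Python illustration under stated assumptions, not the code in this PR: `choose_scheme`, the `"NoCompression"` placeholder, and the idea of passing in pre-computed size estimates are all hypothetical, and the ratio is assumed to mean compressed size divided by uncompressed size, so lower is better.

```python
from typing import Dict


def choose_scheme(
    uncompressed_size: int,
    compressed_sizes: Dict[str, int],
    threshold: float = 0.8,
) -> str:
    """Pick the scheme with the lowest compression ratio, or no compression.

    compressed_sizes maps a scheme name to its estimated compressed size in
    bytes. "NoCompression" is a hypothetical placeholder name.
    """
    if uncompressed_size <= 0 or not compressed_sizes:
        return "NoCompression"
    # The smallest compressed size gives the lowest ratio for a fixed input.
    best = min(compressed_sizes, key=compressed_sizes.get)
    ratio = compressed_sizes[best] / uncompressed_size
    # Only compress when the ratio is strictly below the threshold.
    return best if ratio < threshold else "NoCompression"


good = choose_scheme(1000, {"RunLengthEncoding": 300, "DictionaryEncoding": 450})
poor = choose_scheme(1000, {"RunLengthEncoding": 900, "DictionaryEncoding": 850})
```

Here `good` is `"RunLengthEncoding"` (ratio 0.3) and `poor` is `"NoCompression"`, since the best available ratio of 0.85 does not beat the 80% cutoff.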
```
  .--------------------------- Column type ID (4 bytes)
  |   .----------------------- Null count N (4 bytes)
  |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
  |   |   |     .------------- Compression scheme ID (4 bytes)
  |   |   |     |   .--------- Compressed non-null elements
  V   V   V     V   V
+---+---+-----+---+---------+
|   |   | ... |   | ... ... |
+---+---+-----+---+---------+
 \-----------/ \-----------/
    header         body
```
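The layout above can be exercised with a small Python sketch that packs and unpacks a buffer of this shape. The helper names `pack_column` and `unpack_column` are hypothetical, and big-endian byte order is an assumption made here for illustration; the description does not state which byte order the actual buffers use.

```python
import struct
from typing import List, Sequence, Tuple


def pack_column(
    column_type_id: int,
    null_positions: Sequence[int],
    scheme_id: int,
    compressed_body: bytes,
) -> bytes:
    """Build header (type ID, null count, null positions) followed by body."""
    # Header: column type ID (4 bytes) and null count N (4 bytes).
    header = struct.pack(">ii", column_type_id, len(null_positions))
    # Header: null positions (4 x N bytes, empty if there are no nulls).
    header += b"".join(struct.pack(">i", p) for p in null_positions)
    # Body: compression scheme ID (4 bytes), then the compressed elements.
    body = struct.pack(">i", scheme_id) + compressed_body
    return header + body


def unpack_column(buf: bytes) -> Tuple[int, List[int], int, bytes]:
    """Split a buffer produced by pack_column back into its four parts."""
    column_type_id, null_count = struct.unpack_from(">ii", buf, 0)
    null_positions = [
        struct.unpack_from(">i", buf, 8 + 4 * i)[0] for i in range(null_count)
    ]
    offset = 8 + 4 * null_count
    (scheme_id,) = struct.unpack_from(">i", buf, offset)
    return column_type_id, null_positions, scheme_id, bytes(buf[offset + 4:])


buf = pack_column(1, [2, 5], 0, b"\x01\x02")
parts = unpack_column(buf)
```

With two null positions and a two-byte body, `buf` is 4 + 4 + 8 + 4 + 2 = 22 bytes long, and `unpack_column` recovers the same four parts.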
* `CompressibleColumnAccessor`
A stackable `ColumnAccessor` trait used to iterate over (possibly) compressed column data.
* `ColumnStats`
Used to collect statistical information while loading data into an in-memory columnar table. Optimizations like partition pruning rely on this information.
Strictly speaking, `ColumnStats`-related code is not part of the compression support. It is included in this PR to validate the row-based API design (which is used to avoid boxing/unboxing costs whenever possible).
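As an illustration of how per-column statistics can support partition pruning, here is a minimal Python sketch. The class `IntColumnStats` and its methods `gather` and `may_contain` are hypothetical names for this example, and tracking the lower bound, upper bound, and null count is an assumption about which statistics are useful, not a description of the `ColumnStats` code in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntColumnStats:
    """Running statistics for one integer column of one partition."""

    lower: Optional[int] = None
    upper: Optional[int] = None
    null_count: int = 0

    def gather(self, value: Optional[int]) -> None:
        """Update the statistics as a value is appended to the column."""
        if value is None:
            self.null_count += 1
            return
        if self.lower is None or value < self.lower:
            self.lower = value
        if self.upper is None or value > self.upper:
            self.upper = value

    def may_contain(self, target: int) -> bool:
        """Return False only if target cannot occur in this partition."""
        if self.lower is None or self.upper is None:
            # No non-null values were seen, so nothing can match.
            return False
        return self.lower <= target <= self.upper


stats = IntColumnStats()
for v in [7, None, 3, 12]:
    stats.gather(v)
```

A query filtering on `column = 20` could skip this partition because `stats.may_contain(20)` is `False`, while `column = 5` would still require a scan, since the bounds alone cannot rule it out.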
A major refactoring change since PR #205 is:
* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #285 from liancheng/memColumnarCompression and squashes the following commits:
ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
Diffstat (limited to 'mllib/src/test/java/org')
0 files changed, 0 insertions, 0 deletions