author | Nong Li <nong@databricks.com> | 2016-01-12 18:21:04 -0800 |
---|---|---|
committer | Reynold Xin <rxin@databricks.com> | 2016-01-12 18:21:04 -0800 |
commit | 9247084962259ebbbac4c5a80a6ccb271776f019 (patch) | |
tree | c36d488c2890ec74a9ab22bf6fa753a25e1f4e2f /dev/audit-release/maven_app_core/src | |
parent | 4f60651cbec1b4c9cc2e6d832ace77e89a233f3a (diff) | |
[SPARK-12785][SQL] Add ColumnarBatch, an in memory columnar format for execution.
An efficient in-memory columnar format offers many potential benefits as an alternative
to UnsafeRow. This patch introduces ColumnarBatch/ColumnVector, which starts this effort. The
remaining implementation can be done in follow-up patches.
As stated in the JIRA, there are useful external components that operate on memory in a
simple columnar format. ColumnarBatch would serve that purpose and could serve as a
zero-serialization/zero-copy exchange for this use case.
This patch supports storing the underlying data either on heap or off heap. On heap runs a bit
faster, but we would need off heap for zero-copy exchanges. Currently, the memory mode is hidden
behind one interface (ColumnVector).
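
A minimal sketch of how one interface can hide the on-heap/off-heap choice. All names here
(IntColumn, OnHeapIntColumn, OffHeapIntColumn, MemoryModeDemo) are hypothetical and are not
the API added by this patch. The off-heap variant uses a direct java.nio.ByteBuffer as a
stand-in for whatever native memory access the real implementation uses.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical interface for illustration only; not Spark's ColumnVector API.
interface IntColumn {
    int capacity();
    void putInt(int rowId, int value);
    int getInt(int rowId);
}

// On-heap variant: values live in a plain Java array managed by the GC.
final class OnHeapIntColumn implements IntColumn {
    private final int[] data;

    OnHeapIntColumn(int capacity) { this.data = new int[capacity]; }

    public int capacity() { return data.length; }
    public void putInt(int rowId, int value) { data[rowId] = value; }
    public int getInt(int rowId) { return data[rowId]; }
}

// Off-heap variant: values live in a direct buffer outside the Java heap.
// Assumption: a direct ByteBuffer stands in for raw native memory here.
final class OffHeapIntColumn implements IntColumn {
    private final ByteBuffer data;
    private final int capacity;

    OffHeapIntColumn(int capacity) {
        this.capacity = capacity;
        this.data = ByteBuffer.allocateDirect(capacity * Integer.BYTES)
                              .order(ByteOrder.nativeOrder());
    }

    public int capacity() { return capacity; }
    // Absolute put/get: the byte offset is rowId times the value width.
    public void putInt(int rowId, int value) { data.putInt(rowId * Integer.BYTES, value); }
    public int getInt(int rowId) { return data.getInt(rowId * Integer.BYTES); }
}

final class MemoryModeDemo {
    // Callers pick a memory mode once, then only see the interface.
    static IntColumn allocate(int capacity, boolean offHeap) {
        return offHeap ? new OffHeapIntColumn(capacity) : new OnHeapIntColumn(capacity);
    }

    // The same loop works for either memory mode.
    static long sum(IntColumn col, int numRows) {
        long total = 0;
        for (int i = 0; i < numRows; i++) {
            total += col.getInt(i);
        }
        return total;
    }

    public static void main(String[] args) {
        for (boolean offHeap : new boolean[] {false, true}) {
            IntColumn col = allocate(4, offHeap);
            for (int i = 0; i < col.capacity(); i++) {
                col.putInt(i, i * 10);
            }
            System.out.println((offHeap ? "off heap" : "on heap") + " sum = " + sum(col, 4));
        }
    }
}
```

Code that consumes the column depends only on the interface, so switching memory mode is a
one-line change at allocation time.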
This differs from Parquet or the existing columnar cache because this is *not* intended to be used
as a storage format. The focus is entirely on CPU efficiency, as we expect to have only one of these
batches in memory per task. The values are laid out as dense arrays of the value type.
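
To illustrate the "dense arrays of the value type" layout, here is a small sketch of a batch
holding one int column and one double column. TinyBatch and DenseLayoutDemo are hypothetical
names chosen for this example; they do not reflect the classes in this patch.

```java
// Hypothetical two-column batch; each column is one dense primitive array.
final class TinyBatch {
    final int[] ids;        // column 0: one int per row, no per-row objects
    final double[] prices;  // column 1: one double per row
    private int numRows;

    TinyBatch(int capacity) {
        this.ids = new int[capacity];
        this.prices = new double[capacity];
    }

    // Appending a row writes one slot in each column array.
    void appendRow(int id, double price) {
        ids[numRows] = id;
        prices[numRows] = price;
        numRows++;
    }

    int numRows() { return numRows; }

    // Scanning a single column touches only that column's contiguous memory.
    double sumPrices() {
        double total = 0.0;
        for (int i = 0; i < numRows; i++) {
            total += prices[i];
        }
        return total;
    }
}

final class DenseLayoutDemo {
    public static void main(String[] args) {
        TinyBatch batch = new TinyBatch(4);
        batch.appendRow(1, 2.5);
        batch.appendRow(2, 1.5);
        System.out.println("rows = " + batch.numRows() + ", sum = " + batch.sumPrices());
    }
}
```

A tight loop over a primitive array like this is the kind of access pattern the CPU-efficiency
goal points at; no encoding or compression is applied, which fits a format meant for execution
rather than storage.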
Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>
Closes #10628 from nongli/spark-12635.
Diffstat (limited to 'dev/audit-release/maven_app_core/src')
0 files changed, 0 insertions, 0 deletions