[SPARK-12785][SQL] Add ColumnarBatch, an in memory columnar format for execution. - spark

author	Nong Li <nong@databricks.com>	2016-01-12 18:21:04 -0800
committer	Reynold Xin <rxin@databricks.com>	2016-01-12 18:21:04 -0800
commit	9247084962259ebbbac4c5a80a6ccb271776f019 (patch)
tree	c36d488c2890ec74a9ab22bf6fa753a25e1f4e2f /core
parent	4f60651cbec1b4c9cc2e6d832ace77e89a233f3a (diff)

[SPARK-12785][SQL] Add ColumnarBatch, an in memory columnar format for execution.

There are many potential benefits of having an efficient in-memory columnar format as an alternative to UnsafeRow. This patch introduces ColumnarBatch/ColumnVector, which starts this effort. The remaining implementation can be done in follow-up patches.

As stated in the JIRA, there are useful external components that operate on memory in a simple columnar format. ColumnarBatch would serve that purpose and could serve as a zero-serialization/zero-copy exchange for this use case.

This patch supports storing the underlying data either on heap or off heap. On heap runs a bit faster, but we would need off heap for zero-copy exchanges. Currently, this mode is hidden behind one interface (ColumnVector).

This differs from Parquet or the existing columnar cache because this is *not* intended to be used as a storage format. The focus is entirely on CPU efficiency, as we expect to have only one of these batches in memory per task. The layout of the values is just dense arrays of the value type.

Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>

Closes #10628 from nongli/spark-12635.
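
The layout described in the commit message (dense arrays of the value type, with on-heap and off-heap storage hidden behind one interface) can be illustrated with a minimal sketch. This is not the code from the patch: the names `ColumnSketch`, `IntColumn`, `OnHeapIntColumn`, and `OffHeapIntColumn` are hypothetical, and the off-heap variant uses a direct `ByteBuffer` as an assumed stand-in for whatever off-heap allocation the patch actually uses.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ColumnSketch {
    /** Hypothetical single interface that hides where the values live. */
    interface IntColumn {
        void putInt(int rowId, int value);
        int getInt(int rowId);
    }

    /** On-heap storage: a dense primitive array, one slot per row. */
    static final class OnHeapIntColumn implements IntColumn {
        private final int[] data;

        OnHeapIntColumn(int capacity) {
            data = new int[capacity];
        }

        public void putInt(int rowId, int value) {
            data[rowId] = value;
        }

        public int getInt(int rowId) {
            return data[rowId];
        }
    }

    /** Off-heap storage: the same dense layout in a direct buffer (assumption). */
    static final class OffHeapIntColumn implements IntColumn {
        private final ByteBuffer data;

        OffHeapIntColumn(int capacity) {
            data = ByteBuffer.allocateDirect(capacity * Integer.BYTES)
                             .order(ByteOrder.nativeOrder());
        }

        public void putInt(int rowId, int value) {
            // Absolute write at the byte offset of this row.
            data.putInt(rowId * Integer.BYTES, value);
        }

        public int getInt(int rowId) {
            return data.getInt(rowId * Integer.BYTES);
        }
    }

    /** A consumer written against the interface works with either storage mode. */
    static long sum(IntColumn col, int numRows) {
        long total = 0;
        for (int i = 0; i < numRows; i++) {
            total += col.getInt(i);
        }
        return total;
    }

    public static void main(String[] args) {
        int numRows = 4;
        IntColumn[] cols = { new OnHeapIntColumn(numRows), new OffHeapIntColumn(numRows) };
        for (IntColumn col : cols) {
            for (int i = 0; i < numRows; i++) {
                col.putInt(i, i * 10);
            }
            System.out.println(sum(col, numRows));
        }
    }
}
```

Because callers such as `sum` depend only on the interface, the storage mode can be switched without touching them, which is the property the commit message attributes to ColumnVector.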

Diffstat (limited to 'core')

0 files changed, 0 insertions, 0 deletions
