author    Nong Li <nong@databricks.com>        2016-03-04 15:15:48 -0800
committer Davies Liu <davies.liu@gmail.com>    2016-03-04 15:15:48 -0800
commit    a6e2bd31f52f9e9452e52ab5b846de3dee8b98a7
tree      3f87ce37436459955c19d55874af93c892099b7c /streaming/src
parent    5f42c28b119b79c0ea4910c478853d451cd1a967
[SPARK-13255] [SQL] Update vectorized reader to directly return ColumnarBatch instead of InternalRows.
## What changes were proposed in this pull request?
Currently, the Parquet reader returns rows one by one, which is bad for performance. This patch
updates the reader to return ColumnarBatches directly. This is enabled only with whole-stage
codegen, which is currently the only operator able to consume ColumnarBatches (instead of
rows). The current implementation is somewhat of a hack to get this working; these low-level
interfaces should be refactored further to support this more cleanly.
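To illustrate the idea, here is a minimal, hypothetical sketch (not Spark's actual `ColumnarBatch` API, and all names below are invented for illustration): a consumer that pulls one row object per record pays a per-row accessor/allocation cost, while a consumer handed the whole batch can loop over the primitive column arrays directly, which is the access pattern whole-stage codegen can exploit.

```java
// Hypothetical sketch of batch-at-a-time vs. row-at-a-time consumption.
// None of these classes are Spark's; they only illustrate the shape of the change.
public class BatchVsRow {
    // A minimal column batch: one primitive array per column.
    static final class ColumnBatch {
        final double[] values;
        ColumnBatch(double[] values) { this.values = values; }
        int numRows() { return values.length; }
        double getDouble(int row) { return values[row]; }
    }

    // A per-record row object, standing in for the row-by-row interface:
    // one allocation and one virtual accessor call per record.
    static final class Row {
        final double value;
        Row(double value) { this.value = value; }
        double getDouble() { return value; }
    }

    // Row-at-a-time: materialize a Row per record, then read through it.
    static double sumRowByRow(ColumnBatch batch) {
        double total = 0.0;
        for (int i = 0; i < batch.numRows(); i++) {
            Row row = new Row(batch.getDouble(i)); // per-row allocation
            total += row.getDouble();
        }
        return total;
    }

    // Batch-at-a-time: the consumer iterates the primitive array directly,
    // with no per-row objects in between.
    static double sumBatch(ColumnBatch batch) {
        double total = 0.0;
        for (double v : batch.values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        ColumnBatch batch = new ColumnBatch(new double[] {1.0, 2.0, 3.0});
        System.out.println(sumRowByRow(batch)); // prints 6.0
        System.out.println(sumBatch(batch));    // prints 6.0
    }
}
```

Both loops compute the same result; the difference is that the batch form exposes the columnar layout to the consumer, which is why only the whole-stage-codegen path can take advantage of it.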
## How was this patch tested?
```
Results:
TPCDS: Best/Avg Time(ms) Rate(M/s) Per Row(ns)
---------------------------------------------------------------------------------
q55 (before) 8897 / 9265 12.9 77.2
q55 5486 / 5753 21.0 47.6
```
Author: Nong Li <nong@databricks.com>
Closes #11435 from nongli/spark-13255.
Diffstat (limited to 'streaming/src')
0 files changed, 0 insertions, 0 deletions