author      Nong Li <nong@databricks.com>       2015-11-18 18:38:45 -0800
committer   Reynold Xin <rxin@databricks.com>   2015-11-18 18:38:45 -0800
commit      6d0848b53bbe6c5acdcf5c033cd396b1ae6e293d
tree        cf1c2b5184a996d4e931d1837dd7899199c2ba72 /pylintrc
parent      e61367b9f9bfc8e123369d55d7ca5925568b98a7
[SPARK-11787][SQL] Improve Parquet scan performance when using flat schemas.
This patch adds an alternative to the Parquet RecordReader from the parquet-mr project
that is much faster for flat schemas. Instead of using the general converter mechanism
from parquet-mr, it directly uses the lower-level APIs from parquet-column and a
custom RecordReader that assembles values directly into UnsafeRows.
The new path can be disabled via configuration and is used only for supported schemas.
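The idea above — decode whole column chunks, then assemble rows by position instead of invoking per-field converter callbacks — can be sketched in plain Scala. This is a hypothetical illustration only: `ColumnChunk` and `assemble` are made-up names, not the parquet-column or Spark APIs, and real rows would be UnsafeRows rather than arrays.

```scala
// Hypothetical sketch of direct columnar row assembly for a flat schema.
// Names (ColumnChunk, assemble) are illustrative, not from parquet-mr or Spark.
object FlatSchemaAssembly {
  // A decoded column chunk: flat schema, so no repetition/definition nesting.
  final case class ColumnChunk(values: Array[Long])

  // Assemble rows directly from column positions in one pass,
  // with no per-field converter callbacks in between.
  def assemble(cols: Seq[ColumnChunk]): Array[Array[Long]] = {
    val numRows = cols.head.values.length
    Array.tabulate(numRows) { r => cols.map(_.values(r)).toArray }
  }

  def main(args: Array[String]): Unit = {
    val c1 = ColumnChunk(Array(1L, 2L, 3L))
    val c2 = ColumnChunk(Array(10L, 20L, 30L))
    val rows = assemble(Seq(c1, c2))
    println(rows.map(_.mkString(",")).mkString(";")) // 1,10;2,20;3,30
  }
}
```

The point of the column-at-a-time shape is that the decode loop stays tight and branch-free per column, which is where the flat-schema speedup comes from.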
Using the TPC-DS store_sales table and summing an increasing number of columns,
the results are:

  1 column:   before 11.3M rows/second, after 18.2M rows/second
  2 columns:  before  7.2M rows/second, after 11.2M rows/second
  5 columns:  before  2.9M rows/second, after  4.5M rows/second
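For reference, the quoted throughput numbers work out to roughly a 1.5-1.6x speedup at each column count; a trivial check:

```scala
// Speedup ratios implied by the quoted rows/second figures above.
object Speedup {
  def ratio(before: Double, after: Double): Double = after / before

  def main(args: Array[String]): Unit = {
    println(f"1 column:  ${ratio(11.3, 18.2)}%.2fx") // ~1.61x
    println(f"2 columns: ${ratio(7.2, 11.2)}%.2fx")  // ~1.56x
    println(f"5 columns: ${ratio(2.9, 4.5)}%.2fx")   // ~1.55x
  }
}
```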
Author: Nong Li <nong@databricks.com>
Closes #9774 from nongli/parquet.
Diffstat (limited to 'pylintrc')
0 files changed, 0 insertions, 0 deletions