author    Davies Liu <davies@databricks.com>  2015-10-20 14:01:53 -0700
committer Davies Liu <davies.liu@gmail.com>  2015-10-20 14:01:53 -0700
commit    06e6b765d0c747b773d7f3be28ddb0543c955a1f (patch)
tree      13ba86c25a5471f429f0dcf2d7e37ace474a0233 /mllib
parent    67d468f8d9172569ec9846edc6432240547696dd (diff)
[SPARK-11149] [SQL] Improve cache performance for primitive types
This PR improves performance by:

1) Generating an Iterator that takes Iterator[CachedBatch] as input and calls the column accessors directly (unrolling the loop over columns), avoiding the expensive Iterator.flatMap.
2) Using Unsafe.getInt/getLong/getFloat/getDouble instead of ByteBuffer.getInt/getLong/getFloat/getDouble; the latter actually reads byte by byte.
3) Removing an unnecessary copy() in Coalesce(), which is not related to the memory cache but was found during benchmarking.

The following benchmark showed that we can speed up the columnar cache of ints by 2x.

```
path = '/opt/tpcds/store_sales/'
int_cols = ['ss_sold_date_sk', 'ss_sold_time_sk', 'ss_item_sk', 'ss_customer_sk']
df = sqlContext.read.parquet(path).select(int_cols).cache()
df.count()

t = time.time()
print df.select("*")._jdf.queryExecution().toRdd().count()
print time.time() - t
```

Author: Davies Liu <davies@databricks.com>

Closes #9145 from davies/byte_buffer.
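Point 2 concerns how each primitive value is assembled from the underlying buffer. As a rough analogy only (not the actual Spark/JVM code), the Python sketch below contrasts byte-by-byte assembly, which is effectively what ByteBuffer.getInt was doing here, with a single word-at-a-time read via struct.unpack_from, analogous to Unsafe.getInt; the buffer layout and helper names are hypothetical.

```python
import struct

def get_int_bytewise(buf, offset):
    # Assemble a big-endian unsigned 32-bit int one byte at a time,
    # analogous to a ByteBuffer.getInt implementation that reads byte by byte.
    return (buf[offset] << 24) | (buf[offset + 1] << 16) | \
           (buf[offset + 2] << 8) | buf[offset + 3]

def get_int_word(buf, offset):
    # Read the same 32-bit value in one call,
    # analogous to Unsafe.getInt fetching a whole word at once.
    return struct.unpack_from(">I", buf, offset)[0]

# A toy packed "column" of ints, standing in for a CachedBatch column buffer.
column = struct.pack(">III", 42, 7, 123456789)
rows = [get_int_word(column, 4 * i) for i in range(3)]
```

Both decoders return the same values; the speedup in the PR comes from replacing four byte-sized memory accesses per int with a single word-sized one.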
Diffstat (limited to 'mllib')
0 files changed, 0 insertions, 0 deletions