author    Davies Liu <davies@databricks.com>  2015-10-20 14:01:53 -0700
committer Davies Liu <davies.liu@gmail.com>  2015-10-20 14:01:53 -0700
commit    06e6b765d0c747b773d7f3be28ddb0543c955a1f (patch)
tree      13ba86c25a5471f429f0dcf2d7e37ace474a0233 /mllib
parent    67d468f8d9172569ec9846edc6432240547696dd (diff)
[SPARK-11149] [SQL] Improve cache performance for primitive types
This PR improves performance by:

1) Generating an Iterator that takes Iterator[CachedBatch] as input and calls the column accessors directly (unrolling the loop over columns), avoiding the expensive Iterator.flatMap.
2) Using Unsafe.getInt/getLong/getFloat/getDouble instead of ByteBuffer.getInt/getLong/getFloat/getDouble; the latter actually reads byte by byte.
3) Removing an unnecessary copy() in Coalesce(), which is not related to the memory cache but was found during benchmarking.

The following benchmark showed that we can speed up the columnar cache of ints by 2x.

```
path = '/opt/tpcds/store_sales/'
int_cols = ['ss_sold_date_sk', 'ss_sold_time_sk', 'ss_item_sk', 'ss_customer_sk']
df = sqlContext.read.parquet(path).select(int_cols).cache()
df.count()

t = time.time()
print df.select("*")._jdf.queryExecution().toRdd().count()
print time.time() - t
```

Author: Davies Liu <davies@databricks.com>

Closes #9145 from davies/byte_buffer.
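Point 2 concerns how each primitive value is assembled from the underlying buffer. As a rough analogy only (not the actual Spark/JVM code), the Python sketch below contrasts byte-by-byte assembly, which is effectively what ByteBuffer.getInt was doing here, with a single word-at-a-time read via struct.unpack_from, analogous to Unsafe.getInt; the buffer layout and helper names are hypothetical.

```python
import struct

def get_int_bytewise(buf, offset):
    # Assemble a big-endian unsigned 32-bit int one byte at a time,
    # analogous to a ByteBuffer.getInt implementation that reads byte by byte.
    return (buf[offset] << 24) | (buf[offset + 1] << 16) | \
           (buf[offset + 2] << 8) | buf[offset + 3]

def get_int_word(buf, offset):
    # Read the same 32-bit value in one call,
    # analogous to Unsafe.getInt fetching a whole word at once.
    return struct.unpack_from(">I", buf, offset)[0]

# A toy packed "column" of ints, standing in for a CachedBatch column buffer.
column = struct.pack(">III", 42, 7, 123456789)
rows = [get_int_word(column, 4 * i) for i in range(3)]
```

Both decoders return the same values; the speedup in the PR comes from replacing four byte-sized memory accesses per int with a single word-sized one.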
Diffstat (limited to 'mllib')
0 files changed, 0 insertions, 0 deletions