[SPARK-10917] [SQL] improve performance of complex type in columnar cache - spark

author	Davies Liu <davies@databricks.com>	2015-10-07 15:58:07 -0700
committer	Davies Liu <davies.liu@gmail.com>	2015-10-07 15:58:07 -0700
commit	075a0b658289608c8732e07e26e14d736e673ce9 (patch)
tree	91ab61c1f6cf7d9284c00f4e35037da7721c812a /mllib
parent	dd36ec6bc5844aaa045a4bd9ba49113528e1740c (diff)
download	spark-075a0b658289608c8732e07e26e14d736e673ce9.tar.gz spark-075a0b658289608c8732e07e26e14d736e673ce9.tar.bz2 spark-075a0b658289608c8732e07e26e14d736e673ce9.zip

[SPARK-10917] [SQL] improve performance of complex type in columnar cache

This PR improves the performance of complex types in the columnar cache by using UnsafeProjection instead of KryoSerializer.

A simple benchmark shows that this PR could improve the performance of scanning a cached table with complex columns by 15x (compared to Spark 1.5).

Here is the code used to benchmark:

```
# Build a table with struct (a), array (d), and map (e) columns.
from pyspark.sql import Row
df = sc.range(1<<23).map(lambda i: Row(a=Row(b=i, c=str(i)), d=list(range(10)), e=dict(zip(range(10), [str(i) for i in range(10)])))).toDF()
df.write.parquet("table")
```

```
import time
df = sqlContext.read.parquet("table")
df.cache()
df.count()  # materialize the columnar cache before timing
t = time.time()
print(df.select("*")._jdf.queryExecution().toRdd().count())
print(time.time() - t)  # elapsed seconds for the cached scan
```

Author: Davies Liu <davies@databricks.com>

Closes #8971 from davies/complex.

Diffstat (limited to 'mllib')

0 files changed, 0 insertions, 0 deletions

