path: root/mllib/src
author    Cheng Lian <lian@databricks.com>  2015-09-29 23:30:27 -0700
committer Cheng Lian <lian@databricks.com>  2015-09-29 23:30:27 -0700
commit    4d5a005b0d2591d4d57a19be48c4954b9f1434a9 (patch)
tree      2a2cb24819477e69c2d8f5c8a2a79a87b558859c /mllib/src
parent    c1ad373f26053e1906fce7681c03d130a642bf33 (diff)
[SPARK-10811] [SQL] Eliminates unnecessary byte array copying
When reading Parquet string and binary-backed decimal values, `Binary.getBytes` always returns a copied byte array, which is unnecessary. Since the underlying implementation of the `Binary` values there is guaranteed to be `ByteArraySliceBackedBinary`, and Parquet itself never reuses the underlying byte arrays, we can use `Binary.toByteBuffer.array()` to steal the underlying byte arrays without copying them. This brings a performance benefit when scanning Parquet string and binary-backed decimal columns. Note that this trick doesn't cover binary-backed decimals with precision greater than 18.

In my micro-benchmark, this brings a ~15% performance boost when scanning the TPC-DS `store_sales` table (scale factor 15).

Another minor optimization in this PR: `Decimal.toJavaBigDecimal` now constructs a Java `BigDecimal` directly, without constructing a Scala `BigDecimal` first. This brings another ~5% performance gain.

Author: Cheng Lian <lian@databricks.com>

Closes #8907 from liancheng/spark-10811/eliminate-array-copying.
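The zero-copy trick can be illustrated with plain `java.nio.ByteBuffer` — a minimal sketch of the idea, not the actual Parquet code path. A heap buffer created with `ByteBuffer.wrap` exposes its backing array via `array()` without copying, which is what `Binary.toByteBuffer.array()` relies on, whereas `Binary.getBytes` behaves like the defensive copy shown below:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ZeroCopySketch {
    public static void main(String[] args) {
        byte[] backing = {1, 2, 3, 4};

        // Wrapping does not copy: array() hands back the very same array,
        // analogous to Binary.toByteBuffer.array() on a slice-backed Binary.
        ByteBuffer buf = ByteBuffer.wrap(backing);
        System.out.println(buf.array() == backing);         // true: no copy

        // A defensive copy (what Binary.getBytes effectively does)
        // allocates a fresh array with the same contents.
        byte[] copied = Arrays.copyOf(backing, backing.length);
        System.out.println(copied == backing);              // false: new array
        System.out.println(Arrays.equals(copied, backing)); // true: same bytes
    }
}
```

The safety caveat from the commit message applies here too: stealing the backing array is only sound because Parquet never reuses it after handing out the `Binary`.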
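The second optimization can likewise be sketched in plain Java (hypothetical illustration; the real change lives in Spark's Scala `Decimal` class): constructing a `java.math.BigDecimal` directly from the unscaled value and scale, rather than going through an intermediate wrapper object first.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DirectDecimalSketch {
    public static void main(String[] args) {
        long unscaled = 123456L;
        int scale = 2;

        // Direct construction from unscaled long + scale: no intermediate object.
        BigDecimal direct = BigDecimal.valueOf(unscaled, scale);

        // The indirect route materializes an extra BigInteger first,
        // loosely analogous to going through scala.math.BigDecimal.
        BigDecimal indirect = new BigDecimal(BigInteger.valueOf(unscaled), scale);

        System.out.println(direct);                  // 1234.56
        System.out.println(direct.equals(indirect)); // true: same value and scale
    }
}
```

On a hot per-row path such as decimal conversion, skipping even one wrapper allocation per value is where the reported ~5% gain plausibly comes from.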
Diffstat (limited to 'mllib/src')
0 files changed, 0 insertions, 0 deletions