path: root/mllib/src
author    Cheng Lian <lian@databricks.com>  2015-09-29 23:30:27 -0700
committer Cheng Lian <lian@databricks.com>  2015-09-29 23:30:27 -0700
commit    4d5a005b0d2591d4d57a19be48c4954b9f1434a9 (patch)
tree      2a2cb24819477e69c2d8f5c8a2a79a87b558859c /mllib/src
parent    c1ad373f26053e1906fce7681c03d130a642bf33 (diff)
[SPARK-10811] [SQL] Eliminates unnecessary byte array copying
When reading Parquet string and binary-backed decimal values, `Binary.getBytes` always returns a copied byte array, which is unnecessary. Since the underlying implementation of the `Binary` values there is guaranteed to be `ByteArraySliceBackedBinary`, and Parquet itself never reuses the underlying byte arrays, we can use `Binary.toByteBuffer.array()` to steal the underlying byte arrays without copying them. This brings a performance benefit when scanning Parquet string and binary-backed decimal columns. Note that this trick doesn't cover binary-backed decimals with precision greater than 18.

In my micro-benchmark, this brings a ~15% performance boost when scanning the TPC-DS `store_sales` table (scale factor 15).

Another minor optimization in this PR: `Decimal.toJavaBigDecimal` now constructs a Java `BigDecimal` directly, without constructing a Scala `BigDecimal` first. This brings another ~5% performance gain.

Author: Cheng Lian <lian@databricks.com>

Closes #8907 from liancheng/spark-10811/eliminate-array-copying.
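The zero-copy trick can be illustrated with plain `java.nio.ByteBuffer` — a minimal sketch of the idea, not the actual Parquet code path. A heap buffer created with `ByteBuffer.wrap` exposes its backing array via `array()` without copying, which is what `Binary.toByteBuffer.array()` relies on, whereas `Binary.getBytes` behaves like the defensive copy shown below:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ZeroCopySketch {
    public static void main(String[] args) {
        byte[] backing = {1, 2, 3, 4};

        // Wrapping does not copy: array() hands back the very same array,
        // analogous to Binary.toByteBuffer.array() on a slice-backed Binary.
        ByteBuffer buf = ByteBuffer.wrap(backing);
        System.out.println(buf.array() == backing);         // true: no copy

        // A defensive copy (what Binary.getBytes effectively does)
        // allocates a fresh array with the same contents.
        byte[] copied = Arrays.copyOf(backing, backing.length);
        System.out.println(copied == backing);              // false: new array
        System.out.println(Arrays.equals(copied, backing)); // true: same bytes
    }
}
```

The safety caveat from the commit message applies here too: stealing the backing array is only sound because Parquet never reuses it after handing out the `Binary`.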
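The second optimization can likewise be sketched in plain Java (hypothetical illustration; the real change lives in Spark's Scala `Decimal` class): constructing a `java.math.BigDecimal` directly from the unscaled value and scale, rather than going through an intermediate wrapper object first.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DirectDecimalSketch {
    public static void main(String[] args) {
        long unscaled = 123456L;
        int scale = 2;

        // Direct construction from unscaled long + scale: no intermediate object.
        BigDecimal direct = BigDecimal.valueOf(unscaled, scale);

        // The indirect route materializes an extra BigInteger first,
        // loosely analogous to going through scala.math.BigDecimal.
        BigDecimal indirect = new BigDecimal(BigInteger.valueOf(unscaled), scale);

        System.out.println(direct);                  // 1234.56
        System.out.println(direct.equals(indirect)); // true: same value and scale
    }
}
```

On a hot per-row path such as decimal conversion, skipping even one wrapper allocation per value is where the reported ~5% gain plausibly comes from.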
Diffstat (limited to 'mllib/src')
0 files changed, 0 insertions, 0 deletions