[SPARK-12879] [SQL] improve the unsafe row writing framework - spark


author	Wenchen Fan <wenchen@databricks.com>	2016-01-25 16:23:59 -0800
committer	Davies Liu <davies.liu@gmail.com>	2016-01-25 16:23:59 -0800
commit	be375fcbd200fb0e210b8edcfceb5a1bcdbba94b (patch)
tree	060e087c33e27b44b30fb97f9861c97d8a5a06af /data
parent	6f0f1d9e04a8db47e2f6f8fcfe9dea9de0f633da (diff)

[SPARK-12879] [SQL] improve the unsafe row writing framework

As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should document it better and make it easier to use.

This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, instead of always pointing the row to the buffer at the end, we only need to update the size of the row. If all fields are of primitive type, we can even skip the row size update. We can then apply this technique to more places easily.

A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:

**old version**
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             2616.04           102.61         1.00 X
single nullable long                    3032.54            88.52         0.86 X
primitive types                         9121.05            29.43         0.29 X
nullable primitive types               12410.60            21.63         0.21 X
```

**new version**
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             1533.34           175.07         1.00 X
single nullable long                    2306.73           116.37         0.66 X
primitive types                         8403.93            31.94         0.18 X
nullable primitive types               12448.39            21.56         0.12 X
```

Note that the Relative Rate column is relative to the `single long` row of the same table, so it cannot be compared across the two versions; compare Avg Time or Avg Rate instead.

For a single non-nullable long (the best case), we get about a 1.7x speedup. Even when it is nullable, we still get about a 1.3x speedup. For the other cases the improvement is much smaller, because the saved operations account for only a small proportion of the whole process.

The benchmark code is included in this PR.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10809 from cloud-fan/unsafe-projection.
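The technique described above (bind the row to the buffer once, update only the row size per record, and skip even that when every field is primitive) can be illustrated with a small self-contained sketch. This is not Spark's code: `ToyRow`, `ToyBufferHolder`, `projectLongs`, and `projectStrings` are hypothetical names invented for illustration, the layout is simplified to one 8-byte slot per field with no null bitmap, and the call counters exist only to make the saved work visible.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UnsafeRowWritingSketch {

    /** Hypothetical toy row: a view over a byte array plus a size in bytes. */
    static final class ToyRow {
        byte[] base;
        int sizeInBytes;
        int pointToCalls = 0;  // instrumentation only
        int sizeUpdates = 0;   // instrumentation only

        void pointTo(byte[] base, int sizeInBytes) {
            this.base = base;
            this.sizeInBytes = sizeInBytes;
            pointToCalls++;
        }

        void setTotalSize(int size) { this.sizeInBytes = size; sizeUpdates++; }

        long getLong(int ordinal) { return ByteBuffer.wrap(base).getLong(ordinal * 8); }

        String getString(int ordinal) {
            long slot = getLong(ordinal);          // high 32 bits: offset, low 32 bits: length
            return new String(base, (int) (slot >>> 32), (int) slot, StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical toy holder: owns a growable buffer and is bound to one row. */
    static final class ToyBufferHolder {
        final ToyRow row;
        final int fixedSize;   // 8 bytes per field (assumption: no null bitmap)
        byte[] buffer;
        int cursor;

        ToyBufferHolder(ToyRow row, int numFields) {
            this.row = row;
            this.fixedSize = numFields * 8;
            this.buffer = new byte[fixedSize + 16];
            this.cursor = fixedSize;
            row.pointTo(buffer, fixedSize);        // bind the row to the buffer once
        }

        void reset() { cursor = fixedSize; }

        int totalSize() { return cursor; }

        void grow(int needed) {
            if (cursor + needed > buffer.length) {
                buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, cursor + needed));
                row.pointTo(buffer, row.sizeInBytes);  // re-point only when the array changes
            }
        }

        void writeLong(int ordinal, long v) { ByteBuffer.wrap(buffer).putLong(ordinal * 8, v); }

        void writeBytes(int ordinal, byte[] v) {
            grow(v.length);
            System.arraycopy(v, 0, buffer, cursor, v.length);
            writeLong(ordinal, ((long) cursor << 32) | v.length);
            cursor += v.length;
        }
    }

    /** All fields primitive: the row size is fixed, so no per-record size update is needed. */
    static ToyRow projectLongs(long[] values) {
        ToyRow row = new ToyRow();
        ToyBufferHolder holder = new ToyBufferHolder(row, 1);
        for (long v : values) {
            holder.writeLong(0, v);                // the row already sees the buffer
        }
        return row;
    }

    /** Variable-length field: reset the cursor, write, then update only the row size. */
    static ToyRow projectStrings(String[] values) {
        ToyRow row = new ToyRow();
        ToyBufferHolder holder = new ToyBufferHolder(row, 1);
        for (String s : values) {
            holder.reset();
            holder.writeBytes(0, s.getBytes(StandardCharsets.UTF_8));
            row.setTotalSize(holder.totalSize());  // cheaper than re-pointing the row
        }
        return row;
    }

    public static void main(String[] args) {
        ToyRow longs = projectLongs(new long[] {1L, 2L, 3L});
        System.out.println("longs: pointTo=" + longs.pointToCalls + " sizeUpdates=" + longs.sizeUpdates);
        ToyRow strings = projectStrings(new String[] {"a", "bcd"});
        System.out.println("strings: pointTo=" + strings.pointToCalls + " sizeUpdates=" + strings.sizeUpdates);
    }
}
```

In this sketch the row is re-pointed only when the backing array is reallocated, which is the rare path. The per-record cost for variable-length data is a cursor reset plus one size update, and the all-primitive path does neither.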

Diffstat (limited to 'data')

0 files changed, 0 insertions, 0 deletions
