[SPARK-2443][SQL] Fix slow read from partitioned tables - spark

author	Zongheng Yang <zongheng.y@gmail.com>	2014-07-14 13:22:24 -0700
committer	Michael Armbrust <michael@databricks.com>	2014-07-14 13:22:24 -0700
commit	d60b09bb60cff106fa0acddebf35714503b20f03 (patch)
tree	ba3f8da65971bf5a6957d581dd22d2d44e8125e0 /docs/mllib-guide.md
parent	38ccd6ebd412cfbf82ae9d8a0998ff697db11455 (diff)

[SPARK-2443][SQL] Fix slow read from partitioned tables

This fix obtains a performance boost comparable to that of [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by yhuai. The results below are updated for this fix.

## Benchmarks

Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.

Without the fix:

Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)

With this fix:

Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1408 from concretevitamin/slow-read-2 and squashes the following commits:

d86e437 [Zongheng Yang] Move update & initialization out of potentially long loop.
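
The kind of change described in this commit message (hoisting per-partition work out of the per-row loop) can be illustrated with a minimal sketch. This is not the Spark code touched by the commit: it is written in Python rather than Scala, and `CountingDeserializer`, `read_partition_slow`, and `read_partition_fast` are hypothetical names invented for illustration. The sketch assumes that the partition values and the deserializer setup are the same for every row of a partition, so they only need to be handled once.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


class CountingDeserializer:
    """Hypothetical stand-in for a deserializer; counts how often it is initialized."""

    init_calls = 0

    def initialize(self) -> None:
        # Pretend this is expensive setup work.
        CountingDeserializer.init_calls += 1

    def deserialize(self, raw: str) -> List[str]:
        # Split a tab-separated record into its data columns.
        return raw.split("\t")


def read_partition_slow(
    raw_rows: Iterable[str], partition_values: Sequence[str], n_data_cols: int
) -> Iterator[Tuple[str, ...]]:
    """Repeats loop-invariant work (initialization, partition slots) for every row."""
    row: List = [None] * (n_data_cols + len(partition_values))
    for raw in raw_rows:
        deser = CountingDeserializer()
        deser.initialize()                     # invariant: same setup every time
        row[n_data_cols:] = partition_values   # invariant: same values every time
        row[:n_data_cols] = deser.deserialize(raw)
        yield tuple(row)


def read_partition_fast(
    raw_rows: Iterable[str], partition_values: Sequence[str], n_data_cols: int
) -> Iterator[Tuple[str, ...]]:
    """Does the loop-invariant work once, before iterating over the rows."""
    deser = CountingDeserializer()
    deser.initialize()                         # once per partition
    row: List = [None] * (n_data_cols + len(partition_values))
    row[n_data_cols:] = partition_values       # once per partition
    for raw in raw_rows:
        row[:n_data_cols] = deser.deserialize(raw)  # only per-row work remains
        yield tuple(row)


if __name__ == "__main__":
    rows = ["k1\tv1", "k2\tv2", "k3\tv3"]
    print(list(read_partition_fast(rows, ["2014-07-14"], 2)))
```

Both functions yield the same rows; the difference is that the slow variant initializes a deserializer and rewrites the partition slots once per row, while the fast variant does each once per partition. With a single partition of 10M rows, as in the benchmark setup described in the commit message, the per-row variant would repeat that work 10M times.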

Diffstat (limited to 'docs/mllib-guide.md')

0 files changed, 0 insertions, 0 deletions

