[SPARK-14259] [SQL] Merging small files together based on the cost of opening - spark

author	Davies Liu <davies@databricks.com>	2016-04-04 14:41:03 -0700
committer	Davies Liu <davies.liu@gmail.com>	2016-04-04 14:41:03 -0700
commit	400b2f863ffaa01a34a8dae1541c61526fef908b (patch)
tree	eb0773854538319d9534c2ebdb36a9eb65f513ae /project
parent	cc70f174169f45c85d459126a68bbe43c0bec328 (diff)
## What changes were proposed in this pull request?

This PR essentially redoes the work in #12068, but with a different model that should work better for small files of different sizes.

## How was this patch tested?

Updated existing tests.

Ran a query on thousands of partitioned small files locally, with all default settings (the cost to open a file should be overestimated). The task durations become smaller and smaller, which is good (the last few tasks will be the shortest).

Author: Davies Liu <davies@databricks.com>

Closes #12095 from davies/file_cost.
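
The commit message does not spell out the cost model, so the following is only a minimal sketch of one way to merge small files by charging a fixed cost for each file opened. It is not the code from this patch. The function name `pack_files` and the parameters `max_split_bytes` and `open_cost_bytes` are hypothetical, and the largest-first ordering is an assumption chosen because it would produce the shrinking task durations described above.

```python
from typing import List


def pack_files(file_sizes: List[int],
               max_split_bytes: int,
               open_cost_bytes: int) -> List[List[int]]:
    """Greedily group file sizes into partitions (hypothetical helper).

    Each file added to a partition is charged its size plus a fixed
    open cost, so many tiny files fill a partition faster than their
    raw byte count alone would suggest.
    """
    partitions: List[List[int]] = []
    current: List[int] = []
    current_cost = 0

    # Assumption: process the largest files first, so that later
    # partitions (and therefore later tasks) tend to be cheaper.
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition if this file would exceed the budget.
        if current and current_cost + size > max_split_bytes:
            partitions.append(current)
            current = []
            current_cost = 0
        current.append(size)
        current_cost += size + open_cost_bytes

    if current:
        partitions.append(current)
    return partitions


if __name__ == "__main__":
    # Four 10-byte files: without an open cost they share one partition;
    # with a 40-byte open cost only two fit per 100-byte budget.
    print(pack_files([10, 10, 10, 10], max_split_bytes=100, open_cost_bytes=0))
    print(pack_files([10, 10, 10, 10], max_split_bytes=100, open_cost_bytes=40))
```

In this sketch, overestimating the open cost yields more and smaller partitions, which favors parallelism, while underestimating it packs more tiny files into each task.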

Diffstat (limited to 'project')

0 files changed, 0 insertions, 0 deletions
