[SPARK-15307][SQL] speed up listing files for data source - spark


author	Davies Liu <davies@databricks.com>	2016-05-18 18:46:57 +0800
committer	Cheng Lian <lian@databricks.com>	2016-05-18 18:46:57 +0800
commit	33814f887aea339c99e14ce7f14ca6fcc6875015 (patch)
tree	d8c1ca64c13ebd7f21c3a333143e5b6d9fc355c8 /mllib
parent	6e02aec44b9e5bc2ada55cb612f26e6ba000c23e (diff)

[SPARK-15307][SQL] speed up listing files for data source

## What changes were proposed in this pull request?

Currently, listing files is very slow when there are thousands of files, especially on the local file system, because:

1) FileStatus.getPermission() is very slow on the local file system: it launches a subprocess and parses its stdout.
2) Creating a JobConf is very expensive (ClassUtil.findContainingJar() is slow).

This PR improves this by:

1) Using another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission(); the permissions are not used for data sources.
2) Creating the JobConf only once within each task.

## How was this patch tested?

Manually tested on a partitioned table with 1828 partitions: the time to load the table decreased from 22 seconds to 1.6 seconds (most of the time is now spent merging schemas).

Author: Davies Liu <davies@databricks.com>

Closes #13094 from davies/listing.
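The two optimizations described in this message can be illustrated with a small, self-contained sketch. It is not the code from this patch and does not use the Hadoop or Spark APIs: `ExpensiveConf` and `RawStatus` are hypothetical stand-ins for an object that is costly to construct and for a file status whose permission lookup is slow, and the counters exist only to make the saved work visible.

```python
class ExpensiveConf:
    """Hypothetical stand-in for an object that is costly to construct."""
    created = 0

    def __init__(self):
        ExpensiveConf.created += 1
        self.scheme = "file"


class RawStatus:
    """Hypothetical stand-in for a file status with a slow permission lookup."""
    permission_lookups = 0

    def __init__(self, path, length):
        self.path = path
        self.length = length

    def get_permission(self):
        # Imagine this launching a subprocess on a local file system.
        RawStatus.permission_lookups += 1
        return "rw-r--r--"


def list_per_file(statuses):
    """Baseline: build a config and read the permission for every file."""
    out = []
    for s in statuses:
        conf = ExpensiveConf()  # rebuilt on every iteration
        out.append((f"{conf.scheme}:{s.path}", s.length, s.get_permission()))
    return out


def list_per_task(statuses):
    """Optimized sketch: one config per task, permission never read."""
    conf = ExpensiveConf()  # built once, reused for every file
    # Copy only the fields the caller needs; the permission slot stays empty.
    return [(f"{conf.scheme}:{s.path}", s.length, None) for s in statuses]
```

With N files, the baseline constructs N configs and performs N permission lookups, while the optimized version constructs one config and performs no lookups. This is the shape of the saving the message describes, not a measurement of it.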

Diffstat (limited to 'mllib')

0 files changed, 0 insertions, 0 deletions