author | Davies Liu <davies@databricks.com> | 2016-05-18 18:46:57 +0800 |
---|---|---|
committer | Cheng Lian <lian@databricks.com> | 2016-05-18 18:46:57 +0800 |
commit | 33814f887aea339c99e14ce7f14ca6fcc6875015 (patch) | |
tree | d8c1ca64c13ebd7f21c3a333143e5b6d9fc355c8 /mllib | |
parent | 6e02aec44b9e5bc2ada55cb612f26e6ba000c23e (diff) | |
[SPARK-15307][SQL] speed up listing files for data source
## What changes were proposed in this pull request?
Currently, listing files is very slow when there are thousands of files, especially on the local file system, because:
1) FileStatus.getPermission() is very slow on the local file system, because it launches a subprocess and parses its stdout.
2) Creating a JobConf is very expensive (ClassUtil.findContainingJar() is slow).
This PR improves these by:
1) Using another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission(); the permissions are not used for data sources.
2) Creating a JobConf only once within one task.
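The second change can be illustrated with a minimal, dependency-free sketch. It is not the code from this patch: `ExpensiveConf` is a hypothetical stand-in for an object that is costly to build (such as a JobConf), and `listPerPath` / `listPerTask` are made-up names. The construction counter only exists to show how many times the costly object gets built in each variant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListingSketch {
    /** Hypothetical stand-in for an expensive-to-construct object such as a JobConf. */
    static final class ExpensiveConf {
        static int constructions = 0;

        ExpensiveConf() {
            constructions++; // in the real case, this is where the slow work would happen
        }

        /** Toy filter: skip names starting with an underscore. */
        boolean accepts(String path) {
            return !path.startsWith("_");
        }
    }

    /** Before: one ExpensiveConf is built for every path that is listed. */
    static List<String> listPerPath(List<String> paths) {
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            ExpensiveConf conf = new ExpensiveConf();
            if (conf.accepts(p)) {
                kept.add(p);
            }
        }
        return kept;
    }

    /** After: one ExpensiveConf is built per call (per task) and reused for all paths. */
    static List<String> listPerTask(List<String> paths) {
        ExpensiveConf conf = new ExpensiveConf();
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            if (conf.accepts(p)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("part=1", "_temporary", "part=2");

        ExpensiveConf.constructions = 0;
        listPerPath(paths);
        System.out.println("per path: " + ExpensiveConf.constructions + " constructions");

        ExpensiveConf.constructions = 0;
        listPerTask(paths);
        System.out.println("per task: " + ExpensiveConf.constructions + " constructions");
    }
}
```

Both variants return the same paths; only the number of constructions differs (one per path versus one per call), which is why the saving grows with the number of files or partitions being listed.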
## How was this patch tested?
Manually tested on a partitioned table with 1828 partitions; the time to load the table decreased from 22 seconds to 1.6 seconds (most of the time is now spent in merging schemas).
Author: Davies Liu <davies@databricks.com>
Closes #13094 from davies/listing.
Diffstat (limited to 'mllib')
0 files changed, 0 insertions, 0 deletions