path: root/mllib
author: Davies Liu <davies@databricks.com> 2016-05-18 18:46:57 +0800
committer: Cheng Lian <lian@databricks.com> 2016-05-18 18:46:57 +0800
commit: 33814f887aea339c99e14ce7f14ca6fcc6875015
tree: d8c1ca64c13ebd7f21c3a333143e5b6d9fc355c8 /mllib
parent: 6e02aec44b9e5bc2ada55cb612f26e6ba000c23e
[SPARK-15307][SQL] speed up listing files for data source
## What changes were proposed in this pull request?

Currently, listing files is very slow when there are thousands of files, especially on the local file system, because:

1) FileStatus.getPermission() is very slow on the local file system: it launches a subprocess and parses its stdout.
2) Creating a JobConf is very expensive (ClassUtil.findContainingJar() is slow).

This PR improves both by:

1) Using another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission(); the permissions are not used for data sources.
2) Creating a JobConf only once within one task.

## How was this patch tested?

Manually tested on a partitioned table with 1828 partitions: the time to load the table decreased from 22 seconds to 1.6 seconds (most of the time is now spent merging schemas).

Author: Davies Liu <davies@databricks.com>

Closes #13094 from davies/listing.
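The second change (building the expensive configuration object once per task instead of once per file) can be illustrated with a minimal sketch. This is not the Spark or Hadoop code: `ExpensiveConf`, `list_files_naive`, and `list_files_once_per_task` are hypothetical names, and the construction counter only stands in for the real cost of creating a JobConf.

```python
# Hypothetical sketch: none of these names come from Spark or Hadoop.


class ExpensiveConf:
    """Stand-in for a configuration object that is costly to build."""

    created = 0  # counts constructions so the saving is observable

    def __init__(self):
        ExpensiveConf.created += 1


def list_files_naive(paths):
    """Builds a fresh configuration for every path."""
    statuses = []
    for path in paths:
        conf = ExpensiveConf()  # repeated cost: one construction per path
        statuses.append((path, conf))
    return statuses


def list_files_once_per_task(paths):
    """Builds the configuration once and reuses it for every path."""
    conf = ExpensiveConf()  # single construction for the whole task
    return [(path, conf) for path in paths]


# 1828 matches the partition count mentioned in the test description.
paths = [f"/table/part={i}" for i in range(1828)]

ExpensiveConf.created = 0
list_files_naive(paths)
naive_count = ExpensiveConf.created

ExpensiveConf.created = 0
list_files_once_per_task(paths)
hoisted_count = ExpensiveConf.created

print(naive_count, hoisted_count)
```

The sketch only counts constructions; it says nothing about the actual timings, which depend on the file system and on how expensive the real object is to create.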
Diffstat (limited to 'mllib')
0 files changed, 0 insertions, 0 deletions