author: Cheng Lian <lian@databricks.com> 2016-04-21 21:48:09 -0700
committer: Davies Liu <davies.liu@gmail.com> 2016-04-21 21:48:09 -0700
commit: 145433f1aaf4a58f484f98c2f1d32abd8cc95b48 (patch)
tree: e910fe2d65ccc045c2f0b1d8c2e796ae1bd84975 /repl/src/main/scala/org/apache
parent: b29bc3f51518806ef7827b35df7c8aada329f961 (diff)
[SPARK-14369] [SQL] Locality support for FileScanRDD
(This PR is a rebased version of PR #12153.)

## What changes were proposed in this pull request?

This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts:

1. Block location lookup

   Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog`, and converts all `FileStatus`es into `LocatedFileStatus`es.

   Note that although the S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either, so there's no need to special-case these file systems.

2. Selecting preferred locations

   For each `FilePartition`, we pick the top 3 locations that contain the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs.

## How was this patch tested?

Tested by overriding the default `FileSystem` implementation for `file:///` with a mocked one that returns mocked block locations.

Author: Cheng Lian <lian@databricks.com>

Closes #12527 from liancheng/spark-14369-locality-rebased.
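The "top 3 locations" selection described in part 2 can be illustrated with a minimal sketch. This is not the code from the patch: `Block`, `FileSlice`, `LocalitySketch`, `overlap`, and `preferredHosts` are hypothetical, simplified stand-ins for the Hadoop and Spark types, and the tie-breaking by host name is an assumption made only to keep the result deterministic.

```scala
// Hypothetical, simplified stand-ins; these are NOT the real Hadoop/Spark classes.
// A block of a file and the hosts that store a replica of it.
final case class Block(offset: Long, length: Long, hosts: Seq[String])
// The byte range of one file that a partition reads, plus that file's blocks.
final case class FileSlice(start: Long, length: Long, blocks: Seq[Block])

object LocalitySketch {
  /** Number of bytes shared by [aStart, aStart + aLen) and [bStart, bStart + bLen). */
  def overlap(aStart: Long, aLen: Long, bStart: Long, bLen: Long): Long =
    math.max(0L, math.min(aStart + aLen, bStart + bLen) - math.max(aStart, bStart))

  /** Hosts ranked by how many bytes of the partition they hold, largest first. */
  def preferredHosts(slices: Seq[FileSlice], topN: Int = 3): Seq[String] = {
    // Credit every replica host with the bytes of each block the slice actually reads.
    val bytesPerHost = for {
      slice <- slices
      block <- slice.blocks
      bytes = overlap(slice.start, slice.length, block.offset, block.length)
      if bytes > 0
      host <- block.hosts
    } yield host -> bytes

    bytesPerHost
      .groupBy(_._1)
      .map { case (host, pairs) => host -> pairs.map(_._2).sum }
      .toSeq
      .sortBy { case (host, bytes) => (-bytes, host) } // assumption: ties broken by host name
      .take(topN)
      .map(_._1)
  }
}
```

Counting only the bytes where a block overlaps the slice means a host holding a large block that the partition barely touches is not over-weighted.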
Diffstat (limited to 'repl/src/main/scala/org/apache')
0 files changed, 0 insertions, 0 deletions