diff options
author | Cheng Lian <lian@databricks.com> | 2016-04-21 21:48:09 -0700 |
---|---|---|
committer | Davies Liu <davies.liu@gmail.com> | 2016-04-21 21:48:09 -0700 |
commit | 145433f1aaf4a58f484f98c2f1d32abd8cc95b48 (patch) | |
tree | e910fe2d65ccc045c2f0b1d8c2e796ae1bd84975 /repl/src/main | |
parent | b29bc3f51518806ef7827b35df7c8aada329f961 (diff) | |
download | spark-145433f1aaf4a58f484f98c2f1d32abd8cc95b48.tar.gz spark-145433f1aaf4a58f484f98c2f1d32abd8cc95b48.tar.bz2 spark-145433f1aaf4a58f484f98c2f1d32abd8cc95b48.zip |
[SPARK-14369] [SQL] Locality support for FileScanRDD
(This PR is a rebased version of PR #12153.)
## What changes were proposed in this pull request?
This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts:
1. Block location lookup
Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and convert all `FileStatus`es into `LocatedFileStatus`es.
Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems.
2. Selecting preferred locations
For each `FilePartition`, we pick up top 3 locations that containing the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs.
## How was this patch tested?
Tested by overriding default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations.
Author: Cheng Lian <lian@databricks.com>
Closes #12527 from liancheng/spark-14369-locality-rebased.
Diffstat (limited to 'repl/src/main')
0 files changed, 0 insertions, 0 deletions