author | windpiger <songjun@outlook.com> | 2017-03-02 23:54:01 -0800 |
---|---|---|
committer | Wenchen Fan <wenchen@databricks.com> | 2017-03-02 23:54:01 -0800 |
commit | 982f3223b4f55f988091402063fe8746c5e2cee4 (patch) | |
tree | 3054ad65a839775ae1478e2cf1eadadd4373ee7e /sql/hive | |
parent | e24f21b5f8365ed25346e986748b393e0b4be25c (diff) | |
download | spark-982f3223b4f55f988091402063fe8746c5e2cee4.tar.gz spark-982f3223b4f55f988091402063fe8746c5e2cee4.tar.bz2 spark-982f3223b4f55f988091402063fe8746c5e2cee4.zip |
[SPARK-18726][SQL] resolveRelation for FileFormat DataSource don't need to listFiles twice
## What changes were proposed in this pull request?
Currently, when we resolve a relation for a `FileFormat DataSource` without a user-provided schema, `listFiles` is executed twice in `InMemoryFileIndex` during `resolveRelation`.
This PR adds a `FileStatusCache` to `DataSource`, which avoids listing the files twice.
However, there is a bug in `InMemoryFileIndex`; see:
[SPARK-19748](https://github.com/apache/spark/pull/17079)
[SPARK-19761](https://github.com/apache/spark/pull/17093),
so this PR should land after SPARK-19748 and SPARK-19761.
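The caching idea can be illustrated in isolation: a shared cache memoizes the result of the first file listing, so a second resolution pass does not touch the filesystem again. The sketch below is a simplified stand-in, not Spark's actual implementation — the names `FileStatusCache` and `listLeafFiles` mirror the Spark internals, but the bodies here are illustrative only:

```scala
import scala.collection.mutable

// Simplified stand-in for Spark's FileStatusCache: memoizes path -> listing.
class FileStatusCache {
  private val cache = mutable.Map.empty[String, Seq[String]]
  var filesystemScans = 0 // counts how often we actually "hit the filesystem"

  def listLeafFiles(path: String): Seq[String] =
    cache.getOrElseUpdate(path, {
      filesystemScans += 1
      Seq(s"$path/part-00000.parquet") // pretend filesystem listing
    })
}

object ResolveRelationDemo extends App {
  val cache = new FileStatusCache
  // resolveRelation effectively lists files twice (schema inference, then
  // building the file index); with a shared cache only the first call scans.
  cache.listLeafFiles("/tmp/data")
  cache.listLeafFiles("/tmp/data")
  println(s"filesystem scans: ${cache.filesystemScans}")
}
```

With the shared cache, the second `listLeafFiles` call is served from memory, so the scan counter stays at 1 — which is exactly what the new unit test asserts via `HiveCatalogMetrics`.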
## How was this patch tested?
A unit test was added.
Author: windpiger <songjun@outlook.com>
Closes #17081 from windpiger/resolveDataSourceScanFilesTwice.
Diffstat (limited to 'sql/hive')
-rw-r--r-- | sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala | 11 |
1 file changed, 11 insertions, 0 deletions
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
index b792a168a4..50506197b3 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
@@ -411,4 +411,15 @@ class PartitionedTablePerfStatsSuite
       }
     }
   }
+
+  test("resolveRelation for a FileFormat DataSource without userSchema scan filesystem only once") {
+    withTempDir { dir =>
+      import spark.implicits._
+      Seq(1).toDF("a").write.mode("overwrite").save(dir.getAbsolutePath)
+      HiveCatalogMetrics.reset()
+      spark.read.parquet(dir.getAbsolutePath)
+      assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 1)
+      assert(HiveCatalogMetrics.METRIC_FILE_CACHE_HITS.getCount() == 1)
+    }
+  }
 }