author: Eric Liang <ekl@databricks.com> 2016-10-22 22:08:28 +0800
committer: Wenchen Fan <wenchen@databricks.com> 2016-10-22 22:08:28 +0800
commit: 3eca283aca68ac81c127d60ad5699f854d5f14b7 (patch)
tree: 1846e569ede3f7774b9fca2d21c5b85dec2b885d /core/src
parent: ab3363e9f6b1f7fc26682509fe7382c570f91778 (diff)
[SPARK-17994][SQL] Add back a file status cache for catalog tables
## What changes were proposed in this pull request?
In SPARK-16980, we removed the full in-memory cache of table partitions in favor of loading only needed partitions from the metastore. This greatly improves the initial latency of queries that only read a small fraction of table partitions.
However, since the metastore does not store file statistics, we need to discover those from remote storage. With the loss of the in-memory file status cache, this discovery has to happen on each query, increasing the latency of repeated queries over the same partitions.
The proposal is to add back a per-table cache of partition contents, i.e. a `Map[Path, Array[FileStatus]]`. The cache would be retained per table and can be invalidated through `refreshTable()` and `refreshByPath()`. Unlike the prior cache, it can be incrementally updated as new partitions are read.
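A minimal sketch of the per-table cache shape described above follows. The class name `FileStatusCache`, the method names, and the invalidation hook are illustrative assumptions for this PR description, not the exact implementation:

```scala
import scala.collection.concurrent.TrieMap

// Stand-ins for org.apache.hadoop.fs.Path / FileStatus so the sketch is
// self-contained; the real cache would use the Hadoop types directly.
case class Path(uri: String)
case class FileStatus(path: Path, length: Long)

// Hypothetical sketch of a per-table file status cache keyed by partition
// path; eviction policy and concurrency details are assumptions.
class FileStatusCache {
  private val cache = new TrieMap[Path, Array[FileStatus]]()

  // Return the cached listing for a partition path, or None on a miss.
  def getLeafFiles(path: Path): Option[Array[FileStatus]] = cache.get(path)

  // Incrementally record the listing for a newly read partition.
  def putLeafFiles(path: Path, files: Array[FileStatus]): Unit =
    cache.put(path, files)

  // Called from refreshTable() / refreshByPath() to drop stale entries.
  def invalidateAll(): Unit = cache.clear()
}

object FileStatusCacheDemo extends App {
  val c = new FileStatusCache
  val p = Path("s3://bucket/table/part=1")
  c.putLeafFiles(p, Array(FileStatus(p, 1024L)))
  println(c.getLeafFiles(p).map(_.length).getOrElse(0))   // cache hit
  c.invalidateAll()
  println(c.getLeafFiles(p).isDefined)                    // miss after refresh
}
```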
## How was this patch tested?
Existing tests and new tests in `HiveTablePerfStatsSuite`.
cc mallman
Author: Eric Liang <ekl@databricks.com>
Author: Michael Allman <michael@videoamp.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes #15539 from ericl/meta-cache.
Diffstat (limited to 'core/src')
core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala | 7 +++++++
1 file changed, 7 insertions(+), 0 deletions(-)
```diff
diff --git a/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala b/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala
index cf92a10dea..b54885b7ff 100644
--- a/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala
+++ b/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala
@@ -81,14 +81,21 @@ object HiveCatalogMetrics extends Source {
   val METRIC_FILES_DISCOVERED = metricRegistry.counter(MetricRegistry.name("filesDiscovered"))
 
   /**
+   * Tracks the total number of files served from the file status cache instead of discovered.
+   */
+  val METRIC_FILE_CACHE_HITS = metricRegistry.counter(MetricRegistry.name("fileCacheHits"))
+
+  /**
    * Resets the values of all metrics to zero. This is useful in tests.
    */
   def reset(): Unit = {
     METRIC_PARTITIONS_FETCHED.dec(METRIC_PARTITIONS_FETCHED.getCount())
     METRIC_FILES_DISCOVERED.dec(METRIC_FILES_DISCOVERED.getCount())
+    METRIC_FILE_CACHE_HITS.dec(METRIC_FILE_CACHE_HITS.getCount())
   }
 
   // clients can use these to avoid classloader issues with the codahale classes
   def incrementFetchedPartitions(n: Int): Unit = METRIC_PARTITIONS_FETCHED.inc(n)
   def incrementFilesDiscovered(n: Int): Unit = METRIC_FILES_DISCOVERED.inc(n)
+  def incrementFileCacheHits(n: Int): Unit = METRIC_FILE_CACHE_HITS.inc(n)
 }
```
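The counter added in this diff can be exercised roughly as follows. `reset()`, `incrementFileCacheHits(n)`, and `METRIC_FILE_CACHE_HITS` come from the patched `StaticSources.scala`; the bare assertion style is only a sketch of what a test such as `HiveTablePerfStatsSuite` might do:

```scala
// Sketch: exercising the new file-cache-hit counter from test code.
// The metric object is the one defined in StaticSources.scala above.
HiveCatalogMetrics.reset()
assert(HiveCatalogMetrics.METRIC_FILE_CACHE_HITS.getCount() == 0)

// Simulate the listing code path recording 5 cache hits.
HiveCatalogMetrics.incrementFileCacheHits(5)
assert(HiveCatalogMetrics.METRIC_FILE_CACHE_HITS.getCount() == 5)

// reset() brings all counters back to zero between tests.
HiveCatalogMetrics.reset()
assert(HiveCatalogMetrics.METRIC_FILE_CACHE_HITS.getCount() == 0)
```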