[SPARK-3091] [SQL] Add support for caching metadata on Parquet files - spark

author	Matei Zaharia <matei@databricks.com>	2014-08-18 11:00:10 -0700
committer	Michael Armbrust <michael@databricks.com>	2014-08-18 11:00:10 -0700
commit	9eb74c7d2cbe127dd4c32bf1a8318497b2fb55b6 (patch)
tree	85a180feecc5b4770933a470a0444ae16127ad8e /project
parent	6bca8898a1aa4ca7161492229bac1748b3da2ad7 (diff)

[SPARK-3091] [SQL] Add support for caching metadata on Parquet files

For larger Parquet files, reading the file footers (which is done in parallel on up to 5 threads) and HDFS block locations (which is serial) can take multiple seconds. We can add an option to cache this data within FilteringParquetInputFormat. Unfortunately ParquetInputFormat only caches footers within each instance of ParquetInputFormat, not across them.

Note: this PR leaves this turned off by default for 1.1, but I believe it's safe to turn it on after. The keys in the hash maps are FileStatus objects that include a modification time, so this will work fine if files are modified. The location cache could become invalid if files have moved within HDFS, but that's rare so I just made it invalidate entries every 15 minutes.

Author: Matei Zaharia <matei@databricks.com>

Closes #2005 from mateiz/parquet-cache and squashes the following commits:

dae8efe [Matei Zaharia] Bug fix
c71e9ed [Matei Zaharia] Handle empty statuses directly
22072b0 [Matei Zaharia] Use Guava caches and add a config option for caching metadata
8fb56ce [Matei Zaharia] Cache file block locations too
453bd21 [Matei Zaharia] Bug fix
4094df6 [Matei Zaharia] First attempt at caching Parquet footers
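
The caching idea in the message above can be sketched roughly as follows. This is not the code from the commit, which uses Guava caches inside FilteringParquetInputFormat; it is a standard-library stand-in that assumes a hypothetical `FileKey` (path plus modification time) in place of Hadoop's `FileStatus`, and a hypothetical `ExpiringCache` class with an injectable clock. It only illustrates the two properties the message relies on: a modified file produces a different key, and entries are reloaded once they are 15 minutes old.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical cache key: a file path together with its modification time. */
final class FileKey {
    final String path;
    final long modificationTime;

    FileKey(String path, long modificationTime) {
        this.path = path;
        this.modificationTime = modificationTime;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof FileKey)) {
            return false;
        }
        FileKey that = (FileKey) other;
        return path.equals(that.path) && modificationTime == that.modificationTime;
    }

    @Override
    public int hashCode() {
        return Objects.hash(path, modificationTime);
    }
}

/** Hypothetical time-expiring cache; a simplified stand-in for a Guava cache. */
final class ExpiringCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long loadedAtMillis;

        Entry(T value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;

    ExpiringCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, calling the loader when the entry is missing or too old. */
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, old) ->
            (old != null && now - old.loadedAtMillis < ttlMillis)
                ? old
                : new Entry<>(loader.apply(k), now));
        return entry.value;
    }
}
```

A caller would create one shared instance, for example `new ExpiringCache<FileKey, String>(15 * 60 * 1000L, System::currentTimeMillis)`, and pass the expensive lookup (reading a footer or fetching block locations) as the loader. Sharing the instance across input format objects is what avoids the per-instance limitation the message describes.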

Diffstat (limited to 'project')

0 files changed, 0 insertions, 0 deletions

