author    Matei Zaharia <matei@eecs.berkeley.edu>  2013-10-05 19:28:55 -0700
committer Matei Zaharia <matei@eecs.berkeley.edu>  2013-10-05 19:28:55 -0700
commit    4a25b116d4e451afdf10fc4f018c383ed2c7789a (patch)
tree      b495bf796170a7c4608ac5332539bbe103effb7c /tools
parent    8fc68d04bdea2a9bb895cb149b1b8e77d2ce0c19 (diff)
parent    6a2bbec5e3840cea5c128d521fe91050de8689db (diff)
download  spark-4a25b116d4e451afdf10fc4f018c383ed2c7789a.tar.gz
          spark-4a25b116d4e451afdf10fc4f018c383ed2c7789a.tar.bz2
          spark-4a25b116d4e451afdf10fc4f018c383ed2c7789a.zip
Merge pull request #20 from harveyfeng/hadoop-config-cache
Allow users to pass broadcasted Configurations and cache InputFormats across Hadoop file reads.

Note: originally from https://github.com/mesos/spark/pull/942

Currently motivated by Shark queries on Hive-partitioned tables, where there is a JobConf broadcast for every Hive partition (i.e., every subdirectory read). The only thing that differs between those JobConfs is the input path; the Hadoop Configuration that the JobConfs are constructed from remains the same. This PR only modifies the old Hadoop API RDDs, but similar additions to the new API might reduce computation latencies a little for high-frequency FileInputDStreams (which currently only use the new API).

As a small bonus, added InputFormat caching, to avoid reflection calls on every RDD#compute().

A few other notes:
- Added a general soft-reference hashmap in SparkHadoopUtil because I wanted to avoid adding another class to SparkEnv.
- SparkContext's default hadoopConfiguration isn't cached. There is no equals() method on Configuration, so there is no good way to determine when configuration properties have changed.
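The actual SparkHadoopUtil map is not shown here; as a rough illustration of the idea only (the class and method names below are hypothetical, not Spark's API), a soft-reference cache lets the JVM reclaim cached entries under memory pressure, in which case the value is simply recomputed on the next lookup:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a soft-reference value cache, similar in spirit to
// the general hashmap described above: values are held via SoftReference, so
// the garbage collector may clear them when memory is tight, and a cleared
// entry is transparently recomputed.
public class SoftRefCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    // Return the cached value for `key`, recomputing it with `compute`
    // if it was never cached or its soft reference has been cleared.
    public synchronized V getOrCompute(K key, Function<K, V> compute) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = compute.apply(key);
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }

    public static void main(String[] args) {
        SoftRefCache<String, String> cache = new SoftRefCache<>();
        // First lookup computes and caches the value.
        String a = cache.getOrCompute("k", k -> k + "-value");
        // Second lookup hits the cache; the alternate function is not called.
        String b = cache.getOrCompute("k", k -> k + "-other");
        System.out.println(a.equals(b) && a.equals("k-value")); // prints "true"
    }
}
```

Under this scheme, a cached InputFormat or Configuration-derived object that the GC reclaims costs only one extra reflection call or reconstruction, rather than a memory leak.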
Diffstat (limited to 'tools')
0 files changed, 0 insertions, 0 deletions