author    Tathagata Das <tathagata.das1565@gmail.com>  2015-12-01 21:04:52 -0800
committer Shixiong Zhu <shixiong@databricks.com>       2015-12-01 21:04:52 -0800
commit 8a75a3049539eeef04c0db51736e97070c162b46 (patch)
tree   1721ce076c393cd28288cb33f85ad104bfbcdd5b /streaming
parent 96691feae0229fd693c29475620be2c4059dd080 (diff)
[SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles
The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places:

* The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is launched.
* The JobConf is serialized as part of the DStream checkpoints.

These concurrent accesses (updating in one thread while another thread is serializing it) can lead to a ConcurrentModificationException in the underlying Java HashMap used in the internal Hadoop Configuration object.

The solution is to create a new JobConf in every batch, which is updated by `RDD.saveAsHadoopFile()`, while the checkpointing serializes the original JobConf.

Tests to be added in #9988 will fail reliably without this patch. This patch is kept deliberately small so that it can be backported to previous branches.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #10088 from tdas/SPARK-12087.
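To make the failure mode concrete, below is a minimal, self-contained sketch (not part of the patch; the object name, thread structure, and property key are invented for illustration) of the race the message describes: one thread mutates a shared JobConf while another serializes it through Hadoop's Writable interface, whose write method iterates the conf's backing java.util.HashMap and can therefore throw ConcurrentModificationException.

    import java.io.{ByteArrayOutputStream, DataOutputStream}
    import org.apache.hadoop.mapred.JobConf

    object JobConfRaceSketch {
      def main(args: Array[String]): Unit = {
        val shared = new JobConf()

        // Simulates RDD.saveAsHadoopFile() mutating the conf for each batch.
        val writer = new Thread(new Runnable {
          def run(): Unit = {
            var i = 0
            while (i < 100000) {
              shared.set("example.batch.property." + i, i.toString)  // hypothetical key
              i += 1
            }
          }
        })

        // Simulates DStream checkpointing serializing the same conf concurrently.
        // Configuration.write(DataOutput) iterates the conf's internal properties map.
        val serializer = new Thread(new Runnable {
          def run(): Unit = {
            var i = 0
            while (i < 1000) {
              shared.write(new DataOutputStream(new ByteArrayOutputStream()))
              i += 1
            }
          }
        })

        writer.start(); serializer.start()
        writer.join(); serializer.join()
        // With unlucky timing, the serializer thread dies with a
        // ConcurrentModificationException raised inside Configuration.write.
      }
    }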
Diffstat (limited to 'streaming')
-rw-r--r--  streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala  3
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala b/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala
index fb691eed27..2762309134 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala
@@ -730,7 +730,8 @@ class PairDStreamFunctions[K, V](self: DStream[(K, V)])
     val serializableConf = new SerializableJobConf(conf)
     val saveFunc = (rdd: RDD[(K, V)], time: Time) => {
       val file = rddToFileName(prefix, suffix, time)
-      rdd.saveAsHadoopFile(file, keyClass, valueClass, outputFormatClass, serializableConf.value)
+      rdd.saveAsHadoopFile(file, keyClass, valueClass, outputFormatClass,
+        new JobConf(serializableConf.value))
     }
     self.foreachRDD(saveFunc)
   }
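For reference, the one-line fix works because `JobConf(Configuration)` is a copy constructor: it snapshots the entries of the passed-in conf, so the per-batch copy can be mutated freely while checkpointing serializes the untouched original. A short sketch of that behavior, with made-up property values:

    import org.apache.hadoop.mapred.JobConf

    object DefensiveCopySketch {
      def main(args: Array[String]): Unit = {
        val original = new JobConf()
        original.set("fs.defaultFS", "hdfs://namenode:8020")  // hypothetical value

        // new JobConf(Configuration) copies the entries of `original`,
        // so per-batch mutation happens on a private copy.
        val perBatch = new JobConf(original)
        perBatch.set("mapreduce.output.fileoutputformat.outputdir", "/out/batch-1")

        // The original conf (the one the checkpoint serializes) is untouched.
        assert(original.get("mapreduce.output.fileoutputformat.outputdir") == null)
        assert(perBatch.get("fs.defaultFS") == "hdfs://namenode:8020")
      }
    }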