author     Patrick Wendell <pwendell@gmail.com>    2014-01-10 16:25:44 -0800
committer  Patrick Wendell <pwendell@gmail.com>    2014-01-10 16:25:44 -0800
commit     f26553102c1995acf2a2ba6b502de4f2dbbd73b3 (patch)
tree       41f49a631d96befabec2be8d8ba8803ac33c20e1 /core/src/main
parent     d37408f39ca3fd94f45b50a65f919f4d7007a533 (diff)
parent     4f39e79c23b32a411a0d5fdc86b5c17ab2250f8d (diff)
Merge pull request #383 from tdas/driver-test
API for automatic driver recovery for streaming programs and other bug fixes

1. Added Scala and Java APIs for automatically loading a checkpoint if one exists in the provided checkpoint directory.
   Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext.
   Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)` returns a JavaStreamingContext.
   See the RecoverableNetworkWordCount example for how to use it; a usage sketch also follows below.
2. Refactored the streaming.Checkpoint* code to fix bugs and make DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files, and it ensures that spark.driver.* and spark.hostPort are cleared from the SparkConf before it is written to the checkpoint.
3. Fixed a bug in the cleanup of checkpointed RDDs created by a DStream. Specifically, this fix ensures that a checkpointed RDD's files are not prematurely cleaned up, thus ensuring reliable recovery.
4. TimeStampedHashMap was upgraded to optionally update a key's timestamp on map.get(key). This allows clearing of data based on access time, i.e., clearing records that were last accessed before a threshold timestamp.
5. Added caching of file modification times in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the modification times to find new files can take seconds when there are thousands of files. This cache is cleared automatically.

This PR is not entirely final, as I may make some minor additions: a Java example, and adding StreamingContext.getOrCreate to a unit test. Edit: the Java example will be added later; the unit test has been added.
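To illustrate the recovery API from point 1, here is a minimal Scala sketch of how `StreamingContext.getOrCreate` might be used, modeled loosely on RecoverableNetworkWordCount. The checkpoint directory, app name, host, and port are placeholder values, not part of this commit:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations

object RecoverableApp {
  val checkpointDir = "/tmp/streaming-checkpoint"  // placeholder path

  // Called only when no checkpoint exists in checkpointDir; otherwise the
  // StreamingContext is reconstructed from the checkpointed metadata.
  def createContext(): StreamingContext = {
    val ssc = new StreamingContext("local[2]", "RecoverableApp", Seconds(1))
    ssc.checkpoint(checkpointDir)  // enable metadata checkpointing
    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()
    ssc
  }

  def main(args: Array[String]) {
    // Recover from the checkpoint if present, else build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that all DStream setup happens inside the creating function: on recovery the function is not called, and the driver's streaming computation is restored entirely from the checkpoint.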
Diffstat (limited to 'core/src/main')
-rw-r--r--  core/src/main/scala/org/apache/spark/SparkContext.scala             |  2
-rw-r--r--  core/src/main/scala/org/apache/spark/util/TimeStampedHashMap.scala  | 17
2 files changed, 13 insertions, 6 deletions
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 1647d904a2..139048d5c7 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -1086,7 +1086,7 @@ object SparkContext {
* parameters that are passed as the default value of null, instead of throwing an exception
* like SparkConf would.
*/
- private def updatedConf(
+ private[spark] def updatedConf(
conf: SparkConf,
master: String,
appName: String,
diff --git a/core/src/main/scala/org/apache/spark/util/TimeStampedHashMap.scala b/core/src/main/scala/org/apache/spark/util/TimeStampedHashMap.scala
index 181ae2fd45..8e07a0f29a 100644
--- a/core/src/main/scala/org/apache/spark/util/TimeStampedHashMap.scala
+++ b/core/src/main/scala/org/apache/spark/util/TimeStampedHashMap.scala
@@ -26,16 +26,23 @@ import org.apache.spark.Logging
/**
* This is a custom implementation of scala.collection.mutable.Map which stores the insertion
- * time stamp along with each key-value pair. Key-value pairs that are older than a particular
- * threshold time can them be removed using the clearOldValues method. This is intended to be a drop-in
- * replacement of scala.collection.mutable.HashMap.
+ * timestamp along with each key-value pair. If specified, the timestamp of each pair can be
+ * updated every time it is accessed. Key-value pairs whose timestamp are older than a particular
+ * threshold time can then be removed using the clearOldValues method. This is intended to
+ * be a drop-in replacement of scala.collection.mutable.HashMap.
+ * @param updateTimeStampOnGet When enabled, the timestamp of a pair will be
+ * updated when it is accessed
*/
-class TimeStampedHashMap[A, B] extends Map[A, B]() with Logging {
+class TimeStampedHashMap[A, B](updateTimeStampOnGet: Boolean = false)
+ extends Map[A, B]() with Logging {
val internalMap = new ConcurrentHashMap[A, (B, Long)]()
def get(key: A): Option[B] = {
val value = internalMap.get(key)
- if (value != null) Some(value._1) else None
+ if (value != null && updateTimeStampOnGet) {
+ internalMap.replace(key, value, (value._1, currentTime))
+ }
+ Option(value).map(_._1)
}
def iterator: Iterator[(A, B)] = {
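
For point 4 above, a small sketch of the new access-time behavior, assuming the `clearOldValues(threshTime)` method named in the class's doc comment; the keys and values are placeholders:

```scala
import org.apache.spark.util.TimeStampedHashMap

// With updateTimeStampOnGet = true, each get() refreshes the entry's
// timestamp, so clearOldValues drops only entries that have been idle.
val modTimes = new TimeStampedHashMap[String, Long](updateTimeStampOnGet = true)
modTimes("part-00000") = 1389398400000L   // e.g. a cached file modification time
modTimes.get("part-00000")                // read access refreshes the timestamp
// Remove entries not accessed in the last minute:
modTimes.clearOldValues(System.currentTimeMillis() - 60 * 1000)
```

This is what makes the FileInputDStream mod-time cache in point 5 self-cleaning: entries for files that are no longer being looked up age out on their own.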