[SPARK-11740][STREAMING] Fix the race condition of two checkpoints in a batch

We write a checkpoint both when generating a batch and when completing a batch. When the processing time of a batch is greater than the batch interval, the checkpoint for completing an old batch may run after the checkpoint for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we always recover from the latest checkpoint file.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9707 from zsxwing/fix-checkpoint.
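The naming rule described in the message can be sketched in isolation. This is a simplified, single-threaded model and not the code in `Checkpoint.scala`: the `Time` case class, the `CheckpointNamer` class, and the `checkpoint-<millis>` name format are stand-ins introduced for illustration, and no file system access or retry logic is modeled.

```scala
// Hypothetical stand-in for org.apache.spark.streaming.Time; only ordering matters here.
final case class Time(millis: Long) {
  def <(that: Time): Boolean = millis < that.millis
}

// Simplified model of the writer: it only tracks the latest checkpoint time seen so far.
final class CheckpointNamer {
  private var latestCheckpointTime: Option[Time] = None

  /** Returns the file name that a checkpoint request for `checkpointTime` is written under. */
  def fileNameFor(checkpointTime: Time): String = {
    // Advance the latest time only when the request is newer than anything seen so far.
    if (latestCheckpointTime.forall(_ < checkpointTime)) {
      latestCheckpointTime = Some(checkpointTime)
    }
    // Illustrative name format: always derived from the latest time, not the request's own time.
    s"checkpoint-${latestCheckpointTime.get.millis}"
  }
}

object CheckpointNamerDemo {
  def main(args: Array[String]): Unit = {
    val namer = new CheckpointNamer
    println(namer.fileNameFor(Time(1000))) // generating batch 1000
    println(namer.fileNameFor(Time(2000))) // generating batch 2000
    println(namer.fileNameFor(Time(1000))) // completing batch 1000, which arrives late
  }
}
```

In this sketch the late request for batch 1000 is written under the name for time 2000, so a recovery step that picks the file with the greatest time would read the data written last.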
author: Shixiong Zhu <shixiong@databricks.com> 2015-11-17 14:48:29 -0800
committer: Tathagata Das <tathagata.das1565@gmail.com> 2015-11-17 14:48:29 -0800
commit: 928d631625297857fb6998fbeb0696917fbfd60f (patch)
tree: 16c2446b3381accc01b58d577db60fed24fe387e /streaming/src/main
parent: 936bc0bcbf957fa1d7cb5cfe88d628c830df5981 (diff)
download: spark-928d631625297857fb6998fbeb0696917fbfd60f.tar.gz
spark-928d631625297857fb6998fbeb0696917fbfd60f.tar.bz2
spark-928d631625297857fb6998fbeb0696917fbfd60f.zip
1 files changed, 16 insertions, 2 deletions
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala b/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
index 0cd55d9aec..fd0e8d5d69 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
@@ -187,16 +187,30 @@ class CheckpointWriter(
   private var stopped = false
   private var fs_ : FileSystem = _
 
+  @volatile private var latestCheckpointTime: Time = null
+
   class CheckpointWriteHandler(
       checkpointTime: Time,
       bytes: Array[Byte],
       clearCheckpointDataLater: Boolean) extends Runnable {
     def run() {
+      if (latestCheckpointTime == null || latestCheckpointTime < checkpointTime) {
+        latestCheckpointTime = checkpointTime
+      }
       var attempts = 0
       val startTime = System.currentTimeMillis()
       val tempFile = new Path(checkpointDir, "temp")
-      val checkpointFile = Checkpoint.checkpointFile(checkpointDir, checkpointTime)
-      val backupFile = Checkpoint.checkpointBackupFile(checkpointDir, checkpointTime)
+      // A checkpoint is written both when a batch is generated and when it completes. When the
+      // processing time of a batch exceeds the batch interval, the checkpoint for completing an
+      // old batch may run after the checkpoint for generating a new batch. In that case the
+      // old batch's checkpoint actually holds the latest information, so we want to recover
+      // from it. Therefore, we use the latest checkpoint time as the file name, so that we
+      // always recover from the latest checkpoint file.
+      //
+      // Note: there is only one thread writing the checkpoint files, so we don't need to worry
+      // about thread-safety.
+      val checkpointFile = Checkpoint.checkpointFile(checkpointDir, latestCheckpointTime)
+      val backupFile = Checkpoint.checkpointBackupFile(checkpointDir, latestCheckpointTime)
 
       while (attempts < MAX_ATTEMPTS && !stopped) {
         attempts += 1