[SPARK-15262] Synchronize block manager / scheduler executor state

## What changes were proposed in this pull request? If an executor is still alive even after the scheduler has removed its metadata, we may receive a heartbeat from that executor and tell its block manager to reregister itself. If that happens, the block manager master will know about the executor, but the scheduler will not. That is a dangerous situation, because when the executor does get disconnected later, the scheduler will not ask the block manager to also remove metadata for that executor. Later, when we try to clean up an RDD or a broadcast variable, we may try to send a message to that executor, triggering an exception. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #13055 from andrewor14/block-manager-remove.
author: Andrew Or <andrew@databricks.com> 2016-05-11 13:36:58 -0700
committer: Shixiong Zhu <shixiong@databricks.com> 2016-05-11 13:36:58 -0700
commit: 40a949aae9c3040019a52482d091912a85b0f4d4 (patch)
tree: 78c59d6b281b0dfc2f1bbc3f60ecf179680493c9 /core
parent: 7ecd496884f6f126ab186b9ceaa861a571d6155c (diff)
download: spark-40a949aae9c3040019a52482d091912a85b0f4d4.tar.gz
spark-40a949aae9c3040019a52482d091912a85b0f4d4.tar.bz2
spark-40a949aae9c3040019a52482d091912a85b0f4d4.zip
1 files changed, 8 insertions, 1 deletions
diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
index 8896391f97..0fea9c123b 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
@@ -289,7 +289,14 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
           scheduler.executorLost(executorId, if (killed) ExecutorKilled else reason)
           listenerBus.post(
             SparkListenerExecutorRemoved(System.currentTimeMillis(), executorId, reason.toString))
-        case None => logInfo(s"Asked to remove non-existent executor $executorId")
+        case None =>
+          // SPARK-15262: If an executor is still alive even after the scheduler has removed
+          // its metadata, we may receive a heartbeat from that executor and tell its block
+          // manager to reregister itself. If that happens, the block manager master will know
+          // about the executor, but the scheduler will not. Therefore, we should remove the
+          // executor from the block manager when we hit this case.
+          scheduler.sc.env.blockManager.master.removeExecutor(executorId)
+          logInfo(s"Asked to remove non-existent executor $executorId")
       }
     }
author	Andrew Or <andrew@databricks.com>	2016-05-11 13:36:58 -0700
committer	Shixiong Zhu <shixiong@databricks.com>	2016-05-11 13:36:58 -0700
commit	40a949aae9c3040019a52482d091912a85b0f4d4 (patch)
tree	78c59d6b281b0dfc2f1bbc3f60ecf179680493c9 /core
parent	7ecd496884f6f126ab186b9ceaa861a571d6155c (diff)
download	spark-40a949aae9c3040019a52482d091912a85b0f4d4.tar.gz spark-40a949aae9c3040019a52482d091912a85b0f4d4.tar.bz2 spark-40a949aae9c3040019a52482d091912a85b0f4d4.zip