author     jerryshao <sshao@hortonworks.com>      2015-11-16 11:43:18 -0800
committer  Marcelo Vanzin <vanzin@cloudera.com>   2015-11-16 11:43:18 -0800
commit     24477d2705bcf2a851acc241deb8376c5450dc73 (patch)
tree       ce7a35acdc88edc37dba68796aa7e40385e77264 /core
parent     ace0db47141ffd457c2091751038fc291f6d5a8b (diff)
[SPARK-11718][YARN][CORE] Fix issue where explicitly killed executors die silently
Currently, if dynamic allocation is enabled, explicitly killing an executor gets no response, so the executor metadata on the driver side becomes wrong, which makes dynamic allocation on YARN fail to work. The problem is that `disableExecutor` returns false for pending-kill executors when `onDisconnect` is detected, so no further handling is done.

One solution is to let these explicitly killed executors fall through to `super.onDisconnect` to remove the executor; this is simple. Another solution is to still query the loss reason for these explicitly killed executors. Since an executor may be killed and reported in the same AM-RM communication, the current approach of adding a pending loss-reason request does not work (the container-complete event has already been processed), so we should instead store the loss reason for a later query. This PR chooses solution 2.

Please help to review. vanzin I think this part was changed by you previously, would you please help to review? Thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #9684 from jerryshao/SPARK-11718.
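To make solution 2 concrete, here is a minimal, hypothetical Scala sketch of the "store the loss reason for a later query" idea. The names (`LossReasonStore`, `ExecutorLossReason`, `recordLoss`, `queryLossReason`) are illustrative and are not the actual `YarnAllocator` API:

```scala
import scala.collection.mutable

// Illustrative stand-in for Spark's executor loss reason; hypothetical.
case class ExecutorLossReason(message: String)

class LossReasonStore {
  // Loss reasons recorded when the container-complete event is processed,
  // possibly in the same AM-RM heartbeat that killed the executor.
  private val storedReasons = new mutable.HashMap[String, ExecutorLossReason]

  // Called while processing a completed container for an explicitly
  // killed executor: keep the reason around instead of dropping it.
  def recordLoss(executorId: String, reason: ExecutorLossReason): Unit =
    synchronized { storedReasons(executorId) = reason }

  // Called when the driver later asks why the executor was lost. Because
  // the reason was stored, the query can be answered even though the
  // container-complete event was already processed.
  def queryLossReason(executorId: String): Option[ExecutorLossReason] =
    synchronized { storedReasons.remove(executorId) }
}
```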
Diffstat (limited to 'core')
-rw-r--r--  core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala                     | 1 +
-rw-r--r--  core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala | 6 ++++--
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
index 43d7d80b7a..5f136690f4 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
@@ -473,6 +473,7 @@ private[spark] class TaskSchedulerImpl(
// If the host mapping still exists, it means we don't know the loss reason for the
// executor. So call removeExecutor() to update tasks running on that executor when
// the real loss reason is finally known.
+ logError(s"Actual reason for lost executor $executorId: ${reason.message}")
removeExecutor(executorId, reason)
case None =>
diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
index f71d98feac..3373caf0d1 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
@@ -269,7 +269,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
* Stop making resource offers for the given executor. The executor is marked as lost with
* the loss reason still pending.
*
- * @return Whether executor was alive.
+ * @return Whether the executor should be disabled.
*/
protected def disableExecutor(executorId: String): Boolean = {
val shouldDisable = CoarseGrainedSchedulerBackend.this.synchronized {
@@ -277,7 +277,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
executorsPendingLossReason += executorId
true
} else {
- false
+ // Return true for explicitly killed executors, since we also need to query their
+ // pending loss reasons; for all others, return false.
+ executorsPendingToRemove.contains(executorId)
}
}
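As a usage illustration, the sketch below shows how the changed return value might be consumed on disconnect: with the fix, explicitly killed executors also report true, so their loss reason is still queried instead of the disconnect being silently ignored. The class and the `queryLossReasonFromAM` hook are hypothetical stand-ins, not the actual `CoarseGrainedSchedulerBackend` code:

```scala
import scala.collection.mutable

// Hypothetical, self-contained mirror of the patched disableExecutor logic.
class DisconnectHandler {
  private val executorDataMap = new mutable.HashMap[String, AnyRef]
  private val executorsPendingLossReason = new mutable.HashSet[String]
  private val executorsPendingToRemove = new mutable.HashSet[String]

  def disableExecutor(executorId: String): Boolean = synchronized {
    if (executorDataMap.contains(executorId)) {
      executorsPendingLossReason += executorId
      true
    } else {
      // With the fix, explicitly killed executors also return true here.
      executorsPendingToRemove.contains(executorId)
    }
  }

  // Hypothetical hook standing in for asking the AM for the real reason.
  private def queryLossReasonFromAM(executorId: String): Unit =
    println(s"querying loss reason for $executorId")

  def onDisconnect(executorId: String): Unit = {
    if (disableExecutor(executorId)) {
      // Before the fix this branch was skipped for explicitly killed
      // executors, leaving stale executor metadata on the driver.
      queryLossReasonFromAM(executorId)
    }
  }
}
```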