author Shixiong Zhu <shixiong@databricks.com> 2016-03-10 16:59:14 -0800
committer Andrew Or <andrew@databricks.com> 2016-03-10 16:59:14 -0800
commit 27fe6bacc532184ef6e8a2a24cd07f2c9188004e
tree 09325bc31c8c8d25dd052bf6b8e362926e96d3bc /R
parent 020ff8cd34b74de31e878082b8e18005f61f1f77
[SPARK-13604][CORE] Sync worker's state after registering with master
## What changes were proposed in this pull request?

The following lists every case in which Master cannot talk to Worker for a while and the network then comes back.

1. Master doesn't know about the network issue (the Worker has not yet timed out)

   a. Worker doesn't know about the network issue (onDisconnected is not called)
      - Worker keeps sending Heartbeat. Neither Worker nor Master knows about the network issue. Nothing to do. (Eventually, Master will notice the heartbeat timeout if the network does not recover.)

   b. Worker knows about the network issue (onDisconnected is called)
      - Worker stops sending Heartbeat and sends `RegisterWorker` to master. Master will reply `RegisterWorkerFailed("Duplicate worker ID")`. Worker calls "System.exit(1)". (Eventually, Master will notice the heartbeat timeout if the network does not recover.) (May leak driver processes. See [SPARK-13602](https://issues.apache.org/jira/browse/SPARK-13602))

2. Worker timeout (Master knows about the network issue). In this case, master removes Worker and its executors and drivers.

   a. Worker doesn't know about the network issue (onDisconnected is not called)
      - Worker keeps sending Heartbeat.
      - If the network is back, i.e. Master receives a Heartbeat, Master sends `ReconnectWorker` to Worker.
      - Worker sends `RegisterWorker` to master.
      - Master accepts `RegisterWorker` but doesn't know about the executors and drivers in Worker. (may leak executors)

   b. Worker knows about the network issue (onDisconnected is called)
      - Worker stops sending `Heartbeat`. Worker will send `RegisterWorker` to master.
      - Master accepts `RegisterWorker` but doesn't know about the executors and drivers in Worker. (may leak executors)

This PR fixes the executor and driver leaks in 2.a and 2.b when Worker reregisters with Master. The approach is to make Worker send `WorkerLatestState` to sync its state after registering with master successfully. Master will then ask Worker to kill unknown executors and drivers.

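
The sync step described above can be modelled as a set difference computed on the Master side. The following is a minimal sketch in Python, not the Scala code in this patch: `ExecutorId`, `MasterView`, and `reconcile` are hypothetical names, and the fields given to `WorkerLatestState` here are assumptions based only on the description (a worker ID plus the executors and drivers the Worker currently runs).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExecutorId:
    """Hypothetical key for an executor: application ID plus executor ID."""
    app_id: str
    exec_id: int


@dataclass
class WorkerLatestState:
    """State a worker reports after registering (field layout is assumed)."""
    worker_id: str
    executors: list   # list of ExecutorId currently running on the worker
    driver_ids: list  # list of driver ID strings currently running on the worker


@dataclass
class MasterView:
    """Hypothetical record of what the master believes one worker is running."""
    executors: set = field(default_factory=set)
    driver_ids: set = field(default_factory=set)


def reconcile(master_view, state):
    """Return (executors_to_kill, drivers_to_kill) for one worker.

    Anything the worker reports that the master does not know about is
    treated as orphaned. Anything the master does know about, including an
    executor it launched while registration was still in flight, is kept.
    """
    unknown_execs = [e for e in state.executors if e not in master_view.executors]
    unknown_drivers = [d for d in state.driver_ids if d not in master_view.driver_ids]
    return unknown_execs, unknown_drivers
```

Because the decision is made against the Master's own bookkeeping, an executor that the Master launched on the Worker just before the Worker processed its registration reply appears in both sets and is left alone, while executors and drivers left over from before the timeout are returned for killing.
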
Note: Worker cannot simply kill its executors after registering with master, because in the worker `LaunchExecutor` and `RegisteredWorker` are processed in two threads. If `LaunchExecutor` is processed before `RegisteredWorker`, Worker's executor list will already contain new executors launched after Master accepted `RegisterWorker`, and these executors must not be killed. Worker therefore sends its list to Master and lets Master tell it which executors should be killed.

## How was this patch tested?

test("SPARK-13604: Master should ask Worker kill unknown executors and drivers")

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11455 from zsxwing/orphan-executors.
Diffstat (limited to 'R')
0 files changed, 0 insertions, 0 deletions