author Andrew Or <andrew@databricks.com> 2014-11-25 15:46:26 -0800
committer Andrew Or <andrew@databricks.com> 2014-11-25 15:46:26 -0800
commit 1b2ab1cd1b7cab9076f3c511188a610eda935701
tree 1a4503dd21aeec0976670dc89a51c11881e923c3 /docs/ec2-scripts.md
parent 8838ad7c135a585cde015dc38b5cb23314502dd9
[SPARK-4592] Avoid duplicate worker registrations in standalone mode
**Summary.** On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This was introduced by commit https://github.com/apache/spark/commit/4afe9a4852ebeb4cc77322a14225cd3dec165f3f, which adds logic for the worker to re-register with the master in case of failures. However, the following race condition may occur:

(1) Master A fails and Worker attempts to reconnect to all masters
(2) Master B takes over and notifies Worker
(3) Worker responds by registering with Master B
(4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice

**Fix.** Instead of attempting to register with all known masters, the worker should re-register only with the one it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one.

**Caveat.** Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and other potential race conditions, summarized in [SPARK-4592](https://issues.apache.org/jira/browse/SPARK-4592), are much less likely than the one described above, which is deterministically reproducible.

Author: Andrew Or <andrew@databricks.com>

Closes #3447 from andrewor14/standalone-failover and squashes the following commits:

0d9716c [Andrew Or] Move re-registration logic to actor for thread-safety
79286dc [Andrew Or] Preserve old behavior for initial retries
83b321c [Andrew Or] Tweak wording
1fce6a9 [Andrew Or] Active master actor could be null in the beginning
b6f269e [Andrew Or] Avoid duplicate worker registrations
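
The race in steps (1) to (4) and the effect of the fix can be sketched as a toy simulation. This is not Spark's code: the real logic lives in the Scala worker and master actors, and the names below (`Master`, `reregistration_targets`, `simulate_failover`, the `only_active` flag) are hypothetical, introduced only for illustration. Falling back to all masters when no active master is known yet is an assumption, loosely based on the squashed commits about initial retries and a null active master.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Toy master that counts a second registration from the same worker id."""
    name: str
    alive: bool = True
    registered: set = field(default_factory=set)
    duplicates: int = 0

    def register(self, worker_id: str) -> bool:
        if not self.alive:
            return False  # message is lost: a dead master never handles it
        if worker_id in self.registered:
            self.duplicates += 1  # in the commit's scenario the worker then exits
            return False
        self.registered.add(worker_id)
        return True


def reregistration_targets(all_masters, active_master, only_active):
    """Masters a worker contacts when it retries after losing its connection."""
    if only_active and active_master is not None:
        return [active_master]   # fixed behavior: only the master it was talking to
    return list(all_masters)     # old behavior (assumed fallback when none is known)


def simulate_failover(only_active: bool) -> int:
    """Replay steps (1)-(4) and return how many duplicates master B observed."""
    a, b = Master("A"), Master("B")
    worker_id = "worker-1"
    a.register(worker_id)        # the worker initially talks to master A
    active = a

    a.alive = False              # (1) master A fails; the worker schedules a retry
    pending = reregistration_targets([a, b], active, only_active)

    active = b                   # (2) master B takes over and notifies the worker
    b.register(worker_id)        # (3) the worker registers with master B

    for master in pending:       # (4) the earlier retry is delivered late
        master.register(worker_id)
    return b.duplicates
```

With `only_active=False` the late retry reaches master B and is counted as a duplicate; with `only_active=True` the retry goes only to the dead master A, so master B sees a single registration. The caveat in the commit message (master A recovering immediately) is not modeled here.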
Diffstat (limited to 'docs/ec2-scripts.md')
0 files changed, 0 insertions, 0 deletions