Standalone Scheduler fault recovery - spark


author	Aaron Davidson <aaron@databricks.com>	2013-09-17 09:40:06 -0700
committer	Aaron Davidson <aaron@databricks.com>	2013-09-26 14:59:35 -0700
commit	d5a96feccb15dd290b282af9e2f94479c8e4554e (patch)
tree	55146010e613178553ff6fd1bc35e5d4d53addcf /project/SparkBuild.scala
parent	13eced723f222095ea4b52c4f6cb078cae66342e (diff)

Standalone Scheduler fault recovery

Implements a basic form of Standalone Scheduler fault recovery. In particular, a fault can now be recovered from manually by restarting the Master process on the same machine. This is the majority of the code necessary for general fault tolerance, which will first elect a leader and then recover the Master state.

To enable fault recovery, the Master persists to disk a small amount of state related to the registration of Workers and Applications. If the Master starts and finds that this state still exists, it enters Recovery mode, during which it does not schedule any new Executors on Workers (though it still accepts the registration of new Clients and Workers).

While in Recovery mode, the Master attempts to reconnect to all Workers and Client applications that were registered at the time of failure. After confirming either the existence or nonexistence of all such nodes (within a certain timeout), the Master exits Recovery mode and resumes normal scheduling.
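
The recovery flow described above can be sketched as a small state machine. This is an illustrative, language-neutral toy model in Python, not the code in this commit: the class name `Master`, the JSON state file, and the methods `register`, `confirm`, `timeout` and `can_schedule` are all hypothetical names chosen for the sketch.

```python
import json
import os
from enum import Enum


class RecoveryState(Enum):
    ALIVE = "alive"            # normal scheduling
    RECOVERING = "recovering"  # waiting to hear back from previously known nodes


class Master:
    """Toy model of the Master's recovery behaviour (all names hypothetical)."""

    def __init__(self, state_path):
        self.state_path = state_path
        self.registered = set()   # nodes confirmed or registered in this run
        self.unconfirmed = set()  # persisted nodes not yet heard from
        self.state = RecoveryState.ALIVE
        if os.path.exists(state_path):
            # Leftover state on disk: a previous Master ran here, so recover.
            with open(state_path) as f:
                self.unconfirmed = set(json.load(f))
            self.state = RecoveryState.RECOVERING

    def _persist(self):
        # Persist every node we still consider registered.
        with open(self.state_path, "w") as f:
            json.dump(sorted(self.registered | self.unconfirmed), f)

    def register(self, node_id):
        # New registrations are accepted even while recovering.
        self.registered.add(node_id)
        self._persist()

    def confirm(self, node_id):
        # A previously registered node answered the reconnect attempt.
        if node_id in self.unconfirmed:
            self.unconfirmed.discard(node_id)
            self.registered.add(node_id)
        if self.state is RecoveryState.RECOVERING and not self.unconfirmed:
            self._finish_recovery()

    def timeout(self):
        # Nodes that never answered are treated as gone.
        if self.state is RecoveryState.RECOVERING:
            self.unconfirmed.clear()
            self._finish_recovery()

    def _finish_recovery(self):
        self.state = RecoveryState.ALIVE
        self._persist()

    def can_schedule(self):
        # No new Executors are scheduled while in Recovery mode.
        return self.state is RecoveryState.ALIVE
```

In this sketch, constructing a second `Master` on the same state path stands in for restarting the Master process on the same machine. Scheduling stays disabled until every persisted node has either been confirmed or dropped by the timeout.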

Diffstat (limited to 'project/SparkBuild.scala')

0 files changed, 0 insertions, 0 deletions

