author: Imran Rashid <irashid@cloudera.com>	2016-06-30 13:36:06 -0500
committer: Imran Rashid <irashid@cloudera.com>	2016-06-30 13:36:06 -0500
commit: fdf9f94f8c8861a00cd8415073f842b857c397f7 (patch)
tree: 43a498dc10b47c2355b5e71c994e9de5b611ff36 /external
parent: 07f46afc733b1718d528a6ea5c0d774f047024fa (diff)
[SPARK-15865][CORE] Blacklist should not result in job hanging with less than 4 executors
## What changes were proposed in this pull request?

Before this change, when you turn on blacklisting with `spark.scheduler.executorTaskBlacklistTime` but have fewer than `spark.task.maxFailures` executors, a job can end up "hung" after some task failures.

Whenever a taskset is unable to schedule anything on resourceOfferSingleTaskSet, we check whether the last pending task can be scheduled on *any* known executor. If not, the taskset (and any corresponding jobs) is failed; see the sketch after this list.

* Worst case, this check is O(maxTaskFailures + numTasks). But unless many executors are bad, this should be small.
* This does not fail as fast as possible -- when a task becomes unschedulable, we keep scheduling other tasks. This is to avoid an O(numPendingTasks * numExecutors) operation.
* Also, it is conceivable this fails too quickly. You may be 1 millisecond away from unblacklisting a place for a task to run, or from acquiring a new executor.

## How was this patch tested?

Added a unit test which failed before the change, ran the new test 5k times manually, ran all scheduler tests manually, and ran the full suite via Jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13603 from squito/progress_w_few_execs_and_blacklist.
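To make the check concrete, here is a minimal, self-contained Scala sketch of the idea described above. This is not the actual Spark scheduler code: the names (`Task`, `TaskSetSketch`, `isBlacklistedFor`, `findUnschedulableTask`) and the blacklist representation are illustrative assumptions.

```scala
// Hypothetical sketch of the "completely blacklisted" check described above.
// The blacklist maps executor id -> set of task ids that may not run there.
case class Task(id: Int)

class TaskSetSketch(pendingTasks: Seq[Task], blacklist: Map[String, Set[Int]]) {

  // True if `task` may not run on executor `exec` under the current blacklist.
  private def isBlacklistedFor(exec: String, task: Task): Boolean =
    blacklist.getOrElse(exec, Set.empty).contains(task.id)

  // Called after a round of resource offers schedules nothing: check whether
  // the last pending task is blacklisted on *every* known executor. Checking
  // a single task keeps this cheap; checking every pending task against every
  // executor would be O(numPendingTasks * numExecutors).
  def findUnschedulableTask(knownExecutors: Seq[String]): Option[Task] =
    pendingTasks.lastOption.filter { task =>
      knownExecutors.nonEmpty &&
      knownExecutors.forall(exec => isBlacklistedFor(exec, task))
    }
}

object Demo extends App {
  // Two executors, both blacklisted for task 0: the task set should be failed.
  val ts = new TaskSetSketch(
    pendingTasks = Seq(Task(0)),
    blacklist = Map("exec-1" -> Set(0), "exec-2" -> Set(0)))
  println(ts.findUnschedulableTask(Seq("exec-1", "exec-2"))) // Some(Task(0))
}
```

Checking only the last pending task per failed offer round mirrors the trade-off noted in the bullets above: detection of a hung taskset is slightly delayed, but the scheduler avoids a full scan of all pending tasks on every offer.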
Diffstat (limited to 'external')
0 files changed, 0 insertions, 0 deletions