author: Imran Rashid <irashid@cloudera.com>	2016-06-30 13:36:06 -0500
committer: Imran Rashid <irashid@cloudera.com>	2016-06-30 13:36:06 -0500
commit: fdf9f94f8c8861a00cd8415073f842b857c397f7 (patch)
tree: 43a498dc10b47c2355b5e71c994e9de5b611ff36 /external
parent: 07f46afc733b1718d528a6ea5c0d774f047024fa (diff)
[SPARK-15865][CORE] Blacklist should not result in job hanging with less than 4 executors
## What changes were proposed in this pull request?

Before this change, when you turn on blacklisting with `spark.scheduler.executorTaskBlacklistTime` but have fewer than `spark.task.maxFailures` executors, a job can end up "hung" after some task failures.

Whenever a taskset is unable to schedule anything on resourceOfferSingleTaskSet, we check whether the last pending task can be scheduled on *any* known executor. If not, the taskset (and any corresponding jobs) is failed; see the sketch after this list.

* Worst case, this check is O(maxTaskFailures + numTasks). But unless many executors are bad, this should be small.
* This does not fail as fast as possible -- when a task becomes unschedulable, we keep scheduling other tasks. This is to avoid an O(numPendingTasks * numExecutors) operation.
* Also, it is conceivable this fails too quickly. You may be 1 millisecond away from unblacklisting a place for a task to run, or from acquiring a new executor.

## How was this patch tested?

Added a unit test which failed before the change, ran the new test 5k times manually, ran all scheduler tests manually, and ran the full suite via Jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13603 from squito/progress_w_few_execs_and_blacklist.
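To make the check concrete, here is a minimal, self-contained Scala sketch of the idea described above. This is not the actual Spark scheduler code: the names (`Task`, `TaskSetSketch`, `isBlacklistedFor`, `findUnschedulableTask`) and the blacklist representation are illustrative assumptions.

```scala
// Hypothetical sketch of the "completely blacklisted" check described above.
// The blacklist maps executor id -> set of task ids that may not run there.
case class Task(id: Int)

class TaskSetSketch(pendingTasks: Seq[Task], blacklist: Map[String, Set[Int]]) {

  // True if `task` may not run on executor `exec` under the current blacklist.
  private def isBlacklistedFor(exec: String, task: Task): Boolean =
    blacklist.getOrElse(exec, Set.empty).contains(task.id)

  // Called after a round of resource offers schedules nothing: check whether
  // the last pending task is blacklisted on *every* known executor. Checking
  // a single task keeps this cheap; checking every pending task against every
  // executor would be O(numPendingTasks * numExecutors).
  def findUnschedulableTask(knownExecutors: Seq[String]): Option[Task] =
    pendingTasks.lastOption.filter { task =>
      knownExecutors.nonEmpty &&
      knownExecutors.forall(exec => isBlacklistedFor(exec, task))
    }
}

object Demo extends App {
  // Two executors, both blacklisted for task 0: the task set should be failed.
  val ts = new TaskSetSketch(
    pendingTasks = Seq(Task(0)),
    blacklist = Map("exec-1" -> Set(0), "exec-2" -> Set(0)))
  println(ts.findUnschedulableTask(Seq("exec-1", "exec-2"))) // Some(Task(0))
}
```

Checking only the last pending task per failed offer round mirrors the trade-off noted in the bullets above: detection of a hung taskset is slightly delayed, but the scheduler avoids a full scan of all pending tasks on every offer.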
Diffstat (limited to 'external')
0 files changed, 0 insertions, 0 deletions