[SPARK-17304] Fix perf. issue caused by TaskSetManager.abortIfCompletelyBlacklisted - spark


author	Josh Rosen <joshrosen@databricks.com>	2016-08-30 13:15:21 -0700
committer	Josh Rosen <joshrosen@databricks.com>	2016-08-30 13:15:21 -0700
commit	fb20084313470593d8507a43fcb2cde2a4c854d9 (patch)
tree	73b6c2d425a3addf965289604d70d2d92a737139 /common
parent	4b4e329e49f8af28fa6301bd06c48d7097eaf9e6 (diff)

[SPARK-17304] Fix perf. issue caused by TaskSetManager.abortIfCompletelyBlacklisted

This patch addresses a minor scheduler performance issue that was introduced in #13603. If you run

```
sc.parallelize(1 to 100000, 100000).map(identity).count()
```

then most of the time ends up being spent in `TaskSetManager.abortIfCompletelyBlacklisted()`:

![image](https://cloud.githubusercontent.com/assets/50748/18071032/428732b0-6e07-11e6-88b2-c9423cd61f53.png)

When processing resource offers, the scheduler uses a nested loop which considers every task set at multiple locality levels:

```scala
for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
  do {
    launchedTask = resourceOfferSingleTaskSet(
      taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
  } while (launchedTask)
}
```

In order to prevent jobs with globally blacklisted tasks from hanging, #13603 added a `taskSet.abortIfCompletelyBlacklisted` call inside of `resourceOfferSingleTaskSet`; if a call to `resourceOfferSingleTaskSet` fails to schedule any tasks, then `abortIfCompletelyBlacklisted` checks whether the tasks are completely blacklisted in order to figure out whether they will ever be schedulable.

The problem with this placement of the call is that the last call to `resourceOfferSingleTaskSet` in the `while` loop will return `false`, implying that `resourceOfferSingleTaskSet` will call `abortIfCompletelyBlacklisted`, so almost every call to `resourceOffers` will trigger the `abortIfCompletelyBlacklisted` check for every task set.

Instead, I think that this call should be moved out of the innermost loop and should be called _at most_ once per task set in case none of the task set's tasks can be scheduled at any locality level.

Before this patch's changes, the microbenchmark example that I posted above took 35 seconds to run, but it now only takes 15 seconds after this change.

/cc squito and kayousterhout for review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14871 from JoshRosen/bail-early-if-no-cpus.
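
The effect of moving the `abortIfCompletelyBlacklisted` call out of the innermost loop can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual patch: `ToyTaskSet`, `BlacklistCheckPlacement`, `offerOnce`, `offersBefore`, and `offersAfter` are hypothetical names, the scheduler is reduced to a CPU counter, and the blacklist check merely counts its own invocations.

```scala
// Toy stand-in for a task set. This is NOT Spark's TaskSetManager; it only
// counts how often the (here trivial) blacklist check is invoked.
final class ToyTaskSet(val localityLevels: Seq[String], var pending: Int) {
  var blacklistChecks = 0
  def abortIfCompletelyBlacklisted(): Unit = blacklistChecks += 1
}

object BlacklistCheckPlacement {
  // Hypothetical analogue of resourceOfferSingleTaskSet: launches at most one
  // task and returns whether it did. `free` is a one-element CPU counter.
  private def offerOnce(ts: ToyTaskSet, free: Array[Int], checkInside: Boolean): Boolean = {
    val launched = ts.pending > 0 && free(0) > 0
    if (launched) {
      ts.pending -= 1
      free(0) -= 1
    } else if (checkInside) {
      // Old placement: every failed offer triggers the check.
      ts.abortIfCompletelyBlacklisted()
    }
    launched
  }

  // Placement described as problematic: the check runs inside the single-offer
  // call, so the final (failing) iteration at each locality level triggers it.
  def offersBefore(taskSets: Seq[ToyTaskSet], cpus: Int): Unit = {
    val free = Array(cpus)
    for (ts <- taskSets; _ <- ts.localityLevels) {
      var launched = true
      while (launched) {
        launched = offerOnce(ts, free, checkInside = true)
      }
    }
  }

  // Proposed placement: remember whether anything launched at any locality
  // level, and run the check at most once per task set.
  def offersAfter(taskSets: Seq[ToyTaskSet], cpus: Int): Unit = {
    val free = Array(cpus)
    for (ts <- taskSets) {
      var launchedAny = false
      for (_ <- ts.localityLevels) {
        var launched = true
        while (launched) {
          launched = offerOnce(ts, free, checkInside = false)
          launchedAny = launchedAny || launched
        }
      }
      if (!launchedAny) ts.abortIfCompletelyBlacklisted()
    }
  }
}
```

In this model, a task set with two locality levels, three pending tasks, and two free CPUs incurs two checks under `offersBefore` (one per locality level) and none under `offersAfter`; with zero free CPUs, `offersAfter` performs exactly one check.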


