author | Imran Rashid <irashid@cloudera.com> | 2016-10-12 16:43:03 -0500 |
---|---|---|
committer | Imran Rashid <irashid@cloudera.com> | 2016-10-12 16:43:03 -0500 |
commit | 9ce7d3e542e786c62f047c13f3001e178f76e06a (patch) | |
tree | 6d43e48a1d969fb70347b8540b0bb50e4456b6d6 /docs/configuration.md | |
parent | 47776e7c0c68590fe446cef910900b1aaead06f9 (diff) | |
[SPARK-17675][CORE] Expand Blacklist for TaskSets
## What changes were proposed in this pull request?
This is a step along the way to SPARK-8425.
To enable incremental review, the first step proposed here is to expand the blacklisting within tasksets. In particular, this will enable blacklisting for:
* (task, executor) pairs (this already exists via an undocumented config)
* (task, node)
* (taskset, executor)
* (taskset, node)
Adding (task, node) is critical to making Spark tolerant of a single bad disk in a cluster without requiring careful tuning of "spark.task.maxFailures". The other additions are also important for avoiding the many misleading task failures and long scheduling delays that occur when there is one bad node on a large cluster.
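The four levels listed above can be illustrated with a toy model. This is a minimal sketch, not Spark's `TaskSetBlacklist`: the class and method names are hypothetical, the default limits are taken from the documentation added in this commit, and it assumes a limit is reached once a count is greater than or equal to the configured value.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TaskSetBlacklistSketch:
    """Toy model of blacklisting within one taskset (hypothetical, not Spark code)."""

    # Defaults mirror the values documented for the spark.blacklist.* options.
    max_task_attempts_per_executor: int = 1
    max_task_attempts_per_node: int = 2
    max_failed_tasks_per_executor: int = 2
    max_failed_executors_per_node: int = 2
    # failures[executor][task] -> number of failed attempts of that task
    failures: dict = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(int)))
    # executors_on_node[node] -> executors on that node with at least one failure
    executors_on_node: dict = field(default_factory=lambda: defaultdict(set))

    def record_failure(self, node, executor, task):
        self.failures[executor][task] += 1
        self.executors_on_node[node].add(executor)

    def task_blocked_on_executor(self, executor, task):
        # (task, executor): too many attempts of this task on this executor.
        attempts = self.failures.get(executor, {}).get(task, 0)
        return attempts >= self.max_task_attempts_per_executor

    def task_blocked_on_node(self, node, task):
        # (task, node): attempts of this task summed over the node's executors.
        attempts = sum(self.failures.get(e, {}).get(task, 0)
                       for e in self.executors_on_node.get(node, ()))
        return attempts >= self.max_task_attempts_per_node

    def executor_blocked_for_taskset(self, executor):
        # (taskset, executor): number of distinct tasks that failed here.
        distinct_tasks = len(self.failures.get(executor, {}))
        return distinct_tasks >= self.max_failed_tasks_per_executor

    def node_blocked_for_taskset(self, node):
        # (taskset, node): number of executors on the node blocked for the taskset.
        blocked = [e for e in self.executors_on_node.get(node, ())
                   if self.executor_blocked_for_taskset(e)]
        return len(blocked) >= self.max_failed_executors_per_node
```

In this model, a single failure of task 0 on one executor blocks that (task, executor) pair immediately, while the node stays usable for the task until a second attempt fails on any executor of the same node. That is the bad-disk scenario described above: the task moves to another node without "spark.task.maxFailures" needing to be raised.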
Note that some of the code changes here aren't strictly required for this step -- they put pieces in place for SPARK-8425 even though they are not used yet (e.g., the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs to for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be).
## How was this patch tested?
Added unit tests; ran tests via Jenkins.
Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>
Closes #15249 from squito/taskset_blacklist_only.
Diffstat (limited to 'docs/configuration.md')
-rw-r--r-- | docs/configuration.md | 43 |
1 files changed, 43 insertions, 0 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 82ce232b33..373e22d71a 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1246,6 +1246,49 @@ Apart from these, the following properties are also available, and may be useful
   </td>
 </tr>
 <tr>
+  <td><code>spark.blacklist.enabled</code></td>
+  <td>
+    false
+  </td>
+  <td>
+    If set to "true", prevent Spark from scheduling tasks on executors that have been blacklisted
+    due to too many task failures. The blacklisting algorithm can be further controlled by the
+    other "spark.blacklist" configuration options.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.blacklist.task.maxTaskAttemptsPerExecutor</code></td>
+  <td>1</td>
+  <td>
+    (Experimental) For a given task, how many times it can be retried on one executor before the
+    executor is blacklisted for that task.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.blacklist.task.maxTaskAttemptsPerNode</code></td>
+  <td>2</td>
+  <td>
+    (Experimental) For a given task, how many times it can be retried on one node, before the entire
+    node is blacklisted for that task.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.blacklist.stage.maxFailedTasksPerExecutor</code>
+  <td>2</td>
+  <td>
+    (Experimental) How many different tasks must fail on one executor, within one stage, before the
+    executor is blacklisted for that stage.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.blacklist.stage.maxFailedExecutorsPerNode</code></td>
+  <td>2</td>
+  <td>
+    (Experimental) How many different executors are marked as blacklisted for a given stage, before
+    the entire node is marked as failed for the stage.
+  </td>
+</tr>
+<tr>
   <td><code>spark.speculation</code></td>
   <td>false</td>
   <td>
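The properties documented in this diff could be passed to an application as `--conf key=value` flags on `spark-submit`. The helper below is a hypothetical sketch, not part of Spark: the function name is invented for illustration, and the property names and default values are copied from the diff above.

```python
def blacklist_conf_args(enabled=True, overrides=None):
    """Build spark-submit --conf arguments for the spark.blacklist.* options.

    Hypothetical helper for illustration; values default to those documented
    in docs/configuration.md by this commit.
    """
    settings = {
        "spark.blacklist.enabled": "true" if enabled else "false",
        "spark.blacklist.task.maxTaskAttemptsPerExecutor": "1",
        "spark.blacklist.task.maxTaskAttemptsPerNode": "2",
        "spark.blacklist.stage.maxFailedTasksPerExecutor": "2",
        "spark.blacklist.stage.maxFailedExecutorsPerNode": "2",
    }
    # Caller-supplied values replace the documented defaults.
    settings.update(overrides or {})
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print one flag pair per line, e.g. for pasting into a launch script.
    flags = blacklist_conf_args(
        overrides={"spark.blacklist.task.maxTaskAttemptsPerNode": "3"})
    for flag, setting in zip(flags[::2], flags[1::2]):
        print(flag, setting)
```

Since `spark.blacklist.enabled` defaults to false, the other four options would have no effect unless it is set explicitly, which is why the sketch always emits it.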