path: root/docs/configuration.md
author     Imran Rashid <irashid@cloudera.com>    2016-10-12 16:43:03 -0500
committer  Imran Rashid <irashid@cloudera.com>    2016-10-12 16:43:03 -0500
commit     9ce7d3e542e786c62f047c13f3001e178f76e06a (patch)
tree       6d43e48a1d969fb70347b8540b0bb50e4456b6d6 /docs/configuration.md
parent     47776e7c0c68590fe446cef910900b1aaead06f9 (diff)
[SPARK-17675][CORE] Expand Blacklist for TaskSets
## What changes were proposed in this pull request?

This is a step along the way to SPARK-8425. To enable incremental review, the first step proposed here is to expand the blacklisting within tasksets. In particular, this will enable blacklisting for

* (task, executor) pairs (this already exists via an undocumented config)
* (task, node)
* (taskset, executor)
* (taskset, node)

Adding (task, node) is critical to making Spark fault-tolerant of one bad disk in a cluster, without requiring careful tuning of `spark.task.maxFailures`. The other additions are also important to avoid many misleading task failures and long scheduling delays when there is one bad node on a large cluster.

Note that some of the code changes here aren't strictly required for this step -- they put pieces in place for SPARK-8425 even though they are not used yet (e.g. the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be).

## How was this patch tested?

Added unit tests; ran the tests via Jenkins.

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>

Closes #15249 from squito/taskset_blacklist_only.
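As an illustration for readers of this change (not part of the commit itself), the sketch below shows one way the new properties documented in the diff might be set from application code. The property names come from this patch; the threshold values and the `BlacklistConfigExample` object name are only examples, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BlacklistConfigExample {
  def main(args: Array[String]): Unit = {
    // Enable taskset-level blacklisting and tune the new thresholds.
    // Values below are illustrative only.
    val conf = new SparkConf()
      .setAppName("blacklist-config-example")
      .set("spark.blacklist.enabled", "true")
      // Retry a task at most once per executor before blacklisting that executor for the task.
      .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
      // Retry a task at most twice per node before blacklisting the whole node for the task.
      .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
      // Blacklist an executor for a stage after two distinct task failures within that stage.
      .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2")
      // Blacklist a node for a stage after two of its executors are blacklisted for that stage.
      .set("spark.blacklist.stage.maxFailedExecutorsPerNode", "2")

    val sc = new SparkContext(conf)
    try {
      // Trivial job, just to show that the configuration is picked up.
      println(sc.parallelize(1 to 100).sum())
    } finally {
      sc.stop()
    }
  }
}
```

The same settings can equally be supplied at submit time, e.g. `spark-submit --conf spark.blacklist.enabled=true ...`, since they are ordinary Spark configuration properties.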
Diffstat (limited to 'docs/configuration.md')
-rw-r--r--  docs/configuration.md  43
1 file changed, 43 insertions, 0 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 82ce232b33..373e22d71a 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1246,6 +1246,49 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.blacklist.enabled</code></td>
+ <td>
+ false
+ </td>
+ <td>
+ If set to "true", prevent Spark from scheduling tasks on executors that have been blacklisted
+ due to too many task failures. The blacklisting algorithm can be further controlled by the
+ other "spark.blacklist" configuration options.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.blacklist.task.maxTaskAttemptsPerExecutor</code></td>
+ <td>1</td>
+ <td>
+ (Experimental) For a given task, how many times it can be retried on one executor before the
+ executor is blacklisted for that task.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.blacklist.task.maxTaskAttemptsPerNode</code></td>
+ <td>2</td>
+ <td>
+ (Experimental) For a given task, how many times it can be retried on one node before the entire
+ node is blacklisted for that task.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.blacklist.stage.maxFailedTasksPerExecutor</code></td>
+ <td>2</td>
+ <td>
+ (Experimental) How many different tasks must fail on one executor, within one stage, before the
+ executor is blacklisted for that stage.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.blacklist.stage.maxFailedExecutorsPerNode</code></td>
+ <td>2</td>
+ <td>
+ (Experimental) How many different executors must be blacklisted for a given stage before
+ the entire node is blacklisted for that stage.
+ </td>
+</tr>
+<tr>
<td><code>spark.speculation</code></td>
<td>false</td>
<td>