author    Imran Rashid <irashid@cloudera.com>    2016-12-15 08:29:56 -0600
committer Imran Rashid <irashid@cloudera.com>    2016-12-15 08:29:56 -0600
commit    93cdb8a7d0f124b4db069fd8242207c82e263c52 (patch)
tree      c0f626664bfa6bad965b85a3cc54438bf15b4332 /docs
parent    7d858bc5ce870a28a559f4e81dcfc54cbd128cb7 (diff)
[SPARK-8425][CORE] Application Level Blacklisting
## What changes were proposed in this pull request?

This builds upon the blacklisting introduced in SPARK-17675 to add blacklisting of executors and nodes for an entire Spark application. Resources are blacklisted based on tasks that fail in task sets that eventually complete successfully; they are automatically returned to the pool of active resources after a timeout. Full details are available in a design doc attached to the JIRA.

## How was this patch tested?

Added unit tests and ran them via Jenkins; also ran a handful of them in a loop to check for flakiness. The added tests include:

- verifying BlacklistTracker works correctly
- verifying TaskSchedulerImpl interacts with BlacklistTracker correctly (via a mock BlacklistTracker)
- an integration test for the entire scheduler with blacklisting in a few different scenarios

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>

Closes #14079 from squito/blacklist-SPARK-8425.
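The configuration keys documented by this patch (see the diff below) are set like any other Spark property. The following is a minimal, hypothetical sketch of how an application might opt in to application-level blacklisting; only the `spark.blacklist.timeout` and `spark.blacklist.application.*` keys appear in this diff, while `spark.blacklist.enabled` is assumed here to be the existing switch from the earlier SPARK-17675 work.

```scala
// Hypothetical usage sketch, not part of this commit: configuring the new
// application-level blacklist properties on a SparkConf.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("blacklist-example")
  .set("spark.blacklist.enabled", "true")                            // assumed master switch (not in this diff)
  .set("spark.blacklist.timeout", "1h")                              // how long a blacklisted resource stays out of the pool
  .set("spark.blacklist.application.maxFailedTasksPerExecutor", "2") // distinct failed tasks before an executor is blacklisted
  .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2") // blacklisted executors before the whole node is blacklisted
```

The same keys could equally be passed on the command line, e.g. `--conf spark.blacklist.timeout=1h` with spark-submit.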
Diffstat (limited to 'docs')
-rw-r--r--  docs/configuration.md | 30
1 file changed, 30 insertions(+), 0 deletions(-)
diff --git a/docs/configuration.md b/docs/configuration.md
index 7e466d7dc1..07bcd4aa7f 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1316,6 +1316,14 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.blacklist.timeout</code></td>
+ <td>1h</td>
+ <td>
+ (Experimental) How long a node or executor is blacklisted for the entire application, before it
+ is unconditionally removed from the blacklist to attempt running new tasks.
+ </td>
+</tr>
+<tr>
<td><code>spark.blacklist.task.maxTaskAttemptsPerExecutor</code></td>
<td>1</td>
<td>
@@ -1348,6 +1356,28 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.blacklist.application.maxFailedTasksPerExecutor</code></td>
+ <td>2</td>
+ <td>
+ (Experimental) How many different tasks must fail on one executor, in successful task sets,
+ before the executor is blacklisted for the entire application. Blacklisted executors will
+ be automatically added back to the pool of available resources after the timeout specified by
+ <code>spark.blacklist.timeout</code>. Note that with dynamic allocation, though, the executors
+ may get marked as idle and be reclaimed by the cluster manager.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.blacklist.application.maxFailedExecutorsPerNode</code></td>
+ <td>2</td>
+ <td>
+ (Experimental) How many different executors must be blacklisted for the entire application,
+ before the node is blacklisted for the entire application. Blacklisted nodes will
+ be automatically added back to the pool of available resources after the timeout specified by
+ <code>spark.blacklist.timeout</code>. Note that with dynamic allocation, though, the executors
+ on the node may get marked as idle and be reclaimed by the cluster manager.
+ </td>
+</tr>
+<tr>
<td><code>spark.speculation</code></td>
<td>false</td>
<td>