author     Imran Rashid <irashid@cloudera.com>   2016-12-15 08:29:56 -0600
committer  Imran Rashid <irashid@cloudera.com>   2016-12-15 08:29:56 -0600
commit     93cdb8a7d0f124b4db069fd8242207c82e263c52
tree       c0f626664bfa6bad965b85a3cc54438bf15b4332 /docs
parent     7d858bc5ce870a28a559f4e81dcfc54cbd128cb7
# [SPARK-8425][CORE] Application Level Blacklisting
## What changes were proposed in this pull request?
This builds upon the blacklisting introduced in SPARK-17675 to add blacklisting of executors and nodes for an entire Spark application. Resources are blacklisted based on tasks that fail in task sets that eventually complete successfully; they are automatically returned to the pool of active resources after a timeout. Full details are available in a design doc attached to the JIRA.
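To make the mechanism concrete, here is a minimal, self-contained sketch of timeout-based application-level blacklisting in Scala. It is illustrative only: `ToyBlacklistTracker`, its methods, and its parameters are hypothetical names modeled on the semantics described above, not Spark's internal `BlacklistTracker` API.

```scala
import scala.collection.mutable

// Toy model (not Spark's implementation): count distinct failed tasks per
// executor, but only from task sets that ultimately succeed; once a threshold
// is crossed, blacklist the executor until a timeout unconditionally returns
// it to the pool.
class ToyBlacklistTracker(
    maxFailedTasksPerExecutor: Int = 2,
    timeoutMillis: Long = 60 * 60 * 1000L) { // 1h, mirroring spark.blacklist.timeout

  // Distinct failed task indices per executor, from successful task sets only.
  private val failedTasks = mutable.Map.empty[String, mutable.Set[Int]]
  // Executor -> time at which its blacklisting expires.
  private val blacklistExpiry = mutable.Map.empty[String, Long]

  // Called when a task set completes successfully: fold its failures into the
  // application-level counts and blacklist executors that crossed the threshold.
  def taskSetSucceeded(failuresByExecutor: Map[String, Set[Int]], now: Long): Unit = {
    for ((exec, tasks) <- failuresByExecutor) {
      val all = failedTasks.getOrElseUpdate(exec, mutable.Set.empty)
      all ++= tasks
      if (all.size >= maxFailedTasksPerExecutor) {
        blacklistExpiry(exec) = now + timeoutMillis
      }
    }
  }

  // Blacklisted until the timeout expires; expiry wipes the executor's state
  // so it can attempt new tasks, as described above.
  def isExecutorBlacklisted(exec: String, now: Long): Boolean =
    blacklistExpiry.get(exec) match {
      case Some(expiry) if now < expiry => true
      case Some(_) =>
        blacklistExpiry.remove(exec)
        failedTasks.remove(exec)
        false
      case None => false
    }
}
```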
## How was this patch tested?
Added unit tests and ran them via Jenkins; also ran a handful of them in a loop to check for flakiness.
The added tests include:
- verifying BlacklistTracker works correctly (a toy illustration follows this list)
- verifying TaskSchedulerImpl interacts with BlacklistTracker correctly (via a mock BlacklistTracker)
- an integration test for the entire scheduler with blacklisting in a few different scenarios
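As a rough illustration of the first bullet, the toy tracker sketched above could be exercised like this; it is an assumption-laden sketch, not Spark's actual test code:

```scala
// Two distinct task failures on exec-1 in a successful task set should
// blacklist it until the timeout expires.
val tracker = new ToyBlacklistTracker(maxFailedTasksPerExecutor = 2, timeoutMillis = 1000L)
tracker.taskSetSucceeded(Map("exec-1" -> Set(0, 7)), now = 0L)
assert(tracker.isExecutorBlacklisted("exec-1", now = 500L))   // still within the timeout
assert(!tracker.isExecutorBlacklisted("exec-1", now = 1500L)) // returned to the pool after expiry
```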
Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>
Closes #14079 from squito/blacklist-SPARK-8425.
Diffstat (limited to 'docs')
-rw-r--r-- | docs/configuration.md | 30
1 file changed, 30 insertions, 0 deletions
```diff
diff --git a/docs/configuration.md b/docs/configuration.md
index 7e466d7dc1..07bcd4aa7f 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1316,6 +1316,14 @@ Apart from these, the following properties are also available, and may be useful
   </td>
 </tr>
 <tr>
+  <td><code>spark.blacklist.timeout</code></td>
+  <td>1h</td>
+  <td>
+    (Experimental) How long a node or executor is blacklisted for the entire application, before it
+    is unconditionally removed from the blacklist to attempt running new tasks.
+  </td>
+</tr>
+<tr>
   <td><code>spark.blacklist.task.maxTaskAttemptsPerExecutor</code></td>
   <td>1</td>
   <td>
@@ -1348,6 +1356,28 @@ Apart from these, the following properties are also available, and may be useful
   </td>
 </tr>
 <tr>
+  <td><code>spark.blacklist.application.maxFailedTasksPerExecutor</code></td>
+  <td>2</td>
+  <td>
+    (Experimental) How many different tasks must fail on one executor, in successful task sets,
+    before the executor is blacklisted for the entire application. Blacklisted executors will
+    be automatically added back to the pool of available resources after the timeout specified by
+    <code>spark.blacklist.timeout</code>. Note that with dynamic allocation, though, the executors
+    may get marked as idle and be reclaimed by the cluster manager.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.blacklist.application.maxFailedExecutorsPerNode</code></td>
+  <td>2</td>
+  <td>
+    (Experimental) How many different executors must be blacklisted for the entire application,
+    before the node is blacklisted for the entire application. Blacklisted nodes will
+    be automatically added back to the pool of available resources after the timeout specified by
+    <code>spark.blacklist.timeout</code>. Note that with dynamic allocation, though, the executors
+    on the node may get marked as idle and be reclaimed by the cluster manager.
+  </td>
+</tr>
+<tr>
   <td><code>spark.speculation</code></td>
   <td>false</td>
   <td>
```
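For reference, a hypothetical application might enable the settings documented above as follows. `spark.blacklist.enabled` comes from the earlier SPARK-17675 work and is assumed here; the values shown are the documented defaults, not tuning advice.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup enabling the application-level blacklist properties
// documented in the diff above (values shown are the defaults).
val conf = new SparkConf()
  .setAppName("blacklist-demo")
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.timeout", "1h")
  .set("spark.blacklist.application.maxFailedTasksPerExecutor", "2")
  .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")
val sc = new SparkContext(conf)
```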