[SPARK-16505][YARN] Optionally propagate error during shuffle service startup.

When enabled, this prevents the NodeManager (NM) from starting if the shuffle service fails to initialize. Otherwise the failure only surfaces later, as errors that are confusing and harder to debug.

Added a unit test to verify that startup fails if something is wrong.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14162 from vanzin/SPARK-16505.
author: Marcelo Vanzin <vanzin@cloudera.com> 2016-07-14 09:42:32 -0500
committer: Tom Graves <tgraves@yahoo-inc.com> 2016-07-14 09:42:32 -0500
commit: b7b5e17876f65c6644505c356f1a0db24ce1d142 (patch)
tree: e4e6ae132a696260191ecb7e547457d6bbf5e4bf /docs/running-on-yarn.md
parent: c4bc2ed844ea045d2e8218154690b5b2b023f1e5 (diff)
1 files changed, 31 insertions, 0 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 4e92042da6..befd3eaee9 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -539,6 +539,37 @@ launch time. This is done by listing them in the `spark.yarn.access.namenodes` p
 spark.yarn.access.namenodes hdfs://ireland.example.org:8020/,hdfs://frankfurt.example.org:8020/
 ```
 
+## Configuring the External Shuffle Service
+
+To start the Spark Shuffle Service on each `NodeManager` in your YARN cluster, follow these
+instructions:
+
+1. Build Spark with the [YARN profile](building-spark.html). Skip this step if you are using a
+pre-packaged distribution.
+1. Locate the `spark-<version>-yarn-shuffle.jar`. This should be under
+`$SPARK_HOME/common/network-yarn/target/scala-<version>` if you are building Spark yourself, and under
+`lib` if you are using a distribution.
+1. Add this jar to the classpath of all `NodeManager`s in your cluster.
+1. In the `yarn-site.xml` on each node, add `spark_shuffle` to `yarn.nodemanager.aux-services`,
+then set `yarn.nodemanager.aux-services.spark_shuffle.class` to
+`org.apache.spark.network.yarn.YarnShuffleService`.
+1. Restart all `NodeManager`s in your cluster.
+
+The following extra configuration options are available when the shuffle service is running on YARN:
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+  <td><code>spark.yarn.shuffle.stopOnFailure</code></td>
+  <td><code>false</code></td>
+  <td>
+    Whether to stop the NodeManager when there's a failure in the Spark Shuffle Service's
+    initialization. This prevents application failures caused by running containers on
+    NodeManagers where the Spark Shuffle Service is not running.
+  </td>
+</tr>
+</table>
+
 ## Launching your application with Apache Oozie
 
 Apache Oozie can launch Spark applications as part of a workflow.
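
For reference, the `yarn-site.xml` changes described in the added section might look roughly like the fragment below. This is only a sketch: the `mapreduce_shuffle` entry is an assumed pre-existing auxiliary service, and placing `spark.yarn.shuffle.stopOnFailure` in `yarn-site.xml` assumes the shuffle service reads it from the NodeManager's Hadoop configuration.

```xml
<!-- Sketch of a yarn-site.xml fragment; merge into the existing <configuration> element. -->

<!-- Register spark_shuffle alongside any existing auxiliary services.
     mapreduce_shuffle is an assumption about what is already configured. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>

<!-- Class that implements the spark_shuffle auxiliary service. -->
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

<!-- Optional: stop the NodeManager if the shuffle service fails to initialize.
     The default is false. Setting it here is an assumption, see the note above. -->
<property>
  <name>spark.yarn.shuffle.stopOnFailure</name>
  <value>true</value>
</property>
```

The fragment only takes effect after each `NodeManager` is restarted, as in the last step of the instructions.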