path: root/docs/running-on-yarn.md
author    Marcelo Vanzin <vanzin@cloudera.com>    2016-03-11 07:54:57 -0600
committer Tom Graves <tgraves@yahoo-inc.com>    2016-03-11 07:54:57 -0600
commit    07f1c5447753a3d593cd6ececfcb03c11b1cf8ff (patch)
tree      74c4c9f81e64cc1ddde0b1c5e554a836808609e1 /docs/running-on-yarn.md
parent    8fff0f92a4aca90b62c6e272eabcbb0257ba38d5 (diff)
[SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive.
In preparation for the demise of assemblies, this change allows the YARN backend to use multiple jars and globs as the "Spark jar". The config option has been renamed to "spark.yarn.jars" to reflect that.

A second option "spark.yarn.archive" was also added; if set, this takes precedence and uploads an archive expected to contain the jar files with the Spark code and its dependencies.

Existing deployments should keep working, mostly. This change drops support for the "SPARK_JAR" environment variable, and also does not fall back to using "jarOfClass" if no configuration is set, falling back to finding files under SPARK_HOME instead. This should be fine since "jarOfClass" probably wouldn't work unless you were using spark-submit anyway.

Tested with the unit tests, and trying the different config options on a YARN cluster.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11500 from vanzin/SPARK-13577.
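As a rough usage sketch (not part of the commit itself), the renamed option could be passed at submission time; the HDFS path, application jar, and class name below are placeholders:

    # Hedged example only: distribute Spark's jars from a world-readable HDFS
    # location instead of uploading them on every submission. Globs are allowed.
    ./bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.jars="hdfs:///some/path/*.jar" \
      --class com.example.MyApp \
      /path/to/my-app.jar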
Diffstat (limited to 'docs/running-on-yarn.md')
-rw-r--r--  docs/running-on-yarn.md  25
1 file changed, 18 insertions, 7 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index ad66b9f64a..8045f8c5b8 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -272,14 +272,25 @@ If you need a reference to the proper location to put log files in the YARN so t
</td>
</tr>
<tr>
- <td><code>spark.yarn.jar</code></td>
+ <td><code>spark.yarn.jars</code></td>
<td>(none)</td>
<td>
- The location of the Spark jar file, in case overriding the default location is desired.
- By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be
+ List of libraries containing Spark code to distribute to YARN containers.
+ By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be
in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't
- need to be distributed each time an application runs. To point to a jar on HDFS, for example,
- set this configuration to <code>hdfs:///some/path</code>.
+ need to be distributed each time an application runs. To point to jars on HDFS, for example,
+ set this configuration to <code>hdfs:///some/path</code>. Globs are allowed.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.yarn.archive</code></td>
+ <td>(none)</td>
+ <td>
+ An archive containing needed Spark jars for distribution to the YARN cache. If set, this
+ configuration replaces <code>spark.yarn.jars</code> and the archive is used in all the
+ application's containers. The archive should contain jar files in its root directory.
+ Like with the previous option, the archive can also be hosted on HDFS to speed up file
+ distribution.
</td>
</tr>
<tr>
@@ -288,8 +299,8 @@ If you need a reference to the proper location to put log files in the YARN so t
<td>
A comma-separated list of secure HDFS namenodes your Spark application is going to access. For
example, <code>spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032,
- webhdfs://nn3.com:50070</code>. The Spark application must have access to the namenodes listed
- and Kerberos must be properly configured to be able to access them (either in the same realm
+ webhdfs://nn3.com:50070</code>. The Spark application must have access to the namenodes listed
+ and Kerberos must be properly configured to be able to access them (either in the same realm
or in a trusted realm). Spark acquires security tokens for each of the namenodes so that
the Spark application can access those remote HDFS clusters.
</td>
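As a follow-up sketch for the new spark.yarn.archive option described above (again, not part of the commit; the paths and the assumption that the Spark jars live under $SPARK_HOME/jars are illustrative only), one might package the jars into an archive whose root contains the jar files, host it on HDFS, and point the configuration at it:

    # Hypothetical workflow: build an archive with the Spark jars at its root,
    # upload it to HDFS, and let YARN cache it across applications.
    zip -j spark-libs.zip "$SPARK_HOME"/jars/*.jar   # -j stores the jars at the archive root
    hadoop fs -mkdir -p /spark/archives
    hadoop fs -put spark-libs.zip /spark/archives/
    ./bin/spark-submit \
      --master yarn \
      --conf spark.yarn.archive=hdfs:///spark/archives/spark-libs.zip \
      --class com.example.MyApp \
      /path/to/my-app.jar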