path: root/docs/running-on-yarn.md
author    Marcelo Vanzin <vanzin@cloudera.com>    2016-03-11 07:54:57 -0600
committer Tom Graves <tgraves@yahoo-inc.com>    2016-03-11 07:54:57 -0600
commit    07f1c5447753a3d593cd6ececfcb03c11b1cf8ff (patch)
tree      74c4c9f81e64cc1ddde0b1c5e554a836808609e1 /docs/running-on-yarn.md
parent    8fff0f92a4aca90b62c6e272eabcbb0257ba38d5 (diff)
[SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive.
In preparation for the demise of assemblies, this change allows the YARN backend to use multiple jars and globs as the "Spark jar". The config option has been renamed to "spark.yarn.jars" to reflect that.

A second option "spark.yarn.archive" was also added; if set, this takes precedence and uploads an archive expected to contain the jar files with the Spark code and its dependencies.

Existing deployments should keep working, mostly. This change drops support for the "SPARK_JAR" environment variable, and also does not fall back to using "jarOfClass" if no configuration is set, falling back to finding files under SPARK_HOME instead. This should be fine since "jarOfClass" probably wouldn't work unless you were using spark-submit anyway.

Tested with the unit tests, and trying the different config options on a YARN cluster.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11500 from vanzin/SPARK-13577.
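As a rough usage sketch (not part of the commit itself), the renamed option could be passed at submission time; the HDFS path, application jar, and class name below are placeholders:

    # Hedged example only: distribute Spark's jars from a world-readable HDFS
    # location instead of uploading them on every submission. Globs are allowed.
    ./bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.jars="hdfs:///some/path/*.jar" \
      --class com.example.MyApp \
      /path/to/my-app.jar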
Diffstat (limited to 'docs/running-on-yarn.md')
-rw-r--r--  docs/running-on-yarn.md  25
1 file changed, 18 insertions, 7 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index ad66b9f64a..8045f8c5b8 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -272,14 +272,25 @@ If you need a reference to the proper location to put log files in the YARN so t
</td>
</tr>
<tr>
- <td><code>spark.yarn.jar</code></td>
+ <td><code>spark.yarn.jars</code></td>
<td>(none)</td>
<td>
- The location of the Spark jar file, in case overriding the default location is desired.
- By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be
+ List of libraries containing Spark code to distribute to YARN containers.
+ By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be
in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't
- need to be distributed each time an application runs. To point to a jar on HDFS, for example,
- set this configuration to <code>hdfs:///some/path</code>.
+ need to be distributed each time an application runs. To point to jars on HDFS, for example,
+ set this configuration to <code>hdfs:///some/path</code>. Globs are allowed.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.yarn.archive</code></td>
+ <td>(none)</td>
+ <td>
+ An archive containing needed Spark jars for distribution to the YARN cache. If set, this
+ configuration replaces <code>spark.yarn.jars</code> and the archive is used in all the
+ application's containers. The archive should contain jar files in its root directory.
+ Like with the previous option, the archive can also be hosted on HDFS to speed up file
+ distribution.
</td>
</tr>
<tr>
@@ -288,8 +299,8 @@ If you need a reference to the proper location to put log files in the YARN so t
<td>
A comma-separated list of secure HDFS namenodes your Spark application is going to access. For
example, <code>spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032,
- webhdfs://nn3.com:50070</code>. The Spark application must have access to the namenodes listed
- and Kerberos must be properly configured to be able to access them (either in the same realm
+ webhdfs://nn3.com:50070</code>. The Spark application must have access to the namenodes listed
+ and Kerberos must be properly configured to be able to access them (either in the same realm
or in a trusted realm). Spark acquires security tokens for each of the namenodes so that
the Spark application can access those remote HDFS clusters.
</td>
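As a follow-up sketch for the new spark.yarn.archive option described above (again, not part of the commit; the paths and the assumption that the Spark jars live under $SPARK_HOME/jars are illustrative only), one might package the jars into an archive whose root contains the jar files, host it on HDFS, and point the configuration at it:

    # Hypothetical workflow: build an archive with the Spark jars at its root,
    # upload it to HDFS, and let YARN cache it across applications.
    zip -j spark-libs.zip "$SPARK_HOME"/jars/*.jar   # -j stores the jars at the archive root
    hadoop fs -mkdir -p /spark/archives
    hadoop fs -put spark-libs.zip /spark/archives/
    ./bin/spark-submit \
      --master yarn \
      --conf spark.yarn.archive=hdfs:///spark/archives/spark-libs.zip \
      --class com.example.MyApp \
      /path/to/my-app.jar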