Support distributed cache files and archives on Spark on YARN and attempt to clean up the staging directory on exit

author: Y.CORP.YAHOO.COM\tgraves <tgraves@thatenemy-lm.champ.corp.yahoo.com> 2013-09-23 09:09:59 -0500
committer: Y.CORP.YAHOO.COM\tgraves <tgraves@thatenemy-lm.champ.corp.yahoo.com> 2013-09-23 09:09:59 -0500
commit: 9d4246863a25f7c91f324e004fe000b9848f6057 (patch)
tree: bc5c669de3cb38bcbc4b323c0bd12253848c0d6e /docs/running-on-yarn.md
parent: 119de80294bd0cb82855bd1982c5371b661b6fd5 (diff)
1 files changed, 6 insertions, 1 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index c611db0af4..beaae69aa2 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -34,6 +34,8 @@ Environment variables:
 
 System Properties:
 * 'spark.yarn.applicationMaster.waitTries', property to set the number of times the ApplicationMaster waits for the Spark master and then also the number of tries it waits for the Spark Context to be initialized. Default is 10.
+* 'spark.yarn.submit.file.replication', the HDFS replication level for the files uploaded into HDFS for the application. These include things like the spark jar, the app jar, and any distributed cache files/archives.
+* 'spark.yarn.preserve.staging.files', set to true to preserve the staged files (spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
 
 # Launching Spark on YARN
 
@@ -50,7 +52,9 @@ The command to launch the YARN Client is as follows:
       --master-memory <MEMORY_FOR_MASTER> \
       --worker-memory <MEMORY_PER_WORKER> \
       --worker-cores <CORES_PER_WORKER> \
-      --queue <queue_name>
+      --queue <queue_name> \
+      --files <files_for_distributed_cache> \
+      --archives <archives_for_distributed_cache>
 
 For example:
 
@@ -83,3 +87,4 @@ The above starts a YARN Client programs which periodically polls the Application
 - When your application instantiates a Spark context it must use a special "yarn-standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "yarn-standalone" as an argument to your program, as shown in the example above.
 - We do not request container resources based on the number of cores. Thus the number of cores given via command line arguments cannot be guaranteed.
 - The local directories used for spark will be the local directories configured for YARN (Hadoop Yarn config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored.
+- The --files and --archives options support specifying file names with the # similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt. This uploads the local file localtest.txt into HDFS, but links it under the name appSees.txt, so your application should use the name appSees.txt to reference it when running on YARN.
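
The `#` naming convention described in the last added line can be illustrated with a small sketch. This is not code from the commit: `split_cache_spec` is a hypothetical helper, and falling back to the file's base name when no `#` is given is an assumption modeled on Hadoop's distributed cache behavior, not something this patch states.

```python
import posixpath


def split_cache_spec(spec):
    """Split a 'path#alias' entry into (path, link_name).

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    # Everything before the first '#' is the file to upload;
    # everything after it is the name the application will see.
    path, _, alias = spec.partition("#")
    if not path:
        raise ValueError("missing file path in spec: %r" % spec)
    # Assumption: with no alias, the link keeps the file's base name.
    link_name = alias if alias else posixpath.basename(path)
    return path, link_name


if __name__ == "__main__":
    path, link_name = split_cache_spec("localtest.txt#appSees.txt")
    print("upload:", path)
    print("application reads:", link_name)
```

With the example from the patch, the file uploaded is `localtest.txt`, while the application running on YARN would open it as `appSees.txt`.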