Merge remote-tracking branch 'tgravescs/sparkYarnDistCache'

Closes #11

Conflicts:
	docs/running-on-yarn.md
	yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
author: Matei Zaharia <matei@eecs.berkeley.edu> 2013-10-10 19:34:33 -0700
committer: Matei Zaharia <matei@eecs.berkeley.edu> 2013-10-10 19:34:33 -0700
commit: 8f11c36fe17c2718c895b771b59a9138521e0079 (patch)
tree: ae2bdd4eec278538fd5c6e971b3a1e42cfa62b60 /docs/running-on-yarn.md
parent: c71499b7795564e1d16495c59273ecc027070fc5 (diff)
parent: 0fff4ee8523ff4137eedfc314b51135427137c63 (diff)
1 files changed, 8 insertions, 1 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 30128ec45d..2898af0bed 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -34,6 +34,8 @@ Environment variables:
 
 System Properties:
 * 'spark.yarn.applicationMaster.waitTries', property to set the number of times the ApplicationMaster waits for the Spark master and then also the number of tries it waits for the Spark Context to be initialized. Default is 10.
+* 'spark.yarn.submit.file.replication', the HDFS replication level for the files uploaded into HDFS for the application. These include things like the spark jar, the app jar, and any distributed cache files/archives.
+* 'spark.yarn.preserve.staging.files', set to true to preserve the staged files (spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
 
 # Launching Spark on YARN
 
@@ -51,7 +53,10 @@ The command to launch the YARN Client is as follows:
       --worker-memory <MEMORY_PER_WORKER> \
       --worker-cores <CORES_PER_WORKER> \
       --name <application_name> \
-      --queue <queue_name>
+      --queue <queue_name> \
+      --addJars <any_local_files_used_in_SparkContext.addJar> \
+      --files <files_for_distributed_cache> \
+      --archives <archives_for_distributed_cache>
 
 For example:
 
@@ -84,3 +89,5 @@ The above starts a YARN Client programs which periodically polls the Application
 - When your application instantiates a Spark context it must use a special "yarn-standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "yarn-standalone" as an argument to your program, as shown in the example above.
 - We do not request container resources based on the number of cores. Thus the number of cores given via command line arguments cannot be guaranteed.
 - The local directories used for spark will be the local directories configured for YARN (Hadoop Yarn config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored.
+- The --files and --archives options support specifying file names with the # similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt: this uploads the file you have locally named localtest.txt into HDFS and links it under the name appSees.txt, so your application should use the name appSees.txt to reference it when running on YARN.
+- The --addJars option allows the SparkContext.addJar function to work if you are using it with local files. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
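
The `#` naming convention described in the new --files/--archives note can be illustrated with a small sketch. This is not the parsing code from this commit; `split_cache_spec` is a hypothetical helper, and falling back to the file's base name when no `#` is given is an assumption modeled on the usual Hadoop distributed cache behavior.

```python
import os


def split_cache_spec(spec):
    """Split one --files/--archives entry into (local_path, link_name).

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    # Everything before the first '#' is the file to upload;
    # everything after it is the name the application sees.
    local_path, _, link_name = spec.partition("#")
    if not link_name:
        # Assumption: with no '#', the link name is the file's base name.
        link_name = os.path.basename(local_path)
    return local_path, link_name


local_path, link_name = split_cache_spec("localtest.txt#appSees.txt")
print("upload:", local_path, "visible to the application as:", link_name)
```

With the entry from the note above, the application running on YARN would open `appSees.txt`, not `localtest.txt`.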