author Matei Zaharia <matei@eecs.berkeley.edu> 2013-10-10 19:34:33 -0700
committer Matei Zaharia <matei@eecs.berkeley.edu> 2013-10-10 19:34:33 -0700
commit 8f11c36fe17c2718c895b771b59a9138521e0079
tree ae2bdd4eec278538fd5c6e971b3a1e42cfa62b60 /docs/running-on-yarn.md
parent c71499b7795564e1d16495c59273ecc027070fc5
parent 0fff4ee8523ff4137eedfc314b51135427137c63
Merge remote-tracking branch 'tgravescs/sparkYarnDistCache'
Closes #11

Conflicts:
    docs/running-on-yarn.md
    yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
Diffstat (limited to 'docs/running-on-yarn.md')
-rw-r--r-- docs/running-on-yarn.md 9
1 files changed, 8 insertions, 1 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 30128ec45d..2898af0bed 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -34,6 +34,8 @@ Environment variables:
System Properties:
* 'spark.yarn.applicationMaster.waitTries', property to set the number of times the ApplicationMaster waits for the Spark master and then also the number of tries it waits for the Spark Context to be initialized. Default is 10.
+* 'spark.yarn.submit.file.replication', the HDFS replication level for the files uploaded into HDFS for the application. These include things like the spark jar, the app jar, and any distributed cache files/archives.
+* 'spark.yarn.preserve.staging.files', set to true to preserve the staged files (spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
# Launching Spark on YARN
@@ -51,7 +53,10 @@ The command to launch the YARN Client is as follows:
--worker-memory <MEMORY_PER_WORKER> \
--worker-cores <CORES_PER_WORKER> \
--name <application_name> \
- --queue <queue_name>
+ --queue <queue_name> \
+ --addJars <any_local_files_used_in_SparkContext.addJar> \
+ --files <files_for_distributed_cache> \
+ --archives <archives_for_distributed_cache>
For example:
@@ -84,3 +89,5 @@ The above starts a YARN Client programs which periodically polls the Application
- When your application instantiates a Spark context it must use a special "yarn-standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "yarn-standalone" as an argument to your program, as shown in the example above.
- We do not request container resources based on the number of cores. Thus the number of cores given via command line arguments cannot be guaranteed.
- The local directories used for spark will be the local directories configured for YARN (Hadoop Yarn config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored.
+- The --files and --archives options support specifying file names with a # suffix, similar to Hadoop. For example, --files localtest.txt#appSees.txt uploads the local file localtest.txt into HDFS and links it under the name appSees.txt; your application should use the name appSees.txt to reference it when running on YARN.
+- The --addJars option allows the SparkContext.addJar function to work if you are using it with local files. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
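
The two notes added in the last hunk can be illustrated with a small, standalone Python sketch. It is not Spark's or Hadoop's implementation, and the helper names `split_link_name` and `needs_add_jars` are hypothetical. It only mirrors the behaviour described above: a `#` separates the local file from the name the application sees, and `--addJars` is only needed for jars that are not HDFS, HTTP, HTTPS, or FTP files. Falling back to the file's base name when no `#` is given is an assumption, since the notes do not state the default.

```python
from urllib.parse import urlsplit

# Schemes that, per the notes above, do not need --addJars.
REMOTE_SCHEMES = {"hdfs", "http", "https", "ftp"}


def split_link_name(spec):
    """Split 'path#link' into (path, link).

    Assumption: when no '#link' part is given, the link name
    defaults to the base name of the path.
    """
    path, sep, link = spec.partition("#")
    if not sep or not link:
        link = path.rstrip("/").rsplit("/", 1)[-1]
    return path, link


def needs_add_jars(jar):
    """Return True when the jar has no remote scheme, i.e. looks like a local file."""
    return urlsplit(jar).scheme.lower() not in REMOTE_SCHEMES


if __name__ == "__main__":
    print(split_link_name("localtest.txt#appSees.txt"))
    print(split_link_name("data/lookup.csv"))
    print(needs_add_jars("/home/user/extra.jar"))
    print(needs_add_jars("hdfs://namenode/libs/extra.jar"))
```

With `--files localtest.txt#appSees.txt`, the first element is the file on the submitting machine and the second is the name the application should open when running on YARN. The paths and host name in the usage lines are placeholders.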