From 6d53b971b9ce593898fda7705a105400f5ab6a46 Mon Sep 17 00:00:00 2001
From: Denny
Date: Thu, 13 Sep 2012 09:47:54 -0700
Subject: Added standalone and YARN docs. Merged standalone cluster into standalone doc

---
 docs/running-on-yarn.md | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
 create mode 100644 docs/running-on-yarn.md

diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
new file mode 100644
index 0000000000..3c0e54671b
--- /dev/null
+++ b/docs/running-on-yarn.md
@@ -0,0 +1,42 @@
+---
+layout: global
+title: Launching Spark on YARN
+---
+
+Spark allows you to launch jobs on an existing [YARN](http://hadoop.apache.org/common/docs/r0.23.1/hadoop-yarn/hadoop-yarn-site/YARN.html) cluster.
+
+## Preparations
+
+- To distribute Spark within the cluster, it must be packaged into a single JAR file. This can be done by running `sbt/sbt assembly`.
+- Your application code must be packaged into a separate JAR file.
+
+If you want to test the YARN deployment mode, you can use the bundled Spark examples. A `spark-examples_2.9.1-0.6.0-SNAPSHOT.jar` file can be generated by running `sbt/sbt package`.
+
+## Launching Spark on YARN
+
+The command to launch the YARN Client is as follows:
+
+    SPARK_JAR= ./run spark.deploy.yarn.Client \
+      --jar \
+      --class \
+      --args \
+      --num-workers \
+      --worker-memory \
+      --worker-cores
+
+For example:
+
+    SPARK_JAR=./core/target/spark-core-assembly-0.6.0-SNAPSHOT.jar ./run spark.deploy.yarn.Client \
+      --jar examples/target/scala-2.9.1/spark-examples_2.9.1-0.6.0-SNAPSHOT.jar \
+      --class spark.examples.SparkPi \
+      --args standalone \
+      --num-workers 3 \
+      --worker-memory 2g \
+      --worker-cores 2
+
+The above starts a YARN Client program, which periodically polls the Application Master for status updates and displays them in the console. The client exits once your application has finished running.
+
+## Important Notes
+
+- When your application instantiates a Spark context, it must use the special "standalone" master URL. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above.
+- YARN does not support requesting container resources based on the number of cores, so the number of cores given via command-line arguments cannot be guaranteed.
--
cgit v1.2.3