Diffstat (limited to 'docs/running-on-yarn.md')
-rw-r--r-- docs/running-on-yarn.md | 51
1 files changed, 51 insertions, 0 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
new file mode 100644
index 0000000000..6fb81b6004
--- /dev/null
+++ b/docs/running-on-yarn.md
@@ -0,0 +1,51 @@
+---
+layout: global
+title: Launching Spark on YARN
+---
+
+Experimental support for running over a [YARN (Hadoop
+NextGen)](http://hadoop.apache.org/docs/r2.0.1-alpha/hadoop-yarn/hadoop-yarn-site/YARN.html)
+cluster was added to Spark in version 0.6.0. Because YARN depends on version
+2.0 of the Hadoop libraries, this currently requires checking out a separate
+branch of Spark, called `yarn`, which you can do as follows:
+
+    git clone git://github.com/mesos/spark
+    cd spark
+    git checkout -b yarn --track origin/yarn
+
+
+# Preparations
+
+- To distribute Spark within the cluster, it must be packaged into a single JAR file. This can be done by running `sbt/sbt assembly`.
+- Your application code must be packaged into a separate JAR file.
+
+If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` JAR file can be generated by running `sbt/sbt package`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we assume here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the JAR generated by the `sbt/sbt package` command will differ.
+
+# Launching Spark on YARN
+
+The command to launch the YARN Client is as follows:
+
+    SPARK_JAR=<SPARK_JAR_FILE> ./run spark.deploy.yarn.Client \
+      --jar <YOUR_APP_JAR_FILE> \
+      --class <APP_MAIN_CLASS> \
+      --args <APP_MAIN_ARGUMENTS> \
+      --num-workers <NUMBER_OF_WORKER_MACHINES> \
+      --worker-memory <MEMORY_PER_WORKER> \
+      --worker-cores <CORES_PER_WORKER>
+
+For example:
+
+    SPARK_JAR=./core/target/spark-core-assembly-{{site.SPARK_VERSION}}.jar ./run spark.deploy.yarn.Client \
+      --jar examples/target/scala-{{site.SCALA_VERSION}}/spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}.jar \
+      --class spark.examples.SparkPi \
+      --args standalone \
+      --num-workers 3 \
+      --worker-memory 2g \
+      --worker-cores 2
+
+The above starts a YARN client program that periodically polls the Application Master for status updates and displays them in the console. The client exits once your application has finished running.
+
+# Important Notes
+
+- When your application instantiates a Spark context, it must use the special "standalone" master URL. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above.
+- YARN does not support requesting container resources based on the number of cores, so the number of cores given via command-line arguments cannot be guaranteed.