---
layout: global
title: Launching Spark on YARN
---

Spark allows you to launch jobs on an existing [YARN](http://hadoop.apache.org/docs/r2.0.1-alpha/hadoop-yarn/hadoop-yarn-site/YARN.html) cluster. 

# Preparations

- In order to distribute Spark within the cluster, it must be packaged into a single JAR file. This can be done by running `sbt/sbt assembly`.
- Your application code must be packaged into a separate jar file.

If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_2.9.1-0.6.0-SNAPSHOT.jar` file can be generated by running `sbt/sbt package`.
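
For instance, both builds can be run from the top of the Spark checkout (the output paths shown in the comments assume the default sbt layout used in the example command further below):

    # Build the single Spark assembly JAR (e.g. core/target/spark-core-assembly-*.jar)
    sbt/sbt assembly

    # Build the examples JAR (e.g. examples/target/scala-2.9.1/spark-examples_*.jar)
    sbt/sbt package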

# Launching Spark on YARN

The command to launch the YARN Client is as follows:

    SPARK_JAR=<SPARK_JAR_FILE> ./run spark.deploy.yarn.Client \
      --jar <YOUR_APP_JAR_FILE> \
      --class <APP_MAIN_CLASS> \
      --args <APP_MAIN_ARGUMENTS> \
      --num-workers <NUMBER_OF_WORKER_MACHINES> \
      --worker-memory <MEMORY_PER_WORKER> \
      --worker-cores <CORES_PER_WORKER>

For example:

    SPARK_JAR=./core/target/spark-core-assembly-0.6.0-SNAPSHOT.jar ./run spark.deploy.yarn.Client \
      --jar examples/target/scala-2.9.1/spark-examples_2.9.1-0.6.0-SNAPSHOT.jar \
      --class spark.examples.SparkPi \
      --args standalone \
      --num-workers 3 \
      --worker-memory 2g \
      --worker-cores 2

The above starts a YARN Client program which periodically polls the Application Master for status updates and displays them in the console. The client will exit once your application has finished running.

# Important Notes

- When your application instantiates a Spark context, it must use the special "standalone" master URL. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above and in the sketch after this list.
- YARN does not support requesting container resources based on the number of cores. Thus the number of cores requested via the command-line arguments above cannot be guaranteed.
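
As a minimal sketch of the first point (the `MyApp` object name is hypothetical; the two-argument `SparkContext(master, jobName)` constructor is the same one used by the bundled examples such as `spark.examples.SparkPi`):

    import spark.SparkContext

    object MyApp {
      def main(args: Array[String]) {
        // When launched through the YARN Client with `--args standalone`,
        // args(0) is "standalone", so the scheduler starts without trying
        // to connect to an external cluster master.
        val sc = new SparkContext(args(0), "MyApp")
        // ... define and run your RDD operations here ...
      }
    }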