Diffstat (limited to 'docs/running-on-yarn.md')
-rw-r--r-- | docs/running-on-yarn.md | 60 |
1 files changed, 48 insertions, 12 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index c2957e6cb4..66fb8d73e8 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -5,24 +5,54 @@ title: Launching Spark on YARN
 Experimental support for running over a [YARN (Hadoop NextGen)](http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/YARN.html)
-cluster was added to Spark in version 0.6.0. Because YARN depends on version
-2.0 of the Hadoop libraries, this currently requires checking out a separate
-branch of Spark, called `yarn`, which you can do as follows:
+cluster was added to Spark in version 0.6.0. This was merged into master as part of 0.7 effort.
+To build spark core with YARN support, please use the hadoop2-yarn profile.
+Ex: mvn -Phadoop2-yarn clean install
- git clone git://github.com/mesos/spark
- cd spark
- git checkout -b yarn --track origin/yarn
+# Building spark core consolidated jar.
+
+We need a consolidated spark core jar (which bundles all the required dependencies) to run Spark jobs on a yarn cluster.
+This can be built either through sbt or via maven.
+
+- Building spark assembled jar via sbt.
+ It is a manual process of enabling it in project/SparkBuild.scala.
+Please comment out the
+ HADOOP_VERSION, HADOOP_MAJOR_VERSION and HADOOP_YARN
+variables before the line 'For Hadoop 2 YARN support'
+Next, uncomment the subsequent 3 variable declaration lines (for these three variables) which enable hadoop yarn support.
+
+Assembly of the jar Ex:
+
+ ./sbt/sbt clean assembly
+
+The assembled jar would typically be something like :
+`./core/target/spark-core-assembly-0.8.0-SNAPSHOT.jar`
+
+
+- Building spark assembled jar via Maven.
+ Use the hadoop2-yarn profile and execute the package target.
+
+Something like this. Ex:
+
+ mvn -Phadoop2-yarn clean package -DskipTests=true
+
+
+This will build the shaded (consolidated) jar. Typically something like :
+`./repl-bin/target/spark-repl-bin-<VERSION>-shaded-hadoop2-yarn.jar`
 # Preparations
-- In order to distribute Spark within the cluster, it must be packaged into a single JAR file. This can be done by running `sbt/sbt assembly`
+- Building spark core assembled jar (see above).
 - Your application code must be packaged into a separate JAR file. If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt/sbt package`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.
 # Launching Spark on YARN
+Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the hadoop cluster.
+This would be used to connect to the cluster, write to the dfs and submit jobs to the resource manager.
+
 The command to launch the YARN Client is as follows:
 SPARK_JAR=<SPARK_YAR_FILE> ./run spark.deploy.yarn.Client \
@@ -30,22 +60,28 @@ The command to launch the YARN Client is as follows:
 --class <APP_MAIN_CLASS> \
 --args <APP_MAIN_ARGUMENTS> \
 --num-workers <NUMBER_OF_WORKER_MACHINES> \
+ --master-memory <MEMORY_FOR_MASTER> \
 --worker-memory <MEMORY_PER_WORKER> \
- --worker-cores <CORES_PER_WORKER>
+ --worker-cores <CORES_PER_WORKER> \
+ --user <hadoop_user> \
+ --queue <queue_name>
 For example:
 SPARK_JAR=./core/target/spark-core-assembly-{{site.SPARK_VERSION}}.jar ./run spark.deploy.yarn.Client \
 --jar examples/target/scala-{{site.SCALA_VERSION}}/spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}.jar \
 --class spark.examples.SparkPi \
- --args standalone \
+ --args yarn-standalone \
 --num-workers 3 \
+ --master-memory 4g \
 --worker-memory 2g \
- --worker-cores 2
+ --worker-cores 1
 The above starts a YARN Client programs which periodically polls the Application Master for status updates and displays them in the console. The client will exit once your application has finished running.
 # Important Notes
-- When your application instantiates a Spark context it must use a special "standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above.
-- YARN does not support requesting container resources based on the number of cores. Thus the numbers of cores given via command line arguments cannot be guaranteed.
+- When your application instantiates a Spark context it must use a special "yarn-standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "yarn-standalone" as an argument to your program, as shown in the example above.
+- We do not requesting container resources based on the number of cores. Thus the numbers of cores given via command line arguments cannot be guaranteed.
+- Currently, we have not yet integrated with hadoop security. If --user is present, the hadoop_user specified will be used to run the tasks on the cluster. If unspecified, current user will be used (which should be valid in cluster).
+ Once hadoop security support is added, and if hadoop cluster is enabled with security, additional restrictions would apply via delegation tokens passed.
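
Note on the HADOOP_CONF_DIR / YARN_CONF_DIR requirement added in this patch: a pre-launch check could look roughly like the sketch below. This is only an illustration and is not part of the commit. The function name `resolve_conf_dir` is hypothetical, and preferring HADOOP_CONF_DIR over YARN_CONF_DIR when both are set is an assumption, since the patch does not state a precedence.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): print the Hadoop client
# configuration directory, or fail if none is usable.
# Assumption: HADOOP_CONF_DIR wins over YARN_CONF_DIR when both are set.
resolve_conf_dir() {
    conf_dir="${HADOOP_CONF_DIR:-${YARN_CONF_DIR:-}}"
    if [ -z "$conf_dir" ]; then
        echo "neither HADOOP_CONF_DIR nor YARN_CONF_DIR is set" >&2
        return 1
    fi
    if [ ! -d "$conf_dir" ]; then
        echo "configuration directory does not exist: $conf_dir" >&2
        return 1
    fi
    printf '%s\n' "$conf_dir"
}
```

A launch script might call `resolve_conf_dir >/dev/null || exit 1` before running the `spark.deploy.yarn.Client` command shown in the patch, so that a missing client-side configuration is reported before anything is submitted to the resource manager.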