path: root/docs/running-on-yarn.md
author    Mridul Muralidharan <mridul@gmail.com>    2013-04-19 00:13:19 +0530
committer Mridul Muralidharan <mridul@gmail.com>    2013-04-19 00:13:19 +0530
commit    ac2e8e8720f10efd640a67ad85270719ab2d43e9 (patch)
tree      b682b8e0325449ec92b11f143cd16590e5765d3b /docs/running-on-yarn.md
parent    5ee2f5c4837f0098282d93c85e606e1a3af40dd6 (diff)
Add some basic documentation
Diffstat (limited to 'docs/running-on-yarn.md')
-rw-r--r--  docs/running-on-yarn.md  31
1 file changed, 22 insertions, 9 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index c2957e6cb4..26424bbe52 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -5,18 +5,25 @@ title: Launching Spark on YARN
Experimental support for running over a [YARN (Hadoop
NextGen)](http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/YARN.html)
-cluster was added to Spark in version 0.6.0. Because YARN depends on version
-2.0 of the Hadoop libraries, this currently requires checking out a separate
-branch of Spark, called `yarn`, which you can do as follows:
+cluster was added to Spark in version 0.6.0, and was merged into master as part of the 0.7 effort.
+To build Spark core with YARN support, use the `hadoop2-yarn` Maven profile:
+
+    mvn -Phadoop2-yarn clean install
- git clone git://github.com/mesos/spark
- cd spark
- git checkout -b yarn --track origin/yarn
+# Building a consolidated Spark core jar
+
+Currently, only sbt can build a consolidated jar containing the entire Spark code,
+which is required for launching jobs on YARN.
+Enabling this in sbt is (right now) a manual edit to project/SparkBuild.scala:
+comment out the HADOOP_VERSION, HADOOP_MAJOR_VERSION and HADOOP_YARN variable
+declarations that appear before the line 'For Hadoop 2 YARN support', then
+uncomment the subsequent three declarations of the same variables, which enable
+Hadoop YARN support, as sketched below.
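+
+As a rough sketch, the relevant part of project/SparkBuild.scala then looks
+something like the following (an excerpt only; the exact version strings are
+illustrative and may differ in your checkout):
+
+    // Hadoop 1 defaults -- commented out to enable YARN support
+    //val HADOOP_VERSION = "1.0.4"
+    //val HADOOP_MAJOR_VERSION = "1"
+    //val HADOOP_YARN = false
+
+    // For Hadoop 2 YARN support
+    val HADOOP_VERSION = "2.0.2-alpha"
+    val HADOOP_MAJOR_VERSION = "2"
+    val HADOOP_YARN = true
+
+With those edits in place, `sbt/sbt assembly` should produce the consolidated jar.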
+
+Adding support for a Maven assembly is currently a TODO.
# Preparations
-- In order to distribute Spark within the cluster, it must be packaged into a single JAR file. This can be done by running `sbt/sbt assembly`
+- Build the assembled Spark core jar (see the section above).
- Your application code must be packaged into a separate JAR file.
If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt/sbt package`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.
@@ -30,8 +37,11 @@ The command to launch the YARN Client is as follows:
--class <APP_MAIN_CLASS> \
--args <APP_MAIN_ARGUMENTS> \
--num-workers <NUMBER_OF_WORKER_MACHINES> \
+ --master-memory <MEMORY_FOR_MASTER> \
--worker-memory <MEMORY_PER_WORKER> \
- --worker-cores <CORES_PER_WORKER>
+ --worker-cores <CORES_PER_WORKER> \
+ --user <hadoop_user> \
+ --queue <queue_name>
For example:
@@ -40,8 +50,9 @@ For example:
--class spark.examples.SparkPi \
--args standalone \
--num-workers 3 \
+ --master-memory 4g \
--worker-memory 2g \
- --worker-cores 2
+ --worker-cores 1
The above starts a YARN client program which periodically polls the Application Master for status updates and displays them in the console. The client will exit once your application has finished running.
@@ -49,3 +60,5 @@ The above starts a YARN Client programs which periodically polls the Application
- When your application instantiates a Spark context, it must use a special "standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above and in the sketch after this list.
- YARN does not support requesting container resources based on the number of cores. Thus the number of cores given via command line arguments cannot be guaranteed.
+- Currently, we have not yet integrated with Hadoop security. If --user is given, the specified hadoop_user is used to run the tasks on the cluster; if unspecified, the current user is used (which should be valid in the cluster).
+ Once Hadoop security support is added, and if the Hadoop cluster is secured, additional restrictions will apply via the delegation tokens passed.
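+
+To illustrate the special "standalone" master url, a minimal driver might look
+like the following (a sketch only; `MyApp` is a hypothetical application, and
+`args(0)` carries "standalone" when launched as in the example above):
+
+    import spark.SparkContext
+
+    object MyApp {
+      def main(args: Array[String]) {
+        // args(0) == "standalone" starts the scheduler without connecting to a cluster
+        val sc = new SparkContext(args(0), "MyApp")
+        // ... run Spark jobs with sc ...
+      }
+    }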