path: root/docs/running-on-yarn.md
author     Patrick Wendell <pwendell@gmail.com>  2013-07-31 21:35:12 -0700
committer  Patrick Wendell <pwendell@gmail.com>  2013-07-31 21:35:12 -0700
commit  5cc725a0e3ef523affae8ff54dd74707e49d64e3 (patch)
tree    ebd1698333d2df4194f17a9ea93a2f2eac2c7acd /docs/running-on-yarn.md
parent  b7b627d5bb1a1331ea580950834533f84735df4c (diff)
parent  f3cf09491a2b63e19a15e98cf815da503e4fb69b (diff)
Merge branch 'master' into ec2-updates
Conflicts: ec2/deploy.generic/root/mesos-ec2/ec2-variables.sh
Diffstat (limited to 'docs/running-on-yarn.md')
-rw-r--r--  docs/running-on-yarn.md  35
1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 26424bbe52..66fb8d73e8 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -11,14 +11,34 @@ Ex: mvn -Phadoop2-yarn clean install
# Building spark core consolidated jar.
-Currently, only sbt can buid a consolidated jar which contains the entire spark code - which is required for launching jars on yarn.
-To do this via sbt - though (right now) is a manual process of enabling it in project/SparkBuild.scala.
+We need a consolidated Spark core jar (which bundles all the required dependencies) to run Spark jobs on a YARN cluster.
+This can be built either with sbt or with Maven.
+
+- Building the Spark assembled jar via sbt.
+ This is a manual process of enabling YARN support in project/SparkBuild.scala (see the illustrative sketch at the end of this section).
Please comment out the
HADOOP_VERSION, HADOOP_MAJOR_VERSION and HADOOP_YARN
variables before the line 'For Hadoop 2 YARN support'
Next, uncomment the subsequent 3 variable declaration lines (for these three variables) which enable hadoop yarn support.
-Currnetly, it is a TODO to add support for maven assembly.
+To assemble the jar, for example:
+
+ ./sbt/sbt clean assembly
+
+The assembled jar would typically be something like:
+`./core/target/spark-core-assembly-0.8.0-SNAPSHOT.jar`
+
+
+- Building the Spark assembled jar via Maven.
+ Use the hadoop2-yarn profile and execute the package target.
+
+For example:
+
+ mvn -Phadoop2-yarn clean package -DskipTests=true
+
+
+This will build the shaded (consolidated) jar, typically something like:
+`./repl-bin/target/spark-repl-bin-<VERSION>-shaded-hadoop2-yarn.jar`
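
To illustrate the sbt route above: after the edits described there, the relevant block of project/SparkBuild.scala looks roughly like the sketch below. This is an illustrative sketch only; the exact Hadoop version strings are assumptions and may differ in your checkout.

    // project/SparkBuild.scala -- illustrative sketch; version strings are assumptions
    // Default (non-YARN) settings, now commented out:
    //val HADOOP_VERSION = "1.0.4"
    //val HADOOP_MAJOR_VERSION = "1"
    //val HADOOP_YARN = false

    // For Hadoop 2 YARN support -- the three declarations below uncommented:
    val HADOOP_VERSION = "2.0.2-alpha"
    val HADOOP_MAJOR_VERSION = "2"
    val HADOOP_YARN = true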
# Preparations
@@ -30,6 +50,9 @@ If you want to test out the YARN deployment mode, you can use the current Spark
# Launching Spark on YARN
+Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster.
+These configs are used to connect to the cluster, write to the dfs, and submit jobs to the resource manager.
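For example, if the client-side configs for your cluster live under `/etc/hadoop/conf` (an illustrative path; yours may differ), you could run `export HADOOP_CONF_DIR=/etc/hadoop/conf` in the shell before invoking the client.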
+
The command to launch the YARN Client is as follows:
SPARK_JAR=<SPARK_JAR_FILE> ./run spark.deploy.yarn.Client \
@@ -48,7 +71,7 @@ For example:
SPARK_JAR=./core/target/spark-core-assembly-{{site.SPARK_VERSION}}.jar ./run spark.deploy.yarn.Client \
--jar examples/target/scala-{{site.SCALA_VERSION}}/spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}.jar \
--class spark.examples.SparkPi \
- --args standalone \
+ --args yarn-standalone \
--num-workers 3 \
--master-memory 4g \
--worker-memory 2g \
@@ -58,7 +81,7 @@ The above starts a YARN Client programs which periodically polls the Application
# Important Notes
-- When your application instantiates a Spark context it must use a special "standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above.
-- YARN does not support requesting container resources based on the number of cores. Thus the numbers of cores given via command line arguments cannot be guaranteed.
+- When your application instantiates a Spark context it must use a special "yarn-standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "yarn-standalone" as an argument to your program, as shown in the example above (a short sketch of this follows these notes).
+- We do not request container resources based on the number of cores, so the number of cores given via command line arguments cannot be guaranteed.
- Currently, we have not yet integrated with Hadoop security. If --user is present, the specified hadoop_user will be used to run the tasks on the cluster. If unspecified, the current user will be used (which should be valid in the cluster).
Once Hadoop security support is added, and if the Hadoop cluster is enabled with security, additional restrictions would apply via the delegation tokens passed.
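
As a minimal sketch of the "yarn-standalone" note above (the object name and the trivial job here are made up for illustration; the bundled SparkPi example follows the same pattern of taking the master url as its first argument):

    import spark.SparkContext

    // Illustrative sketch: the master url ("yarn-standalone") arrives as the first
    // program argument when the job is launched through spark.deploy.yarn.Client --args.
    object YarnExample {
      def main(args: Array[String]) {
        val sc = new SparkContext(args(0), "YarnExample")
        val count = sc.parallelize(1 to 1000).count()
        println("Count: " + count)
        sc.stop()
      }
    }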