path: root/docs/running-on-yarn.md
author     Patrick Wendell <pwendell@gmail.com>  2013-07-31 21:35:12 -0700
committer  Patrick Wendell <pwendell@gmail.com>  2013-07-31 21:35:12 -0700
commit  5cc725a0e3ef523affae8ff54dd74707e49d64e3 (patch)
tree    ebd1698333d2df4194f17a9ea93a2f2eac2c7acd /docs/running-on-yarn.md
parent  b7b627d5bb1a1331ea580950834533f84735df4c (diff)
parent  f3cf09491a2b63e19a15e98cf815da503e4fb69b (diff)
Merge branch 'master' into ec2-updates
Conflicts: ec2/deploy.generic/root/mesos-ec2/ec2-variables.sh
Diffstat (limited to 'docs/running-on-yarn.md')
-rw-r--r--  docs/running-on-yarn.md  35
1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 26424bbe52..66fb8d73e8 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -11,14 +11,34 @@ Ex: mvn -Phadoop2-yarn clean install
# Building spark core consolidated jar.
-Currently, only sbt can buid a consolidated jar which contains the entire spark code - which is required for launching jars on yarn.
-To do this via sbt - though (right now) is a manual process of enabling it in project/SparkBuild.scala.
+We need a consolidated Spark core jar (which bundles all the required dependencies) to run Spark jobs on a YARN cluster.
+This can be built either with sbt or with Maven.
+
+- Building the Spark assembled jar via sbt.
+ This is a manual process of enabling YARN support in project/SparkBuild.scala (see the illustrative sketch at the end of this section).
Please comment out the
HADOOP_VERSION, HADOOP_MAJOR_VERSION and HADOOP_YARN
variables before the line 'For Hadoop 2 YARN support'
Next, uncomment the subsequent 3 variable declaration lines (for these three variables) which enable hadoop yarn support.
-Currnetly, it is a TODO to add support for maven assembly.
+To assemble the jar, for example:
+
+ ./sbt/sbt clean assembly
+
+The assembled jar would typically be something like:
+`./core/target/spark-core-assembly-0.8.0-SNAPSHOT.jar`
+
+
+- Building the Spark assembled jar via Maven.
+ Use the hadoop2-yarn profile and execute the package target.
+
+For example:
+
+ mvn -Phadoop2-yarn clean package -DskipTests=true
+
+
+This will build the shaded (consolidated) jar, typically something like:
+`./repl-bin/target/spark-repl-bin-<VERSION>-shaded-hadoop2-yarn.jar`
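
To illustrate the sbt route above: after the edits described there, the relevant block of project/SparkBuild.scala looks roughly like the sketch below. This is an illustrative sketch only; the exact Hadoop version strings are assumptions and may differ in your checkout.

    // project/SparkBuild.scala -- illustrative sketch; version strings are assumptions
    // Default (non-YARN) settings, now commented out:
    //val HADOOP_VERSION = "1.0.4"
    //val HADOOP_MAJOR_VERSION = "1"
    //val HADOOP_YARN = false

    // For Hadoop 2 YARN support -- the three declarations below uncommented:
    val HADOOP_VERSION = "2.0.2-alpha"
    val HADOOP_MAJOR_VERSION = "2"
    val HADOOP_YARN = true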
# Preparations
@@ -30,6 +50,9 @@ If you want to test out the YARN deployment mode, you can use the current Spark
# Launching Spark on YARN
+Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster.
+These configs are used to connect to the cluster, write to the dfs, and submit jobs to the resource manager.
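For example, if the client-side configs for your cluster live under `/etc/hadoop/conf` (an illustrative path; yours may differ), you could run `export HADOOP_CONF_DIR=/etc/hadoop/conf` in the shell before invoking the client.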
+
The command to launch the YARN Client is as follows:
SPARK_JAR=<SPARK_JAR_FILE> ./run spark.deploy.yarn.Client \
@@ -48,7 +71,7 @@ For example:
SPARK_JAR=./core/target/spark-core-assembly-{{site.SPARK_VERSION}}.jar ./run spark.deploy.yarn.Client \
--jar examples/target/scala-{{site.SCALA_VERSION}}/spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}.jar \
--class spark.examples.SparkPi \
- --args standalone \
+ --args yarn-standalone \
--num-workers 3 \
--master-memory 4g \
--worker-memory 2g \
@@ -58,7 +81,7 @@ The above starts a YARN Client programs which periodically polls the Application
# Important Notes
-- When your application instantiates a Spark context it must use a special "standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above.
-- YARN does not support requesting container resources based on the number of cores. Thus the numbers of cores given via command line arguments cannot be guaranteed.
+- When your application instantiates a Spark context it must use a special "yarn-standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "yarn-standalone" as an argument to your program, as shown in the example above (a short sketch of this follows these notes).
+- We do not request container resources based on the number of cores, so the number of cores given via command line arguments cannot be guaranteed.
- Currently, we have not yet integrated with Hadoop security. If --user is present, the specified hadoop_user will be used to run the tasks on the cluster. If unspecified, the current user will be used (which should be valid in the cluster).
Once Hadoop security support is added, and if the Hadoop cluster is enabled with security, additional restrictions would apply via the delegation tokens passed.
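
As a minimal sketch of the "yarn-standalone" note above (the object name and the trivial job here are made up for illustration; the bundled SparkPi example follows the same pattern of taking the master url as its first argument):

    import spark.SparkContext

    // Illustrative sketch: the master url ("yarn-standalone") arrives as the first
    // program argument when the job is launched through spark.deploy.yarn.Client --args.
    object YarnExample {
      def main(args: Array[String]) {
        val sc = new SparkContext(args(0), "YarnExample")
        val count = sc.parallelize(1 to 1000).count()
        println("Count: " + count)
        sc.stop()
      }
    }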