From 6585f49841ada637b0811e0aadcf93132fff7001 Mon Sep 17 00:00:00 2001
From: Jey Kottalam <jey@cs.berkeley.edu>
Date: Wed, 21 Aug 2013 14:51:56 -0700
Subject: Update build docs

---
 README.md                   | 46 +++++++++++++++++++++++++++++++++++++++++----
 docs/building-with-maven.md | 35 +++++++++++++++++++++++-----------
 docs/running-on-yarn.md     | 20 +++++++-------------
 3 files changed, 73 insertions(+), 28 deletions(-)
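
Note: the README hunk below splits Hadoop versions into two groups (without YARN and with YARN). As a rough illustration of that rule, here is a minimal shell sketch that prints the matching sbt command for a given version string. The function names `needs_yarn` and `build_cmd` are hypothetical and not part of Spark, and the `-mr1-` substring check is an assumption drawn only from the CDH example version strings in this patch; treat it as a heuristic, not as the build's actual logic.

```shell
#!/bin/sh
# Hypothetical helper: guess whether a Hadoop version string needs SPARK_WITH_YARN=true.
# Heuristic based only on the example versions in this patch.
needs_yarn() {
  case "$1" in
    *-mr1-*)    echo false ;;  # e.g. 2.0.0-mr1-cdh4.2.0 (CDH with MapReduce v1)
    0.23.*|2.*) echo true ;;   # e.g. 0.23.9, 2.0.5-alpha, 2.0.0-cdh4.2.0
    *)          echo false ;;  # e.g. 1.2.1, 0.20.205.0
  esac
}

# Hypothetical helper: print (not run) the sbt invocation for a given version.
build_cmd() {
  if [ "$(needs_yarn "$1")" = "true" ]; then
    echo "SPARK_HADOOP_VERSION=$1 SPARK_WITH_YARN=true sbt/sbt package assembly"
  else
    echo "SPARK_HADOOP_VERSION=$1 sbt/sbt package assembly"
  fi
}
```

For instance, `build_cmd 2.0.5-alpha` prints the YARN-enabled command line, while `build_cmd 1.2.1` prints the plain one. The helper only echoes the command so it can be inspected before running.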
diff --git a/README.md b/README.md
index 1dd96a0a4a..1e388ff380 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The project is
 built using Simple Build Tool (SBT), which is packaged with it. To build
 Spark and its example programs, run:
 
-    sbt/sbt package
+    sbt/sbt package assembly
 
 Spark also supports building using Maven. If you would like to build using Maven,
 see the [instructions for building Spark with Maven](http://spark-project.org/docs/latest/building-with-maven.html)
@@ -43,10 +43,48 @@ locally with one thread, or "local[N]" to run locally with N threads.
 ## A Note About Hadoop Versions
 
 Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
-storage systems. Because the HDFS API has changed in different versions of
+storage systems. Because the protocols have changed in different versions of
 Hadoop, you must build Spark against the same version that your cluster runs.
-You can change the version by setting the `HADOOP_VERSION` variable at the top
-of `project/SparkBuild.scala`, then rebuilding Spark.
+You can change the version by setting the `SPARK_HADOOP_VERSION` environment
+variable when building Spark.
+
+For Apache Hadoop versions 1.x, 0.20.x, Cloudera CDH MRv1, and other Hadoop
+versions without YARN, use:
+
+    # Apache Hadoop 1.2.1
+    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt package assembly
+
+    # Cloudera CDH 4.2.0 with MapReduce v1
+    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt package assembly
+
+For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
+with YARN, also set `SPARK_WITH_YARN=true`:
+
+    # Apache Hadoop 2.0.5-alpha
+    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_WITH_YARN=true sbt/sbt package assembly
+
+    # Cloudera CDH 4.2.0 with MapReduce v2
+    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_WITH_YARN=true sbt/sbt package assembly
+
+For convenience, these variables may also be set through the `conf/spark-env.sh` file
+described below.
+
+When developing a Spark application, specify the Hadoop version by adding the
+"hadoop-client" artifact to your project's dependencies. For example, if you're
+using Hadoop 0.23.9 and build your application using SBT, add this to
+`libraryDependencies`:
+
+    // "force()" is required because "0.23.9" is less than Spark's default of "1.0.4"
+    "org.apache.hadoop" % "hadoop-client" % "0.23.9" force()
+
+If your project is built with Maven, add this to your POM file's `<dependencies>` section:
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-client</artifactId>
+      <!-- the brackets are needed to tell Maven that this is a hard dependency on version "0.23.9" exactly -->
+      <version>[0.23.9]</version>
+    </dependency>
 
 
 ## Configuration
diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
index 04cd79d039..d71d94fa63 100644
--- a/docs/building-with-maven.md
+++ b/docs/building-with-maven.md
@@ -8,22 +8,26 @@ title: Building Spark with Maven
 
 Building Spark using Maven Requires Maven 3 (the build process is tested with Maven 3.0.4) and Java 1.6 or newer.
 
-Building with Maven requires that a Hadoop profile be specified explicitly at the command line, there is no default. There are two profiles to choose from, one for building for Hadoop 1 or Hadoop 2.
+## Specifying the Hadoop version ##
 
-for Hadoop 1 (using 0.20.205.0) use:
+To enable support for HDFS and other Hadoop-supported storage systems, specify the exact Hadoop version by setting the "hadoop.version" property. If unset, Spark will build against Hadoop 1.0.4 by default.
 
-    $ mvn -Phadoop1 clean install
+For Apache Hadoop versions 1.x, 0.20.x, Cloudera CDH MRv1, and other Hadoop versions without YARN, use:
 
+    # Apache Hadoop 1.2.1
+    $ mvn -Dhadoop.version=1.2.1 clean install
 
-for Hadoop 2 (using 2.0.0-mr1-cdh4.1.1) use:
+    # Cloudera CDH 4.2.0 with MapReduce v1
+    $ mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 clean install
 
-    $ mvn -Phadoop2 clean install
+For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, enable the "hadoop2-yarn" profile:
 
-It uses the scala-maven-plugin which supports incremental and continuous compilation. E.g.
+    # Apache Hadoop 2.0.5-alpha
+    $ mvn -Phadoop2-yarn -Dhadoop.version=2.0.5-alpha clean install
 
-    $ mvn -Phadoop2 scala:cc
+    # Cloudera CDH 4.2.0 with MapReduce v2
+    $ mvn -Phadoop2-yarn -Dhadoop.version=2.0.0-cdh4.2.0 clean install
 
-…should run continuous compilation (i.e. wait for changes). However, this has not been tested extensively.
 
 ## Spark Tests in Maven ##
 
@@ -31,11 +35,11 @@ Tests are run by default via the scalatest-maven-plugin. With this you can do th
 
 Skip test execution (but not compilation):
 
-    $ mvn -DskipTests -Phadoop2 clean install
+    $ mvn -Dhadoop.version=... -DskipTests clean install
 
 To run a specific test suite:
 
-    $ mvn -Phadoop2 -Dsuites=spark.repl.ReplSuite test
+    $ mvn -Dhadoop.version=... -Dsuites=spark.repl.ReplSuite test
 
 
 ## Setting up JVM Memory Usage Via Maven ##
@@ -53,6 +57,15 @@ To fix these, you can do the following:
     export MAVEN_OPTS="-Xmx1024m -XX:MaxPermSize=128M"
 
 
+## Continuous Compilation ##
+
+We use the scala-maven-plugin, which supports incremental and continuous compilation. For example:
+
+    $ mvn scala:cc
+
+…should run continuous compilation (i.e. wait for changes). However, this has not been tested extensively.
+
+
 ## Using With IntelliJ IDEA ##
 
 This setup works fine in IntelliJ IDEA 11.1.4. After opening the project via the pom.xml file in the project root folder, you only need to activate either the hadoop1 or hadoop2 profile in the "Maven Properties" popout. We have not tried Eclipse/Scala IDE with this.
@@ -61,6 +74,6 @@ This setup works fine in IntelliJ IDEA 11.1.4. After opening the project via the
 
 It includes support for building a Debian package containing a 'fat-jar' which includes the repl, the examples and bagel. This can be created by specifying the deb profile:
 
-    $ mvn -Phadoop2,deb clean install
+    $ mvn -Pdeb clean install
 
 The debian package can then be found under repl/target. We added the short commit hash to the file name so that we can distinguish individual packages build for SNAPSHOT versions.
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 9c2cedfd88..6bada9bdd7 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -6,7 +6,7 @@ title: Launching Spark on YARN
 Experimental support for running over a [YARN (Hadoop
 NextGen)](http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/YARN.html)
 cluster was added to Spark in version 0.6.0.  This was merged into master as part of 0.7 effort.
-To build spark core with YARN support, please use the hadoop2-yarn profile.
+To build Spark with YARN support, please use the hadoop2-yarn profile.
 Ex:  mvn -Phadoop2-yarn clean install
 
 # Building spark core consolidated jar.
@@ -15,18 +15,12 @@ We need a consolidated spark core jar (which bundles all the required dependenci
 This can be built either through sbt or via maven.
 
 -   Building spark assembled jar via sbt.
-    It is a manual process of enabling it in project/SparkBuild.scala.
-Please comment out the
-  HADOOP_VERSION, HADOOP_MAJOR_VERSION and HADOOP_YARN
-variables before the line 'For Hadoop 2 YARN support'
-Next, uncomment the subsequent 3 variable declaration lines (for these three variables) which enable hadoop yarn support.
+Enable YARN support by setting `SPARK_WITH_YARN=true` when invoking sbt:
 
-Assembly of the jar Ex:
-
-    ./sbt/sbt clean assembly
+    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_WITH_YARN=true ./sbt/sbt clean assembly
 
 The assembled jar would typically be something like :
-`./core/target/spark-core-assembly-0.8.0-SNAPSHOT.jar`
+`./yarn/target/spark-yarn-assembly-0.8.0-SNAPSHOT.jar`
 
 
 -   Building spark assembled jar via Maven.
@@ -34,16 +28,16 @@ The assembled jar would typically be something like :
 
 Something like this. Ex:
 
-    mvn -Phadoop2-yarn clean package -DskipTests=true
+    mvn -Phadoop2-yarn -Dhadoop.version=2.0.5-alpha clean package -DskipTests=true
 
 
 This will build the shaded (consolidated) jar. Typically something like :
-`./repl-bin/target/spark-repl-bin-<VERSION>-shaded-hadoop2-yarn.jar`
+`./yarn/target/spark-yarn-bin-<VERSION>-shaded.jar`
 
 
 # Preparations
 
-- Building spark core assembled jar (see above).
+- Building the spark-yarn assembly (see above).
 - Your application code must be packaged into a separate JAR file.
 
 If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt/sbt package`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.
-- 
cgit v1.2.3