author     Ameet Talwalkar <atalwalkar@gmail.com>  2013-09-08 18:41:38 -0700
committer  Ameet Talwalkar <atalwalkar@gmail.com>  2013-09-08 18:41:38 -0700
commit     bf280c8b0faa542061a42f9ea56e93380b6d37f8 (patch)
tree       5640cfb29ccc3098e550e93233dda1e2e889a657 /docs
parent     5ac62dbbd0d604d699017a5956f3c79172e09896 (diff)
parent     f68848d95d896b578235c063be51483b4fce518e (diff)
Merge remote-tracking branch 'upstream/master'
Diffstat (limited to 'docs')
-rwxr-xr-x  docs/_layouts/global.html                   2
-rw-r--r--  docs/hadoop-third-party-distributions.md   76
-rw-r--r--  docs/index.md                                5
-rw-r--r--  docs/monitoring.md                          58
-rw-r--r--  docs/running-on-yarn.md                     31
-rw-r--r--  docs/spark-standalone.md                    20
6 files changed, 179 insertions, 13 deletions
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 84749fda4e..90928c8021 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -97,7 +97,9 @@
<a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="configuration.html">Configuration</a></li>
+ <li><a href="monitoring.html">Monitoring</a></li>
<li><a href="tuning.html">Tuning Guide</a></li>
+ <li><a href="hadoop-third-party-distributions.html">Running with CDH/HDP</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
<li><a href="building-with-maven.html">Building Spark with Maven</a></li>
<li><a href="contributing-to-spark.html">Contributing to Spark</a></li>
diff --git a/docs/hadoop-third-party-distributions.md b/docs/hadoop-third-party-distributions.md
new file mode 100644
index 0000000000..9f4f354525
--- /dev/null
+++ b/docs/hadoop-third-party-distributions.md
@@ -0,0 +1,76 @@
+---
+layout: global
+title: Running with Cloudera and Hortonworks Distributions
+---
+
+Spark can run against all versions of Cloudera's Distribution Including Hadoop (CDH) and
+the Hortonworks Data Platform (HDP). There are a few things to keep in mind when using Spark with
+these distributions:
+
+# Compile-time Hadoop Version
+When compiling Spark, you'll need to
+[set the HADOOP_VERSION flag](index.html#a-note-about-hadoop-versions):
+
+ HADOOP_VERSION=1.0.4 sbt/sbt assembly
+
+The table below lists the corresponding HADOOP_VERSION for each CDH/HDP release. Note that
+some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
+distribution may "just work" without you needing to compile. That said, we recommend compiling with
+the _exact_ Hadoop version you are running to avoid any compatibility errors.
+
+<table>
+ <tr valign="top">
+ <td>
+ <h3>CDH Releases</h3>
+ <table class="table" style="width:350px;">
+ <tr><th>Version</th><th>HADOOP_VERSION</th></tr>
+        <tr><td>CDH 4.X.X (YARN mode)</td><td>2.0.0-cdh4.X.X</td></tr>
+        <tr><td>CDH 4.X.X</td><td>2.0.0-mr1-cdh4.X.X</td></tr>
+ <tr><td>CDH 3u6</td><td>0.20.2-cdh3u6</td></tr>
+ <tr><td>CDH 3u5</td><td>0.20.2-cdh3u5</td></tr>
+ <tr><td>CDH 3u4</td><td>0.20.2-cdh3u4</td></tr>
+ </table>
+ </td>
+ <td>
+ <h3>HDP Releases</h3>
+ <table class="table" style="width:350px;">
+ <tr><th>Version</th><th>HADOOP_VERSION</th></tr>
+ <tr><td>HDP 1.3</td><td>1.2.0</td></tr>
+ <tr><td>HDP 1.2</td><td>1.1.2</td></tr>
+ <tr><td>HDP 1.1</td><td>1.0.3</td></tr>
+ <tr><td>HDP 1.0</td><td>1.0.3</td></tr>
+ </table>
+ </td>
+ </tr>
+</table>
+
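+For example, a sketch of compiling against a specific CDH4 MR1 release (the 4.2.0 below is
+illustrative; substitute the exact release you are running):
+
+    HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
+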
+# Where to Run Spark
+As described in the [Hardware Provisioning](hardware-provisioning.html#storage-systems) guide,
+Spark can run in a variety of deployment modes:
+
+* Using a dedicated set of Spark nodes in your cluster. These nodes should be co-located with your
+  Hadoop installation.
+* Running on the same nodes as an existing Hadoop installation, with a fixed amount of memory and
+  cores dedicated to Spark on each node.
+* Running Spark alongside Hadoop using a cluster resource manager, such as YARN or Mesos.
+
+These options are identical for those using CDH and HDP.
+
+# Inheriting Cluster Configuration
+If you plan to read from and write to HDFS using Spark, there are two Hadoop configuration files that
+should be included on Spark's classpath:
+
+* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
+* `core-site.xml`, which sets the default filesystem name.
+
+The location of these configuration files varies across CDH and HDP versions, but
+a common location is inside of `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create
+configurations on-the-fly, but offer a mechanism to download copies of them.
+
+There are a few ways to make these files visible to Spark:
+
+* You can copy these files into `$SPARK_HOME/conf` and they will be included in Spark's
+classpath automatically.
+* If you are running Spark on the same nodes as Hadoop _and_ your distribution includes both
+`hdfs-site.xml` and `core-site.xml` in the same directory, you can set `HADOOP_CONF_DIR`
+in `$SPARK_HOME/conf/spark-env.sh` to that directory.
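+
+For example, a sketch of both approaches, assuming the common `/etc/hadoop/conf` location
+mentioned above:
+
+    # Option 1: copy the files onto Spark's classpath
+    cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml $SPARK_HOME/conf/
+
+    # Option 2: point Spark at the Hadoop configuration directory instead
+    echo 'export HADOOP_CONF_DIR=/etc/hadoop/conf' >> $SPARK_HOME/conf/spark-env.sh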
diff --git a/docs/index.md b/docs/index.md
index 7d73929940..d3aacc629f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -46,6 +46,11 @@ Spark supports several options for deployment:
* [Apache Mesos](running-on-mesos.html)
* [Hadoop YARN](running-on-yarn.html)
+There is a script, `./make-distribution.sh`, which will create a binary distribution of Spark for deployment
+to any machine with only the Java runtime as a necessary dependency.
+Running the script creates a distribution directory in `dist/`; pass the `-tgz` option to also create a .tgz archive.
+Check the script for additional options.
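+
+For example, a minimal sketch (see the script itself for the full set of options):
+
+    ./make-distribution.sh    # creates a binary distribution under dist/
+    ls dist/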
+
# A Note About Hadoop Versions
Spark uses the Hadoop-client library to talk to HDFS and other Hadoop-supported
diff --git a/docs/monitoring.md b/docs/monitoring.md
new file mode 100644
index 0000000000..4c4f174503
--- /dev/null
+++ b/docs/monitoring.md
@@ -0,0 +1,58 @@
+---
+layout: global
+title: Monitoring and Instrumentation
+---
+
+There are several ways to monitor the progress of Spark jobs.
+
+# Web Interfaces
+When a SparkContext is initialized, it launches a web server (by default at port 3030) which
+displays useful information. This includes a list of active and completed scheduler stages,
+a summary of RDD blocks and partitions, and environmental information. If multiple SparkContexts
+are running on the same host, they will bind to successive ports beginning with 3030 (3031, 3032,
+etc).
+
+Spark's Standalone Mode scheduler also has its own
+[web interface](spark-standalone.html#monitoring-and-logging).
+
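+For example, a quick sanity check that the UI is up, assuming the driver is running on the
+local machine with the default port:
+
+    curl -s http://localhost:3030 | head
+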
+# Spark Metrics
+Spark has a configurable metrics system based on the
+[Coda Hale Metrics Library](http://metrics.codahale.com/).
+This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV
+files. The metrics system is configured via a configuration file that Spark expects to be present
+at `$SPARK_HOME/conf/metrics.conf`. A custom file location can be specified via the
+`spark.metrics.conf` Java system property. Spark's metrics are decoupled into different
+_instances_ corresponding to Spark components. Within each instance, you can configure a
+set of sinks to which metrics are reported. The following instances are currently supported:
+
+* `master`: The Spark standalone master process.
+* `applications`: A component within the master which reports on various applications.
+* `worker`: A Spark standalone worker process.
+* `executor`: A Spark executor.
+* `driver`: The Spark driver process (the process in which your SparkContext is created).
+
+Each instance can report to zero or more _sinks_. Sinks are contained in the
+`org.apache.spark.metrics.sink` package:
+
+* `ConsoleSink`: Logs metrics information to the console.
+* `CSVSink`: Exports metrics data to CSV files at regular intervals.
+* `GangliaSink`: Sends metrics to a Ganglia node or multicast group.
+* `JmxSink`: Registers metrics for viewing in a JMX console.
+* `MetricsServlet`: Adds a servlet within the existing Spark UI to serve metrics data as JSON data.
+
+The syntax of the metrics configuration file is defined in an example configuration file,
+`$SPARK_HOME/conf/metrics.conf.template`.
+
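+For example, a minimal sketch that reports every instance's metrics to the console (the exact
+property names are an assumption here; check them against the template before relying on them):
+
+    cp $SPARK_HOME/conf/metrics.conf.template $SPARK_HOME/conf/metrics.conf
+    {
+      echo '*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink'
+      echo '*.sink.console.period=10'
+      echo '*.sink.console.unit=seconds'
+    } >> $SPARK_HOME/conf/metrics.conf
+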
+# Advanced Instrumentation
+Several external tools can be used to help profile the performance of Spark jobs:
+
+* Cluster-wide monitoring tools, such as [Ganglia](http://ganglia.sourceforge.net/), can provide
+insight into overall cluster utilization and resource bottlenecks. For instance, a Ganglia
+dashboard can quickly reveal whether a particular workload is disk bound, network bound, or
+CPU bound.
+* OS profiling tools such as [dstat](http://dag.wieers.com/home-made/dstat/),
+[iostat](http://linux.die.net/man/1/iostat), and [iotop](http://linux.die.net/man/1/iotop)
+can provide fine-grained profiling on individual nodes.
+* JVM utilities such as `jstack` for providing stack traces, `jmap` for creating heap-dumps,
+`jstat` for reporting time-series statistics, and `jconsole` for visually exploring various JVM
+properties are useful for those comfortable with JVM internals.
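+
+For example, a sketch of inspecting an executor JVM on a worker node (how you locate the
+executor's PID depends on your deploy mode, so treat the `jps` step as illustrative):
+
+    jps -lm                      # locate the PID of the executor JVM
+    jstack <executor-pid>        # dump stack traces for all threads
+    jmap -histo <executor-pid>   # print a histogram of objects on the heap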
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 93421efcbc..c611db0af4 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -42,7 +42,7 @@ This would be used to connect to the cluster, write to the dfs and submit jobs t
The command to launch the YARN Client is as follows:
- SPARK_JAR=<SPARK_YARN_JAR_FILE> ./spark-class org.apache.spark.deploy.yarn.Client \
+ SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./spark-class org.apache.spark.deploy.yarn.Client \
--jar <YOUR_APP_JAR_FILE> \
--class <APP_MAIN_CLASS> \
--args <APP_MAIN_ARGUMENTS> \
@@ -54,14 +54,27 @@ The command to launch the YARN Client is as follows:
For example:
- SPARK_JAR=./yarn/target/spark-yarn-assembly-{{site.SPARK_VERSION}}.jar ./spark-class org.apache.spark.deploy.yarn.Client \
- --jar examples/target/scala-{{site.SCALA_VERSION}}/spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}.jar \
- --class org.apache.spark.examples.SparkPi \
- --args yarn-standalone \
- --num-workers 3 \
- --master-memory 4g \
- --worker-memory 2g \
- --worker-cores 1
+ # Build the Spark assembly JAR and the Spark examples JAR
+ $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true ./sbt/sbt assembly
+
+ # Configure logging
+ $ cp conf/log4j.properties.template conf/log4j.properties
+
+ # Submit Spark's ApplicationMaster to YARN's ResourceManager, and instruct Spark to run the SparkPi example
+ $ SPARK_JAR=./assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop2.0.5-alpha.jar \
+ ./spark-class org.apache.spark.deploy.yarn.Client \
+ --jar examples/target/scala-{{site.SCALA_VERSION}}/spark-examples-assembly-{{site.SPARK_VERSION}}.jar \
+ --class org.apache.spark.examples.SparkPi \
+ --args yarn-standalone \
+ --num-workers 3 \
+ --master-memory 4g \
+ --worker-memory 2g \
+ --worker-cores 1
+
+ # Examine the output (replace $YARN_APP_ID in the following with the "application identifier" output by the previous command)
+ # (Note: YARN_APP_LOGS_DIR is usually /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version.)
+ $ cat $YARN_APP_LOGS_DIR/$YARN_APP_ID/container*_000001/stdout
+ Pi is roughly 3.13794
The above starts a YARN client program which periodically polls the Application Master for status updates and displays them in the console. The client will exit once your application has finished running.
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index 994a96f2c9..69e1291580 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -3,13 +3,21 @@ layout: global
title: Spark Standalone Mode
---
-In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided [deploy scripts](#cluster-launch-scripts). It is also possible to run these daemons on a single machine for testing.
+In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided [launch scripts](#cluster-launch-scripts). It is also possible to run these daemons on a single machine for testing.
+
+# Installing Spark Standalone to a Cluster
+
+The easiest way to deploy Spark is by running the `./make-distribution.sh` script to create a binary distribution.
+This distribution can be deployed to any machine with the Java runtime installed; there is no need to install Scala.
+
+The recommended procedure is to deploy and start the master on one node first, get the master's Spark URL,
+then modify `conf/spark-env.sh` in the `dist/` directory before deploying to all the other nodes.
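+
+For example, a rough sketch of that procedure (the hostname and the use of `rsync` below are
+illustrative; any copy mechanism works):
+
+    ./make-distribution.sh
+    ./dist/bin/start-master.sh          # note the spark://HOST:PORT URL it prints
+    vi dist/conf/spark-env.sh           # set any options the workers should pick up
+    rsync -az dist/ worker1:/opt/spark  # copy the configured distribution to each worker node
+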
# Starting a Cluster Manually
You can start a standalone master server by executing:
- ./spark-class org.apache.spark.deploy.master.Master
+ ./bin/start-master.sh
Once started, the master will print out a `spark://HOST:PORT` URL for itself, which you can use to connect workers to it,
or pass as the "master" argument to `SparkContext`. You can also find this URL on
@@ -22,7 +30,7 @@ Similarly, you can start one or more workers and connect them to the master via:
Once you have started a worker, look at the master's web UI ([http://localhost:8080](http://localhost:8080) by default).
You should see the new node listed there, along with its number of CPUs and memory (minus one gigabyte left for the OS).
-Finally, the following configuration options can be passed to the master and worker:
+Finally, the following configuration options can be passed to the master and worker:
<table class="table">
<tr><th style="width:21%">Argument</th><th>Meaning</th></tr>
@@ -55,7 +63,7 @@ Finally, the following configuration options can be passed to the master and wor
# Cluster Launch Scripts
-To launch a Spark standalone cluster with the deploy scripts, you need to create a file called `conf/slaves` in your Spark directory, which should contain the hostnames of all the machines where you would like to start Spark workers, one per line. The master machine must be able to access each of the slave machines via password-less `ssh` (using a private key). For testing, you can just put `localhost` in this file.
+To launch a Spark standalone cluster with the launch scripts, you need to create a file called `conf/slaves` in your Spark directory, which should contain the hostnames of all the machines where you would like to start Spark workers, one per line. The master machine must be able to access each of the slave machines via password-less `ssh` (using a private key). For testing, you can just put `localhost` in this file.
Once you've set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop's deploy scripts, and available in `SPARK_HOME/bin`:
@@ -134,6 +142,10 @@ To run an interactive Spark shell against the cluster, run the following command
MASTER=spark://IP:PORT ./spark-shell
+Note that if you are running spark-shell from one of the Spark cluster machines, the `spark-shell` script will
+automatically set `MASTER` from the `SPARK_MASTER_IP` and `SPARK_MASTER_PORT` variables in `conf/spark-env.sh`.
+
+You can also pass the option `-c <numCores>` to control the number of cores that spark-shell uses on the cluster.
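+
+For example (the hostname, port, and core count below are illustrative):
+
+    MASTER=spark://master-host:7077 ./spark-shell -c 4
+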
# Job Scheduling