author     Matei Zaharia <matei@eecs.berkeley.edu>  2013-09-08 00:41:18 -0400
committer  Matei Zaharia <matei@eecs.berkeley.edu>  2013-09-08 00:29:11 -0700
commit     f261d2a60fe9c0ec81c7a93a24fd79062c31f7ae (patch)
tree       76b1b22ca6a947c4ba649cc6b85763ce5e8578b0 /docs
parent     651a96adf7b53085bd810e153f8eabf52eed1994 (diff)
Added cluster overview doc, made logo higher-resolution, and added more
details on monitoring
Diffstat (limited to 'docs')
-rwxr-xr-x  docs/_layouts/global.html        3
-rw-r--r--  docs/cluster-overview.md         65
-rw-r--r--  docs/img/cluster-overview.png    Bin 0 -> 28816 bytes
-rw-r--r--  docs/img/cluster-overview.pptx   Bin 0 -> 49698 bytes
-rw-r--r--  docs/img/spark-logo-hd.png       Bin 0 -> 13512 bytes
-rw-r--r--  docs/index.md                    5
-rw-r--r--  docs/monitoring.md               30
7 files changed, 88 insertions, 15 deletions
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 5034111ecb..238ad26de0 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -51,7 +51,7 @@
<div class="navbar-inner">
<div class="container">
<div class="brand"><a href="index.html">
- <img src="img/spark-logo-77x50px-hd.png" /></a><span class="version">{{site.SPARK_VERSION_SHORT}}</span>
+ <img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">{{site.SPARK_VERSION_SHORT}}</span>
</div>
<ul class="nav">
<!--TODO(andyk): Add class="active" attribute to li some how.-->
@@ -103,6 +103,7 @@
<li><a href="hadoop-third-party-distributions.html">Running with CDH/HDP</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
<li><a href="job-scheduling.html">Job Scheduling</a></li>
+ <li class="divider"></li>
<li><a href="building-with-maven.html">Building Spark with Maven</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li>
</ul>
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index 9e781bbf1f..143f93171f 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -3,3 +3,68 @@ layout: global
title: Cluster Mode Overview
---
+This document gives a short overview of how Spark runs on clusters, to make it easier to understand
+the components involved.
+
+# Components
+
+Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext
+object in your main program (called the _driver program_).
+Specifically, to run on a cluster, the SparkContext can connect to several types of _cluster managers_
+(either Spark's own standalone cluster manager or Mesos/YARN), which allocate resources across
+applications. Once connected, Spark acquires *executors* on nodes in the cluster, which are
+worker processes that run computations and store data for your application.
+Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to
+the executors. Finally, SparkContext sends *tasks* for the executors to run.
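+
+For example, a driver program written in Scala might create its SparkContext as follows (a
+minimal sketch; the master URL, Spark home path, and JAR name below are placeholders for
+illustration):
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+
+// Connect to a standalone cluster manager, listing the application JAR so that
+// Spark can ship the code to the executors it acquires.
+val sc = new SparkContext(
+  "spark://master-host:7077",    // cluster manager URL
+  "My App",                      // application name (shown in the web UI)
+  "/path/to/spark",              // Spark installation directory on the worker nodes
+  List("target/my-app.jar"))     // application code to distribute to executors
+{% endhighlight %}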
+
+<p style="text-align: center;">
+ <img src="img/cluster-overview.png" title="Spark cluster components" alt="Spark cluster components" />
+</p>
+
+There are several useful things to note about this architecture:
+
+1. Each application gets its own executor processes, which stay up for the duration of the whole
+ application and run tasks in multiple threads. This has the benefit of isolating applications
+ from each other, on both the scheduling side (each driver schedules its own tasks) and executor
+ side (tasks from different applications run in different JVMs). However, it also means that
+ data cannot be shared across different Spark applications (instances of SparkContext) without
+ writing it to an external storage system.
+2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor
+ processes, and these communicate with each other, it is relatively easy to run it even on a
+ cluster manager that also supports other applications (e.g. Mesos/YARN).
+3. Because the driver schedules tasks on the cluster, it should be run close to the worker
+ nodes, preferably on the same local area network. If you'd like to send requests to the
+ cluster remotely, it's better to open an RPC to the driver and have it submit operations
+ from nearby than to run a driver far away from the worker nodes.
+
+# Cluster Manager Types
+
+The system currently supports three cluster managers:
+
+* [Standalone](spark-standalone.html) -- a simple cluster manager included with Spark that makes it
+ easy to set up a cluster.
+* [Apache Mesos](running-on-mesos.html) -- a general cluster manager that can also run Hadoop MapReduce
+ and service applications.
+* [Hadoop YARN](running-on-yarn.html) -- the resource manager in Hadoop 2.0.
+
+In addition, Spark's [EC2 launch scripts](ec2-scripts.html) make it easy to launch a standalone
+cluster on Amazon EC2.
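+
+For the standalone and Mesos managers, the cluster manager is chosen through the master URL
+passed to SparkContext when it is created. For example (host names and ports below are
+placeholders):
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+
+// Pick one of the following, depending on the cluster manager:
+val sc = new SparkContext("local[4]", "My App")              // run locally with 4 worker threads
+// val sc = new SparkContext("spark://host:7077", "My App")  // standalone cluster master
+// val sc = new SparkContext("mesos://host:5050", "My App")  // Mesos master
+// (see the YARN guide for how to launch an application on YARN)
+{% endhighlight %}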
+
+# Shipping Code to the Cluster
+
+The recommended way to ship your code to the cluster is to pass it through SparkContext's constructor,
+which takes a list of JAR files (Java/Scala) or .egg and .zip libraries (Python) to disseminate to
+worker nodes. You can also dynamically add new files to be sent to executors with `SparkContext.addJar`
+and `addFile`.
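+
+For example, assuming `sc` is the application's SparkContext (the paths below are placeholders):
+
+{% highlight scala %}
+// Ship an extra dependency and a data file to the executors at runtime
+sc.addJar("/local/path/to/extra-lib.jar")
+sc.addFile("/local/path/to/lookup-table.txt")
+{% endhighlight %}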
+
+# Monitoring
+
+Each driver program has a web UI, typically on port 3030, that displays information about running
+tasks, executors, and storage usage. Simply go to `http://<driver-node>:3030` in a web browser to
+access this UI. The [monitoring guide](monitoring.html) also describes other monitoring options.
+
+# Job Scheduling
+
+Spark gives control over resource allocation both _across_ applications (at the level of the cluster
+manager) and _within_ applications (if multiple computations are happening on the same SparkContext).
+The [job scheduling overview](job-scheduling.html) describes this in more detail.
diff --git a/docs/img/cluster-overview.png b/docs/img/cluster-overview.png
new file mode 100644
index 0000000000..2a1cf02fcf
--- /dev/null
+++ b/docs/img/cluster-overview.png
Binary files differ
diff --git a/docs/img/cluster-overview.pptx b/docs/img/cluster-overview.pptx
new file mode 100644
index 0000000000..2a61db352d
--- /dev/null
+++ b/docs/img/cluster-overview.pptx
Binary files differ
diff --git a/docs/img/spark-logo-hd.png b/docs/img/spark-logo-hd.png
new file mode 100644
index 0000000000..1381e3004d
--- /dev/null
+++ b/docs/img/spark-logo-hd.png
Binary files differ
diff --git a/docs/index.md b/docs/index.md
index 1814cb19c8..bd386a8a8f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -48,11 +48,6 @@ options for deployment:
* [Apache Mesos](running-on-mesos.html)
* [Hadoop YARN](running-on-yarn.html)
-There is a script, `./make-distribution.sh`, which will create a binary distribution of Spark for deployment
-to any machine with only the Java runtime as a necessary dependency.
-Running the script creates a distribution directory in `dist/`, or the `-tgz` option to create a .tgz file.
-Check the script for additional options.
-
# A Note About Hadoop Versions
Spark uses the Hadoop-client library to talk to HDFS and other Hadoop-supported
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 0ec987107c..e9832e0466 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -3,19 +3,30 @@ layout: global
title: Monitoring and Instrumentation
---
-There are several ways to monitor the progress of Spark jobs.
+There are several ways to monitor Spark applications.
# Web Interfaces
-When a SparkContext is initialized, it launches a web server (by default at port 3030) which
-displays useful information. This includes a list of active and completed scheduler stages,
-a summary of RDD blocks and partitions, and environmental information. If multiple SparkContexts
-are running on the same host, they will bind to succesive ports beginning with 3030 (3031, 3032,
-etc).
-Spark's Standlone Mode scheduler also has its own
-[web interface](spark-standalone.html#monitoring-and-logging).
+Every SparkContext launches a web UI, by default on port 3030, that
+displays useful information about the application. This includes:
+
+* A list of scheduler stages and tasks
+* A summary of RDD sizes and memory usage
+* Information about the running executors
+* Environmental information.
+
+You can access this interface by simply opening `http://<driver-node>:3030` in a web browser.
+If multiple SparkContexts are running on the same host, they will bind to successive ports
+beginning with 3030 (3031, 3032, etc).
+
+Spark's Standalone Mode cluster manager also has its own
+[web UI](spark-standalone.html#monitoring-and-logging).
+
+Note that in both of these UIs, the tables are sortable by clicking their headers,
+making it easy to identify slow tasks, data skew, etc.
+
+# Metrics
-# Spark Metrics
Spark has a configurable metrics system based on the
[Coda Hale Metrics Library](http://metrics.codahale.com/).
This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV
@@ -35,6 +46,7 @@ The syntax of the metrics configuration file is defined in an example configurat
`$SPARK_HOME/conf/metrics.conf.template`.
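
For example, a minimal configuration might enable the console sink for all Spark components (a
sketch; see the template file for the full list of sinks and options):

    # Report metrics from all instances to the console every 10 seconds
    *.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
    *.sink.console.period=10
    *.sink.console.unit=seconds
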
# Advanced Instrumentation
+
Several external tools can be used to help profile the performance of Spark jobs:
* Cluster-wide monitoring tools, such as [Ganglia](http://ganglia.sourceforge.net/), can provide