author     Matei Zaharia <matei@eecs.berkeley.edu>  2012-10-09 14:30:23 -0700
committer  Matei Zaharia <matei@eecs.berkeley.edu>  2012-10-09 14:30:23 -0700
commit     bc0bc672d02e8f5f12cd1e14863db36c42acff96 (patch)
tree       826f2673c093d3a982cfe6f96242725ff0a2089f /docs
parent     ad28aebb0adfe3710bfcf741fbc9105282ee67a8 (diff)
Updates to documentation:

- Edited quick start and tuning guide to simplify them a little
- Simplified top menu bar
- Made private a SparkContext constructor parameter that was left as public
- Various small fixes
Diffstat (limited to 'docs')
-rwxr-xr-x  docs/_layouts/global.html         10
-rwxr-xr-x  docs/css/main.css                  3
-rw-r--r--  docs/index.md                     11
-rw-r--r--  docs/java-programming-guide.md     2
-rw-r--r--  docs/quick-start.md               94
-rw-r--r--  docs/scala-programming-guide.md    8
-rw-r--r--  docs/tuning.md                   105
7 files changed, 123 insertions, 110 deletions
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index e9637dc150..dbae371513 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -43,8 +43,8 @@
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quick-start.html">Quick Start</a></li>
- <li><a href="scala-programming-guide.html">Scala Programming Guide</a></li>
- <li><a href="java-programming-guide.html">Java Programming Guide</a></li>
+ <li><a href="scala-programming-guide.html">Scala</a></li>
+ <li><a href="java-programming-guide.html">Java</a></li>
</ul>
</li>
@@ -55,8 +55,8 @@
<ul class="dropdown-menu">
<li><a href="ec2-scripts.html">Amazon EC2</a></li>
<li><a href="spark-standalone.html">Standalone Mode</a></li>
- <li><a href="running-on-mesos.html">On Mesos</a></li>
- <li><a href="running-on-yarn.html">On YARN</a></li>
+ <li><a href="running-on-mesos.html">Mesos</a></li>
+ <li><a href="running-on-yarn.html">YARN</a></li>
</ul>
</li>
@@ -69,8 +69,8 @@
<li><a href="contributing-to-spark.html">Contributing to Spark</a></li>
</ul>
</li>
- <p class="navbar-text pull-right"><span class="version-text">v{{site.SPARK_VERSION}}</span></p>
</ul>
+ <!--<p class="navbar-text pull-right"><span class="version-text">v{{site.SPARK_VERSION}}</span></p>-->
</div>
</div>
</div>
diff --git a/docs/css/main.css b/docs/css/main.css
index 83fc7c8ec9..f9f5c7f8dd 100755
--- a/docs/css/main.css
+++ b/docs/css/main.css
@@ -25,8 +25,7 @@
}
.navbar-text .version-text {
- border: solid thin lightgray;
- border-radius: 6px;
+ color: #555555;
padding: 5px;
margin-left: 10px;
}
diff --git a/docs/index.md b/docs/index.md
index 92a7aef5a5..791be4c097 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -15,7 +15,7 @@ Amazon EC2, or without an independent resource manager ("standalone mode").
# Downloading
-Get Spark by visiting the [downloads page](http://spark-project.org/downloads.html) of the Spark website. This documentation corresponds to Spark {{site.SPARK_VERSION}}.
+Get Spark by visiting the [downloads page](http://spark-project.org/downloads.html) of the Spark website. This documentation is for Spark version {{site.SPARK_VERSION}}.
# Building
@@ -54,19 +54,16 @@ of `project/SparkBuild.scala`, then rebuilding Spark (`sbt/sbt clean compile`).
# Where to Go from Here
-**Quick start:**
-
-* [Spark Quick Start](quick-start.html): a quick intro to the Spark API
-
**Programming guides:**
-* [Spark Programming Guide](scala-programming-guide.html): how to get started using Spark, and details on the Scala API
+* [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
+* [Spark Programming Guide](scala-programming-guide.html): an overview of Spark concepts, and details on the Scala API
* [Java Programming Guide](java-programming-guide.html): using Spark from Java
**Deployment guides:**
* [Running Spark on Amazon EC2](ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
-* [Standalone Deploy Mode](spark-standalone.html): launch a standalone cluster quickly without Mesos
+* [Standalone Deploy Mode](spark-standalone.html): launch a standalone cluster quickly without a third-party cluster manager
* [Running Spark on Mesos](running-on-mesos.html): deploy a private cluster using
[Apache Mesos](http://incubator.apache.org/mesos)
* [Running Spark on YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
diff --git a/docs/java-programming-guide.md b/docs/java-programming-guide.md
index 9b870e4081..24aa2d5c6b 100644
--- a/docs/java-programming-guide.md
+++ b/docs/java-programming-guide.md
@@ -1,6 +1,6 @@
---
layout: global
-title: Spark Java Programming Guide
+title: Java Programming Guide
---
The Spark Java API exposes all the Spark features available in the Scala version to Java.
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 51e60426b5..7d35fb01bb 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -1,35 +1,34 @@
---
layout: global
-title: Spark Quick Start
+title: Quick Start
---
* This will become a table of contents (this text will be scraped).
{:toc}
-# Introduction
+This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's interactive Scala shell (don't worry if you don't know Scala -- you will not need much for this), then show how to write standalone jobs in Scala and Java. See the [programming guide](scala-programming-guide.html) for a fuller reference.
-This document provides a quick-and-dirty look at Spark's API. See the [programming guide](scala-programming-guide.html) for a complete reference. To follow along with this guide, you only need to have successfully built Spark on one machine. Building Spark is as simple as running
+To follow along with this guide, you only need to have successfully built Spark on one machine. Simply go into your Spark directory and run:
{% highlight bash %}
$ sbt/sbt package
{% endhighlight %}
-from within the Spark directory.
+# Interactive Analysis with the Spark Shell
-# Interactive Data Analysis with the Spark Shell
+## Basics
-## Shell basics
+Spark's interactive shell provides a simple way to learn the API, as well as a powerful tool to analyze datasets interactively.
+Start the shell by running `./spark-shell` in the Spark directory.
-Start the Spark shell by executing `./spark-shell` in the Spark directory.
-
-Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDD's can be created from Hadoop InputFormat's (such as HDFS files) or by transforming other RDD's. Let's make a new RDD derived from the text of the README file in the Spark source directory:
+Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's make a new RDD from the text of the README file in the Spark source directory:
{% highlight scala %}
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
{% endhighlight %}
-RDD's have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDD's. Let's start with a few actions:
+RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
{% highlight scala %}
scala> textFile.count() // Number of items in this RDD
@@ -39,11 +38,11 @@ scala> textFile.first() // First item in this RDD
res1: String = # Spark
{% endhighlight %}
-Now let's use a transformation. We will use the [filter](scala-programming-guide.html#transformations)() transformation to return a new RDD with a subset of the items in the file.
+Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
{% highlight scala %}
-scala> val sparkLinesOnly = textFile.filter(line => line.contains("Spark"))
-sparkLinesOnly: spark.RDD[String] = spark.FilteredRDD@7dd4af09
+scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
+linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
{% endhighlight %}
We can chain together transformations and actions:
@@ -53,18 +52,18 @@ scala> textFile.filter(line => line.contains("Spark")).count() // How many lines
res3: Long = 15
{% endhighlight %}
-## Data flow
+## Transformations
RDD transformations can be used for more complex computations. Let's say we want to find the line with the most words:
{% highlight scala %}
-scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a < b) {b} else {a})
+scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Long = 16
{% endhighlight %}
-This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to [map](scala-programming-guide.html#transformations)() and [reduce](scala-programming-guide.html#actions)() are scala closures. We can easily include functions declared elsewhere, or include existing functions in our anonymous closures. For instance, we can use `Math.max()` to make this code easier to understand.
+This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to `map` and `reduce` are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We'll use the `Math.max()` function to make this code easier to understand:
{% highlight scala %}
-scala> import java.lang.Math;
+scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
@@ -74,36 +73,35 @@ res5: Int = 16
One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
{% highlight scala %}
-scala> val wordCountRDD = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((c1, c2) => c1 + c2)
-wordCountRDD: spark.RDD[(java.lang.String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
+scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
+wordCounts: spark.RDD[(java.lang.String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
{% endhighlight %}
-Here, we combined the [flatMap](scala-programming-guide.html#transformations)(), [map](scala-programming-guide.html#transformations)() and [reduceByKey](scala-programming-guide.html#transformations)() transformations to create per-word counts in the file. To collect the word counts in our shell, we can use the [collect](scala-programming-guide.html#actions)() action:
+Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
{% highlight scala %}
-scala> wordCountRDD.collect()
+scala> wordCounts.collect()
res6: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...
{% endhighlight %}
## Caching
-Spark also supports pulling data sets into a cluster-wide cache. This is very useful when data is accessed iteratively, such as in machine learning jobs, or repeatedly, such as when small "hot data" is queried repeatedly. As a simple example, let's pull part of our file into memory:
-
+Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small "hot" dataset or when running an iterative algorithm like PageRank. As a simple example, let's mark our `linesWithSpark` dataset to be cached:
{% highlight scala %}
-scala> val linesWithSparkCached = linesWithSpark.cache()
-linesWithSparkCached: spark.RDD[String] = spark.FilteredRDD@17e51082
-
-scala> linesWithSparkCached.count()
-res7: Long = 15
+scala> linesWithSpark.cache()
+res7: spark.RDD[String] = spark.FilteredRDD@17e51082
-scala> linesWithSparkCached.count()
+scala> linesWithSpark.count()
res8: Long = 15
+
+scala> linesWithSpark.count()
+res9: Long = 15
{% endhighlight %}
-It may seem silly to use a Spark to explore and cache a 30 line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes.
+It may seem silly to use Spark to explore and cache a 30-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting `spark-shell` to a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
-# A Spark Job in Scala
-Now say we wanted to write custom job using the Spark API. We will walk through a simple job in both Scala (with sbt) and Java (with maven). If you using other build systems, please reference the Spark assembly jar in the developer guide. The first step is to publish Spark to our local Ivy/Maven repositories. From the Spark directory:
+# A Standalone Job in Scala
+Now say we wanted to write a standalone job using the Spark API. We will walk through a simple job in both Scala (with sbt) and Java (with Maven). If you are using other build systems, please reference the Spark assembly JAR in the developer guide. The first step is to publish Spark to our local Ivy/Maven repositories. From the Spark directory:
{% highlight bash %}
$ sbt/sbt publish-local
@@ -117,7 +115,7 @@ import spark.SparkContext
import SparkContext._
object SimpleJob extends Application {
- val logFile = "/var/log/syslog" // Should be some log file on your system
+ val logFile = "/var/log/syslog" // Should be some file on your system
val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
"target/scala-{{site.SCALA_VERSION}}/simple-project_{{site.SCALA_VERSION}}-1.0.jar")
val logData = sc.textFile(logFile, 2).cache()
@@ -127,7 +125,7 @@ object SimpleJob extends Application {
}
{% endhighlight %}
-This job simply counts the number of lines containing 'a' and the number containing 'b' in a system log file. Unlike the earlier examples with the Spark Shell, which initializes its own SparkContext, we initialize a SparkContext as part of the job. We pass the SparkContext constructor four arguments, the type of scheduler we want to use (in this case, a local scheduler), a name for the job, the directory where Spark is installed, and a name for the jar file containing the job's sources. The final two arguments are needed in a distributed setting, where Spark is running across several nodes, so we include them for completeness. Spark will automatically ship the jar files you list to slave nodes.
+This job simply counts the number of lines containing 'a' and the number containing 'b' in a system log file. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, we initialize a SparkContext as part of the job. We pass the SparkContext constructor four arguments: the type of scheduler we want to use (in this case, a local scheduler), a name for the job, the directory where Spark is installed, and a name for the jar file containing the job's sources. The final two arguments are needed in a distributed setting, where Spark is running across several nodes, so we include them for completeness. Spark will automatically ship the jar files you list to slave nodes.
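
For reference, here is the same constructor call as in the code above with each argument labeled; this is just a restatement of the snippet, not additional API:

{% highlight scala %}
val sc = new SparkContext(
  "local",              // the type of scheduler to use (here, a local scheduler)
  "Simple Job",         // a name for the job
  "$YOUR_SPARK_HOME",   // the directory where Spark is installed
  "target/scala-{{site.SCALA_VERSION}}/simple-project_{{site.SCALA_VERSION}}-1.0.jar") // jar containing the job's sources
{% endhighlight %}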
This file depends on the Spark API, so we'll also include an sbt configuration file, `simple.sbt` which explains that Spark is a dependency:
@@ -141,7 +139,7 @@ scalaVersion := "{{site.SCALA_VERSION}}"
libraryDependencies += "org.spark-project" %% "spark-core" % "{{site.SPARK_VERSION}}"
{% endhighlight %}
-Of course, for sbt to work correctly, we'll need to layout `SimpleJob.scala` and `simple.sbt` according to the typical directory structure. Once that is in place, we can create a jar package containing the job's code, then use `sbt run` to execute our example job.
+Of course, for sbt to work correctly, we'll need to lay out `SimpleJob.scala` and `simple.sbt` according to the typical directory structure. Once that is in place, we can create a JAR package containing the job's code, then use `sbt run` to execute our example job.
{% highlight bash %}
$ find .
@@ -152,21 +150,21 @@ $ find .
./src/main/scala
./src/main/scala/SimpleJob.scala
-$ sbt clean package
+$ sbt package
$ sbt run
...
Lines with a: 8422, Lines with b: 1836
{% endhighlight %}
-This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation and consider using a distributed input source, such as HDFS.
+This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
-# A Spark Job In Java
-Now say we wanted to write custom job using the Spark API. We will walk through a simple job in both Scala (with sbt) and Java (with maven). If you using other build systems, please reference the Spark assembly jar in the developer guide. The first step is to publish Spark to our local Ivy/Maven repositories. From the Spark directory:
+# A Standalone Job In Java
+Now say we wanted to write a standalone job using the Java API. We will walk through doing this with Maven. If you are using other build systems, please reference the Spark assembly JAR in the developer guide. The first step is to publish Spark to our local Ivy/Maven repositories. From the Spark directory:
{% highlight bash %}
$ sbt/sbt publish-local
{% endhighlight %}
-Next, we'll create a very simple Spark job in Scala. So simple, in fact, that it's named `SimpleJob.java`:
+Next, we'll create a very simple Spark job, `SimpleJob.java`:
{% highlight java %}
/*** SimpleJob.java ***/
@@ -175,7 +173,7 @@ import spark.api.java.function.Function;
public class SimpleJob {
public static void main(String[] args) {
- String logFile = "/var/log/syslog"; // Should be some log file on your system
+ String logFile = "/var/log/syslog"; // Should be some file on your system
JavaSparkContext sc = new JavaSparkContext("local", "Simple Job",
"$YOUR_SPARK_HOME", "target/simple-project-1.0.jar");
JavaRDD<String> logData = sc.textFile(logFile).cache();
@@ -188,15 +186,14 @@ public class SimpleJob {
public Boolean call(String s) { return s.contains("b"); }
}).count();
- System.out.println(String.format(
- "Lines with a: %s, Lines with b: %s", numAs, numBs));
+ System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
}
}
{% endhighlight %}
-This job simply counts the number of lines containing 'a' and the number containing 'b' in a system log file. Unlike the earlier examples with the Spark Shell, which initializes its own SparkContext, we initialize a SparkContext as part of the job. We pass the SparkContext constructor four arguments, the type of scheduler we want to use (in this case, a local scheduler), a name for the job, the directory where Spark is installed, and a name for the jar file containing the job's sources. The final two arguments are needed in a distributed setting, where Spark is running across several nodes, so we include them for completeness. Spark will automatically ship the jar files you list to slave nodes.
+This job simply counts the number of lines containing 'a' and the number containing 'b' in a system log file. Note that, as in the Scala example, we initialize a SparkContext, though we use the special `JavaSparkContext` class to get a Java-friendly one. We also create RDDs (represented by `JavaRDD`) and run transformations on them. Finally, we pass functions to Spark by creating classes that extend `spark.api.java.function.Function`. The [Java programming guide](java-programming-guide.html) describes these differences in more detail.
-Our Maven `pom.xml` file will list Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.
+To build the job, we also write a Maven `pom.xml` file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.
{% highlight xml %}
<project>
@@ -216,7 +213,7 @@ Our Maven `pom.xml` file will list Spark as a dependency. Note that Spark artifa
</project>
{% endhighlight %}
-To make Maven happy, we lay out these files according to the canonical directory structure:
+We lay out these files according to the canonical Maven directory structure:
{% highlight bash %}
$ find .
./pom.xml
@@ -229,11 +226,10 @@ $ find .
Now, we can execute the job using Maven:
{% highlight bash %}
-$ mvn clean package
+$ mvn package
$ mvn exec:java -Dexec.mainClass="SimpleJob"
...
Lines with a: 8422, Lines with b: 1836
{% endhighlight %}
-This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation and consider using a distributed input source, such as HDFS.
-
+This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 76a1957efa..57a2c04b16 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -1,6 +1,6 @@
---
layout: global
-title: Spark Scala Programming Guide
+title: Scala Programming Guide
---
* This will become a table of contents (this text will be scraped).
@@ -37,7 +37,11 @@ new SparkContext(master, jobName, [sparkHome], [jars])
The `master` parameter is a string specifying a [Mesos](running-on-mesos.html) cluster to connect to, or a special "local" string to run in local mode, as described below. `jobName` is a name for your job, which will be shown in the Mesos web UI when running on a cluster. Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described later.
-In the Spark interpreter, a special interpreter-aware SparkContext is already created for you, in the variable called `sc`. Making your own SparkContext will not work. You can set which master the context connects to using the `MASTER` environment variable. For example, run `MASTER=local[4] ./spark-shell` to run locally with four cores.
+In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called `sc`. Making your own SparkContext will not work. You can set which master the context connects to using the `MASTER` environment variable. For example, to run on four cores, use
+
+{% highlight bash %}
+$ MASTER=local[4] ./spark-shell
+{% endhighlight %}
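
In a standalone program, the equivalent is to pass the master string to the `SparkContext` constructor directly. A minimal sketch (the job name is a placeholder, and the optional `sparkHome` and `jars` parameters are omitted):

{% highlight scala %}
import spark.SparkContext

// A sketch: running a standalone program locally with four cores, the
// programmatic counterpart of MASTER=local[4] for the shell.
val sc = new SparkContext("local[4]", "My Job")
{% endhighlight %}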
### Master URLs
diff --git a/docs/tuning.md b/docs/tuning.md
index 58b52b3376..f18de8ff3a 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -3,24 +3,22 @@ layout: global
title: Tuning Spark
---
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked
by any resource in the cluster: CPU, network bandwidth, or memory.
Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you
also need to do some tuning, such as
[storing RDDs in serialized form](scala-programming-guide.html#rdd-persistence), to
-make the memory usage smaller.
+decrease memory usage.
This guide will cover two main topics: data serialization, which is crucial for good network
-performance, and memory tuning. We also sketch several smaller topics.
-
-This document assumes that you have familiarity with the Spark API and have already read the [Scala](scala-programming-guide.html) or [Java](java-programming-guide.html) programming guides. After reading this guide, do not hesitate to reach out to the [Spark mailing list](http://groups.google.com/group/spark-users) with performance related concerns.
-
-# The Spark Storage Model
-Spark's key abstraction is a distributed dataset, or RDD. RDD's consist of partitions. RDD partitions are stored either in memory or on disk, with replication or without replication, depending on the chosen [persistence options](scala-programming-guide.html#rdd-persistence). When RDD's are stored in memory, they can be stored as deserialized Java objects, or in a serialized form, again depending on the persistence option chosen. When RDD's are transferred over the network, or spilled to disk, they are always serialized. Spark can use different serializers, configurable with the `spark.serializer` option.
+performance and can also reduce memory use, and memory tuning. We also sketch several smaller topics.
-# Serialization Options
+# Data Serialization
-Serialization plays an important role in the performance of Spark applications, especially those which are network-bound. The format of data sent across
-the network -- formats that are slow to serialize objects into, or consume a large number of
+Serialization plays an important role in the performance of any distributed application.
+Formats that are slow to serialize objects into, or consume a large number of
bytes, will greatly slow down the computation.
Often, this will be the first thing you should tune to optimize a Spark application.
Spark aims to strike a balance between convenience (allowing you to work with any Java type
@@ -78,11 +76,6 @@ There are three considerations in tuning memory usage: the *amount* of memory us
(you may want your entire dataset to fit in memory), the *cost* of accessing those objects, and the
overhead of *garbage collection* (if you have high turnover in terms of objects).
-## Determining memory consumption
-The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the master logs. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD. Depending on the object complexity in your dataset, and whether you are storing serialized data, memory overhead can range from 1X (e.g. no overhead vs on-disk storage) to 5X. This guide covers ways to keep memory overhead low, in cases where memory is a contended resource.
-
-## Efficient Data Structures
-
By default, Java objects are fast to access, but can easily consume a factor of 2-5x more space
than the "raw" data inside their fields. This is due to several reasons:
@@ -97,59 +90,84 @@ than the "raw" data inside their fields. This is due to several reasons:
but also pointers (typically 8 bytes each) to the next object in the list.
* Collections of primitive types often store them as "boxed" objects such as `java.lang.Integer`.
-There are several ways to reduce this cost and still make Java objects efficient to work with:
+This section will discuss how to determine the memory usage of your objects, and how to improve
+it -- either by changing your data structures, or by storing data in a serialized format.
+We will then cover tuning Spark's cache size and the Java garbage collector.
+
+## Determining Memory Consumption
+
+The best way to size the amount of memory your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD. You will see messages like this:
+
+ INFO BlockManagerMasterActor: Added rdd_0_1 in memory on mbk.local:50311 (size: 717.5 KB, free: 332.3 MB)
+
+This means that partition 1 of RDD 0 consumed 717.5 KB.
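
As a minimal sketch of how to produce these messages (assuming an existing SparkContext `sc`, as in the shell examples, and a placeholder input file):

{% highlight scala %}
// Cache an RDD and force it to be computed; the driver's logs will then show
// an "Added rdd_..." line for each cached partition.
val data = sc.textFile("README.md")   // placeholder input
data.cache()
data.count()                          // materializes the cache
{% endhighlight %}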
+
+## Tuning Data Structures
+
+The first way to reduce memory consumption is to avoid the Java features that add overhead, such as
+pointer-based data structures and wrapper objects. There are several ways to do this:
1. Design your data structures to prefer arrays of objects, and primitive types, instead of the
standard Java or Scala collection classes (e.g. `HashMap`). The [fastutil](http://fastutil.di.unimi.it)
library provides convenient collection classes for primitive types that are compatible with the
Java standard library.
2. Avoid nested structures with a lot of small objects and pointers when possible.
-3. If you have less than 32 GB of RAM, set the JVM flag `-XX:+UseCompressedOops` to make pointers be
+3. Consider using numeric IDs or enumeration objects instead of strings for keys.
+4. If you have less than 32 GB of RAM, set the JVM flag `-XX:+UseCompressedOops` to make pointers be
four bytes instead of eight. Also, on Java 7 or later, try `-XX:+UseCompressedStrings` to store
ASCII strings as just 8 bits per character. You can add these options in
[`spark-env.sh`](configuration.html#environment-variables-in-spark-envsh).
+## Serialized RDD Storage
+
When your objects are still too large to efficiently store despite this tuning, a much simpler way
to reduce memory usage is to store them in *serialized* form, using the serialized StorageLevels in
-the [RDD persistence API](scala-programming-guide#rdd-persistence).
+the [RDD persistence API](scala-programming-guide.html#rdd-persistence), such as `MEMORY_ONLY_SER`.
Spark will then store each RDD partition as one large byte array.
The only downside of storing data in serialized form is slower access times, due to having to
deserialize each object on the fly.
We highly recommend [using Kryo](#data-serialization) if you want to cache data in serialized form, as
it leads to much smaller sizes than Java serialization (and certainly than raw Java objects).
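
A minimal sketch of caching in serialized form, assuming an existing SparkContext `sc`; the import path for `StorageLevel` is an assumption about this version's package layout:

{% highlight scala %}
import spark.storage.StorageLevel

// Persist the RDD's partitions as serialized byte arrays rather than
// deserialized Java objects, then force it to be computed.
val lines = sc.textFile("README.md")         // placeholder input
lines.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized in-memory storage
lines.count()                                // materializes the serialized cache
{% endhighlight %}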
-Finally, JVM garbage collection can be a problem when you have large "churn" in terms of the RDDs
-stored by your program. (It is generally not a problem in programs that just read an RDD once
+## Garbage Collection Tuning
+
+JVM garbage collection can be a problem when you have large "churn" in terms of the RDDs
+stored by your program. (It is usually not a problem in programs that just read an RDD once
and then run many operations on it.) When Java needs to evict old objects to make room for new ones, it will
need to trace through all your Java objects and find the unused ones. The main point to remember here is
that *the cost of garbage collection is proportional to the number of Java objects*, so using data
-structures with fewer objects (e.g. an array of `Int`s instead of a `LinkedList`) greatly reduces
+structures with fewer objects (e.g. an array of `Int`s instead of a `LinkedList`) greatly lowers
this cost. An even better method is to persist objects in serialized form, as described above: now
-there will be only *one* object (a byte array) per RDD partition. Before trying other advanced
-techniques, the first thing to try if GC is a problem is to use serialized caching.
+there will be only *one* object (a byte array) per RDD partition. Before trying other
+techniques, the first thing to try if GC is a problem is to use [serialized caching](#serialized-rdd-storage).
+GC can also be a problem due to interference between your tasks' working memory (the
+amount of space needed to run the task) and the RDDs cached on your nodes. We will discuss how to control
+the space allocated to the RDD cache to mitigate this.
-## Cache Size Tuning
+**Measuring the Impact of GC**
-One of the important configuration parameters passed to Spark is the amount of memory that should be used for
+The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of
+time spent on GC. This can be done by adding `-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps` to your
+`SPARK_JAVA_OPTS` environment variable. The next time your Spark job is run, you will see messages printed in the worker's logs
+each time a garbage collection occurs. Note that these logs will be on your cluster's worker nodes (in the `stdout` files in
+their work directories), *not* on your driver program.
+
+**Cache Size Tuning**
+
+One important configuration parameter for GC is the amount of memory that should be used for
caching RDDs. By default, Spark uses 66% of the configured memory (`SPARK_MEM`) to cache RDDs. This means that
-around 33% of memory is available for any objects created during task execution.
+ 33% of memory is available for any objects created during task execution.
-In case your tasks slow down and you find that your JVM is using almost all of its allocated memory, lowering
-this value will help reducing the memory consumption. To change this to say 50%, you can call
+In case your tasks slow down and you find that your JVM is garbage-collecting frequently or running out of
+memory, lowering this value will help reduce the memory consumption. To change this to say 50%, you can call
`System.setProperty("spark.storage.memoryFraction", "0.5")`. Combined with the use of serialized caching,
using a smaller cache should be sufficient to mitigate most of the garbage collection problems.
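
A minimal sketch of making this change in a driver program (the job name is a placeholder, and it is assumed the property is set before the SparkContext is created):

{% highlight scala %}
import spark.SparkContext

// Lower the RDD cache from the default 66% to 50% of SPARK_MEM.
System.setProperty("spark.storage.memoryFraction", "0.5")
val sc = new SparkContext("local", "My Job")
{% endhighlight %}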
In case you are interested in further tuning the Java GC, continue reading below.
-## GC Tuning
-
-The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of
-time spent GC. This can be done by adding `-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps` to
-`SPARK_JAVA_OPTS` environment variable. Next time your Spark job is run, you will see messages printed on the
-console whenever a JVM garbage collection takes place. Note that garabage collections that occur at the executor can be
-found in the executor logs and not on the `spark-shell`.
+**Advanced GC Tuning**
-Some basic information about memory management in the JVM:
+To further tune garbage collection, we first need to understand some basic information about memory management in the JVM:
* Java Heap space is divided into two regions: Young and Old. The Young generation is meant to hold short-lived objects
while the Old generation is intended for objects with longer lifetimes.
@@ -160,7 +178,7 @@ Some basic information about memory management in the JVM:
that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor regions are swapped. If an object is old
enough or Survivor2 is full, it is moved to Old. Finally when Old is close to full, a full GC is invoked.
-The goal of GC-tuning in Spark is to ensure that only long-lived RDDs are stored in the Old generation and that
+The goal of GC tuning in Spark is to ensure that only long-lived RDDs are stored in the Old generation and that
the Young generation is sufficiently sized to store short-lived objects. This will help avoid full GCs to collect
temporary objects created during task execution. Some steps which may be useful are:
@@ -169,12 +187,12 @@ temporary objects created during task execution. Some steps which may be useful
* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching.
This can be done using the `spark.storage.memoryFraction` property. It is better to cache fewer objects than to slow
- down task execution !
+ down task execution!
-* If there are too many minor collections but not too many major GCs, allocating more memory for Eden would help. You
- can approximate the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden
-is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling
-up by 4/3 is to account for space used by survivor regions as well)
+* If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You
+ can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden
+ is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling
+ up by 4/3 is to account for space used by survivor regions as well.)
* As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using
the size of the data block read from HDFS. Note that the size of a decompressed block is often 2 or 3 times the
@@ -221,10 +239,9 @@ Spark prints the serialized size of each task on the master, so you can look at
decide whether your tasks are too large; in general tasks larger than about 20 KB are probably
worth optimizing.
-
# Summary
-This has been a quick guide to point out the main concerns you should know about when tuning a
+This has been a short guide to point out the main concerns you should know about when tuning a
Spark application -- most importantly, data serialization and memory tuning. For most programs,
switching to Kryo serialization and persisting data in serialized form will solve most common
performance issues. Feel free to ask on the