author     Patrick Wendell <pwendell@gmail.com>   2013-07-31 21:35:12 -0700
committer  Patrick Wendell <pwendell@gmail.com>   2013-07-31 21:35:12 -0700
commit     5cc725a0e3ef523affae8ff54dd74707e49d64e3 (patch)
tree       ebd1698333d2df4194f17a9ea93a2f2eac2c7acd /docs
parent     b7b627d5bb1a1331ea580950834533f84735df4c (diff)
parent     f3cf09491a2b63e19a15e98cf815da503e4fb69b (diff)
Merge branch 'master' into ec2-updates
Conflicts: ec2/deploy.generic/root/mesos-ec2/ec2-variables.sh
Diffstat (limited to 'docs')
-rw-r--r--  docs/_plugins/copy_api_dirs.rb       | 19
-rw-r--r--  docs/configuration.md                | 65
-rw-r--r--  docs/ec2-scripts.md                  |  5
-rw-r--r--  docs/python-programming-guide.md     | 12
-rw-r--r--  docs/running-on-yarn.md              | 35
-rw-r--r--  docs/scala-programming-guide.md      | 12
-rw-r--r--  docs/spark-simple-tutorial.md        |  2
-rw-r--r--  docs/streaming-programming-guide.md  |  4
-rw-r--r--  docs/tuning.md                       |  6
9 files changed, 120 insertions, 40 deletions
diff --git a/docs/_plugins/copy_api_dirs.rb b/docs/_plugins/copy_api_dirs.rb
index d77e53963c..45ef4bba82 100644
--- a/docs/_plugins/copy_api_dirs.rb
+++ b/docs/_plugins/copy_api_dirs.rb
@@ -1,3 +1,20 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
require 'fileutils'
include FileUtils
@@ -18,7 +35,7 @@ if ENV['SKIP_API'] != '1'
# Copy over the scaladoc from each project into the docs directory.
# This directory will be copied over to _site when `jekyll` command is run.
projects.each do |project_name|
- source = "../" + project_name + "/target/scala-2.9.2/api"
+ source = "../" + project_name + "/target/scala-2.9.3/api"
dest = "api/" + project_name
puts "echo making directory " + dest
diff --git a/docs/configuration.md b/docs/configuration.md
index 17fdbf04d1..5c06897cae 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -22,26 +22,30 @@ the copy executable.
Inside `spark-env.sh`, you *must* set at least the following two variables:
-* `SCALA_HOME`, to point to your Scala installation.
+* `SCALA_HOME`, to point to your Scala installation, or `SCALA_LIBRARY_PATH` to point to the directory for Scala
+ library JARs (if you install Scala as a Debian or RPM package, there is no `SCALA_HOME`, but these libraries
+ are in a separate path, typically /usr/share/java; look for `scala-library.jar`).
* `MESOS_NATIVE_LIBRARY`, if you are [running on a Mesos cluster](running-on-mesos.html).
-In addition, there are four other variables that control execution. These can be set *either in `spark-env.sh`
-or in each job's driver program*, because they will automatically be propagated to workers from the driver.
-For a multi-user environment, we recommend setting the in the driver program instead of `spark-env.sh`, so
-that different user jobs can use different amounts of memory, JVM options, etc.
+In addition, there are four other variables that control execution. These should be set *in the environment that
+launches the job's driver program* instead of `spark-env.sh`, because they will be automatically propagated to
+workers. Setting these per-job instead of in `spark-env.sh` ensures that different jobs can have different settings
+for these variables.
-* `SPARK_MEM`, to set the amount of memory used per node (this should be in the same format as the
- JVM's -Xmx option, e.g. `300m` or `1g`)
* `SPARK_JAVA_OPTS`, to add JVM options. This includes any system properties that you'd like to pass with `-D`.
* `SPARK_CLASSPATH`, to add elements to Spark's classpath.
* `SPARK_LIBRARY_PATH`, to add search directories for native libraries.
+* `SPARK_MEM`, to set the amount of memory used per node. This should be in the same format as the
+ JVM's -Xmx option, e.g. `300m` or `1g`. Note that this option will soon be deprecated in favor of
+ the `spark.executor.memory` system property, so we recommend using that in new code.
-Note that if you do set these in `spark-env.sh`, they will override the values set by user programs, which
-is undesirable; you can choose to have `spark-env.sh` set them only if the user program hasn't, as follows:
+Beware that if you do set these variables in `spark-env.sh`, they will override the values set by user programs,
+which is undesirable; if you prefer, you can choose to have `spark-env.sh` set them only if the user program
+hasn't, as follows:
{% highlight bash %}
-if [ -z "$SPARK_MEM" ] ; then
- SPARK_MEM="1g"
+if [ -z "$SPARK_JAVA_OPTS" ] ; then
+ SPARK_JAVA_OPTS="-verbose:gc"
fi
{% endhighlight %}
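(Editorial aside, not part of this patch: a minimal sketch of the per-job approach recommended above, i.e. setting these variables in the environment that launches the driver program. The driver class and paths are hypothetical.)

{% highlight bash %}
# Hypothetical driver launch; adjust the paths and class name for your job.
export SPARK_JAVA_OPTS="-verbose:gc"              # extra JVM options / -D system properties
export SPARK_CLASSPATH="/opt/deps/extra-lib.jar"  # illustrative extra classpath entry
export SPARK_LIBRARY_PATH="/opt/native/libs"      # illustrative native library search directory
export SPARK_MEM=2g                               # soon deprecated in favor of spark.executor.memory
./run my.package.MyDriverApp                      # placeholder driver class
{% endhighlight %}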
@@ -55,11 +59,18 @@ val sc = new SparkContext(...)
{% endhighlight %}
Most of the configurable system properties control internal settings that have reasonable default values. However,
-there are at least four properties that you will commonly want to control:
+there are at least five properties that you will commonly want to control:
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
+ <td>spark.executor.memory</td>
+ <td>512m</td>
+ <td>
+ Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. `512m`, `2g`).
+ </td>
+</tr>
+<tr>
<td>spark.serializer</td>
<td>spark.JavaSerializer</td>
<td>
@@ -135,9 +146,16 @@ Apart from these, the following properties are also available, and may be useful
</tr>
<tr>
<td>spark.ui.port</td>
- <td>(random)</td>
+ <td>33000</td>
+ <td>
+    Port for your application's dashboard, which shows memory and workload data.
+ </td>
+</tr>
+<tr>
+ <td>spark.ui.retained_stages</td>
+ <td>1000</td>
<td>
- Port for your application's dashboard, which shows memory usage of each RDD.
+ How many stages the Spark UI remembers before garbage collecting.
</td>
</tr>
<tr>
@@ -180,8 +198,18 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td>spark.kryo.referenceTracking</td>
+ <td>true</td>
+ <td>
+ Whether to track references to the same object when serializing data with Kryo, which is
+ necessary if your object graphs have loops and useful for efficiency if they contain multiple
+ copies of the same object. Can be disabled to improve performance if you know this is not the
+ case.
+ </td>
+</tr>
+<tr>
<td>spark.kryoserializer.buffer.mb</td>
- <td>32</td>
+ <td>2</td>
<td>
Maximum object size to allow within Kryo (the library needs to create a buffer at least as
large as the largest single object you'll serialize). Increase this if you get a "buffer limit
@@ -260,6 +288,13 @@ Apart from these, the following properties are also available, and may be useful
applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
</td>
</tr>
+<tr>
+ <td>spark.streaming.blockInterval</td>
+ <td>200</td>
+ <td>
+    Duration (in milliseconds) over which to batch new objects coming from network receivers.
+ </td>
+</tr>
</table>
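(Editorial aside, not part of this patch: the properties in the table above are Java system properties, set in the driver before the SparkContext is created. A minimal sketch with illustrative values follows.)

{% highlight scala %}
// Illustrative values; set system properties before constructing the SparkContext.
System.setProperty("spark.executor.memory", "1g")          // memory per executor process
System.setProperty("spark.kryoserializer.buffer.mb", "8")  // raise if large objects hit the Kryo buffer limit
val sc = new SparkContext("local[4]", "ConfigExample")     // master URL and app name are placeholders
{% endhighlight %}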
diff --git a/docs/ec2-scripts.md b/docs/ec2-scripts.md
index bae41f9406..bd787e0e46 100644
--- a/docs/ec2-scripts.md
+++ b/docs/ec2-scripts.md
@@ -110,9 +110,8 @@ permissions on your private key file, you can run `launch` with the
# Configuration
You can edit `/root/spark/conf/spark-env.sh` on each machine to set Spark configuration options, such
-as JVM options and, most crucially, the amount of memory to use per machine (`SPARK_MEM`).
-This file needs to be copied to **every machine** to reflect the change. The easiest way to do this
-is to use a script we provide called `copy-dir`. First edit your `spark-env.sh` file on the master,
+as JVM options. This file needs to be copied to **every machine** to reflect the change. The easiest way to
+do this is to use a script we provide called `copy-dir`. First edit your `spark-env.sh` file on the master,
then run `~/spark-ec2/copy-dir /root/spark/conf` to RSYNC it to all the workers.
The [configuration guide](configuration.html) describes the available configuration options.
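(Editorial aside, not part of this patch: a short sketch of the edit-then-sync workflow described above, run from the master node; the JVM option appended here is illustrative.)

    # Illustrative edit on the master node:
    echo 'SPARK_JAVA_OPTS="-verbose:gc"' >> /root/spark/conf/spark-env.sh
    # Propagate the configuration directory to every worker:
    ~/spark-ec2/copy-dir /root/spark/conf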
diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md
index 3a7a8db4a6..e8aaac74d0 100644
--- a/docs/python-programming-guide.md
+++ b/docs/python-programming-guide.md
@@ -17,24 +17,23 @@ There are a few key differences between the Python and Scala APIs:
* Python is dynamically typed, so RDDs can hold objects of different types.
* PySpark does not currently support the following Spark features:
- Special functions on RDDs of doubles, such as `mean` and `stdev`
- - `lookup`
+ - `lookup`, `sample` and `sort`
- `persist` at storage levels other than `MEMORY_ONLY`
- - `sample`
- - `sort`
+ - Execution on Windows -- this is slated for a future release
In PySpark, RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types.
Short functions can be passed to RDD methods using Python's [`lambda`](http://www.diveintopython.net/power_of_introspection/lambda_functions.html) syntax:
{% highlight python %}
logData = sc.textFile(logFile).cache()
-errors = logData.filter(lambda s: 'ERROR' in s.split())
+errors = logData.filter(lambda line: "ERROR" in line)
{% endhighlight %}
You can also pass functions that are defined using the `def` keyword; this is useful for more complicated functions that cannot be expressed using `lambda`:
{% highlight python %}
def is_error(line):
- return 'ERROR' in line.split()
+ return "ERROR" in line
errors = logData.filter(is_error)
{% endhighlight %}
@@ -43,8 +42,7 @@ Functions can access objects in enclosing scopes, although modifications to thos
{% highlight python %}
error_keywords = ["Exception", "Error"]
def is_error(line):
- words = line.split()
- return any(keyword in words for keyword in error_keywords)
+ return any(keyword in line for keyword in error_keywords)
errors = logData.filter(is_error)
{% endhighlight %}
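(Editorial aside, not part of this patch: continuing the log-filtering example above, a small sketch showing that results come back as plain Python objects; `sc` and `logFile` are assumed to be defined as in the earlier snippets.)

{% highlight python %}
logData = sc.textFile(logFile).cache()
errors = logData.filter(lambda line: "ERROR" in line)
print errors.count()           # number of matching lines
for line in errors.take(5):    # first few matches, returned as an ordinary Python list
    print line
{% endhighlight %}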
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 26424bbe52..66fb8d73e8 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -11,14 +11,34 @@ Ex: mvn -Phadoop2-yarn clean install
# Building spark core consolidated jar.
-Currently, only sbt can buid a consolidated jar which contains the entire spark code - which is required for launching jars on yarn.
-To do this via sbt - though (right now) is a manual process of enabling it in project/SparkBuild.scala.
+We need a consolidated Spark core JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster.
+This can be built either through sbt or through Maven.
+
+- Building the Spark assembled JAR via sbt.
+  This is (right now) a manual process of enabling it in project/SparkBuild.scala.
Please comment out the
HADOOP_VERSION, HADOOP_MAJOR_VERSION and HADOOP_YARN
variables before the line 'For Hadoop 2 YARN support'
Next, uncomment the subsequent 3 variable declaration lines (for these three variables) which enable hadoop yarn support.
-Currnetly, it is a TODO to add support for maven assembly.
+To assemble the JAR, for example:
+
+ ./sbt/sbt clean assembly
+
+The assembled JAR will typically be something like:
+`./core/target/spark-core-assembly-0.8.0-SNAPSHOT.jar`
+
+
+- Building the Spark assembled JAR via Maven.
+  Use the hadoop2-yarn profile and execute the package target.
+
+For example:
+
+ mvn -Phadoop2-yarn clean package -DskipTests=true
+
+
+This will build the shaded (consolidated) JAR, typically something like:
+`./repl-bin/target/spark-repl-bin-<VERSION>-shaded-hadoop2-yarn.jar`
# Preparations
@@ -30,6 +50,9 @@ If you want to test out the YARN deployment mode, you can use the current Spark
# Launching Spark on YARN
+Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory containing the (client-side) configuration files for the Hadoop cluster.
+These configurations are used to connect to the cluster, write to the DFS, and submit jobs to the ResourceManager.
+
The command to launch the YARN Client is as follows:
SPARK_JAR=<SPARK_JAR_FILE> ./run spark.deploy.yarn.Client \
@@ -48,7 +71,7 @@ For example:
SPARK_JAR=./core/target/spark-core-assembly-{{site.SPARK_VERSION}}.jar ./run spark.deploy.yarn.Client \
--jar examples/target/scala-{{site.SCALA_VERSION}}/spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}.jar \
--class spark.examples.SparkPi \
- --args standalone \
+ --args yarn-standalone \
--num-workers 3 \
--master-memory 4g \
--worker-memory 2g \
@@ -58,7 +81,7 @@ The above starts a YARN Client programs which periodically polls the Application
# Important Notes
-- When your application instantiates a Spark context it must use a special "standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above.
-- YARN does not support requesting container resources based on the number of cores. Thus the numbers of cores given via command line arguments cannot be guaranteed.
+- When your application instantiates a Spark context it must use a special "yarn-standalone" master URL. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "yarn-standalone" as an argument to your program, as shown in the example above.
+- We do not request container resources based on the number of cores, so the number of cores given via command line arguments cannot be guaranteed.
- Currently, we have not yet integrated with hadoop security. If --user is present, the hadoop_user specified will be used to run the tasks on the cluster. If unspecified, current user will be used (which should be valid in cluster).
Once hadoop security support is added, and if hadoop cluster is enabled with security, additional restrictions would apply via delegation tokens passed.
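(Editorial aside, not part of this patch: putting the steps above together, a sketch of a complete launch. The HADOOP_CONF_DIR path and the assembly/examples JAR names are illustrative; substitute the JARs produced by your own sbt or Maven build.)

    # Client-side Hadoop configuration for the target cluster (illustrative path)
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    # Consolidated Spark core jar built via sbt or Maven, as described above
    SPARK_JAR=./core/target/spark-core-assembly-0.8.0-SNAPSHOT.jar ./run spark.deploy.yarn.Client \
      --jar examples/target/scala-2.9.3/spark-examples_2.9.3-0.8.0-SNAPSHOT.jar \
      --class spark.examples.SparkPi \
      --args yarn-standalone \
      --num-workers 3 \
      --master-memory 4g \
      --worker-memory 2g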
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 2315aadbdf..e9cf9ef36f 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -43,12 +43,18 @@ new SparkContext(master, appName, [sparkHome], [jars])
The `master` parameter is a string specifying a [Spark or Mesos cluster URL](#master-urls) to connect to, or a special "local" string to run in local mode, as described below. `appName` is a name for your application, which will be shown in the cluster web UI. Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described later.
-In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called `sc`. Making your own SparkContext will not work. You can set which master the context connects to using the `MASTER` environment variable. For example, to run on four cores, use
+In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called `sc`. Making your own SparkContext will not work. You can set which master the context connects to using the `MASTER` environment variable, and you can add JARs to the classpath with the `ADD_JARS` variable. For example, to run `spark-shell` on four cores, use
{% highlight bash %}
$ MASTER=local[4] ./spark-shell
{% endhighlight %}
+Or, to also add `code.jar` to its classpath, use:
+
+{% highlight bash %}
+$ MASTER=local[4] ADD_JARS=code.jar ./spark-shell
+{% endhighlight %}
+
### Master URLs
The master URL passed to Spark can be in one of the following formats:
@@ -67,6 +73,8 @@ The master URL passed to Spark can be in one of the following formats:
</td></tr>
</table>
+If no master URL is specified, the Spark shell defaults to "local".
+
For running on YARN, Spark launches an instance of the standalone deploy cluster within YARN; see [running on YARN](running-on-yarn.html) for details.
### Deploying Code on a Cluster
@@ -76,7 +84,7 @@ If you want to run your job on a cluster, you will need to specify the two optio
* `sparkHome`: The path at which Spark is installed on your worker machines (it should be the same on all of them).
* `jars`: A list of JAR files on the local machine containing your job's code and any dependencies, which Spark will deploy to all the worker nodes. You'll need to package your job into a set of JARs using your build system. For example, if you're using SBT, the [sbt-assembly](https://github.com/sbt/sbt-assembly) plugin is a good way to make a single JAR with your code and dependencies.
-If you run `spark-shell` on a cluster, any classes you define in the shell will automatically be distributed.
+If you run `spark-shell` on a cluster, you can add JARs to it by specifying the `ADD_JARS` environment variable before you launch it. This variable should contain a comma-separated list of JARs. For example, `ADD_JARS=a.jar,b.jar ./spark-shell` will launch a shell with `a.jar` and `b.jar` on its classpath. In addition, any new classes you define in the shell will automatically be distributed.
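(Editorial aside, not part of this patch: a minimal sketch of the constructor parameters described above; the master URL, Spark home path, and JAR name are placeholders.)

{% highlight scala %}
// Placeholders: adjust the master URL, install path and JAR list for your cluster.
val sc = new SparkContext(
  "spark://master-host:7077",           // Spark cluster URL (see "Master URLs" above)
  "MyJob",                              // appName shown in the cluster web UI
  "/opt/spark",                         // sparkHome: Spark install path on the workers
  List("target/my-job-assembly.jar"))   // jars: your job's code and its dependencies
{% endhighlight %}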
# Resilient Distributed Datasets (RDDs)
diff --git a/docs/spark-simple-tutorial.md b/docs/spark-simple-tutorial.md
index 9875de62bd..fbdbc7d19d 100644
--- a/docs/spark-simple-tutorial.md
+++ b/docs/spark-simple-tutorial.md
@@ -13,7 +13,7 @@ title: Tutorial - Running a Simple Spark Application
3. Edit the ~/SparkTest/sbt/sbt file to look like this:
- #!/bin/bash
+ #!/usr/bin/env bash
java -Xmx800M -XX:MaxPermSize=150m -jar $(dirname $0)/sbt-launch-*.jar "$@"
4. To build a Spark application, you need Spark and its dependencies in a single Java archive (JAR) file. Create this JAR in Spark's main directory with sbt as:
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index f5788dc467..8cd1b0cd66 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -7,7 +7,7 @@ title: Spark Streaming Programming Guide
{:toc}
# Overview
-A Spark Streaming application is very similar to a Spark application; it consists of a *driver program* that runs the user's `main` function and continuous executes various *parallel operations* on input streams of data. The main abstraction Spark Streaming provides is a *discretized stream* (DStream), which is a continuous sequence of RDDs (distributed collections of elements) representing a continuous stream of data. DStreams can be created from live incoming data (such as data from a socket, Kafka, etc.) or can be generated by transformong existing DStreams using parallel operators like `map`, `reduce`, and `window`. The basic processing model is as follows:
+A Spark Streaming application is very similar to a Spark application; it consists of a *driver program* that runs the user's `main` function and continuously executes various *parallel operations* on input streams of data. The main abstraction Spark Streaming provides is a *discretized stream* (DStream), which is a continuous sequence of RDDs (distributed collections of elements) representing a continuous stream of data. DStreams can be created from live incoming data (such as data from a socket, Kafka, etc.) or can be generated by transforming existing DStreams using parallel operators like `map`, `reduce`, and `window`. The basic processing model is as follows:
(i) While a Spark Streaming driver program is running, the system receives data from various sources and divides it into batches. Each batch of data is treated as an RDD, that is, an immutable parallel collection of data. These input RDDs are saved in memory and replicated to two nodes for fault-tolerance. This sequence of RDDs is collectively called an InputDStream.
(ii) Data received by InputDStreams are processed using DStream operations. Since all data is represented as RDDs and all DStream operations as RDD operations, data is automatically recovered in the event of node failures.
@@ -20,7 +20,7 @@ The first thing a Spark Streaming program must do is create a `StreamingContext`
new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])
{% endhighlight %}
-The `master` parameter is a standard [Spark cluster URL](scala-programming-guide.html#master-urls) and can be "local" for local testing. The `appName` is a name of your program, which will be shown on your cluster's web UI. The `batchDuration` is the size of the batches (as explained earlier). This must be set carefully such the cluster can keep up with the processing of the data streams. Start with something conservative like 5 seconds. See the [Performance Tuning](#setting-the-right-batch-size) section for a detailed discussion. Finally, `sparkHome` and `jars` are necessary when running on a cluster to specify the location of your code, as described in the [Spark programming guide](scala-programming-guide.html#deploying-code-on-a-cluster).
+The `master` parameter is a standard [Spark cluster URL](scala-programming-guide.html#master-urls) and can be "local" for local testing. The `appName` is a name of your program, which will be shown on your cluster's web UI. The `batchDuration` is the size of the batches (as explained earlier). This must be set carefully such that the cluster can keep up with the processing of the data streams. Start with something conservative like 5 seconds. See the [Performance Tuning](#setting-the-right-batch-size) section for a detailed discussion. Finally, `sparkHome` and `jars` are necessary when running on a cluster to specify the location of your code, as described in the [Spark programming guide](scala-programming-guide.html#deploying-code-on-a-cluster).
This constructor creates a SparkContext for your job as well, which can be accessed with `streamingContext.sparkContext`.
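(Editorial aside, not part of this patch: a minimal local-mode sketch of the constructor described above; the application name is a placeholder, and the 5-second batch size follows the conservative suggestion in the preceding paragraph.)

{% highlight scala %}
import spark.streaming.{Seconds, StreamingContext}

// "local" master for testing, placeholder app name, 5-second batches.
val ssc = new StreamingContext("local", "StreamingSketch", Seconds(5))
val sc = ssc.sparkContext   // the SparkContext this constructor creates for the job
{% endhighlight %}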
diff --git a/docs/tuning.md b/docs/tuning.md
index 32c7ab86e9..5ffca54481 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -157,9 +157,9 @@ their work directories), *not* on your driver program.
**Cache Size Tuning**
-One important configuration parameter for GC is the amount of memory that should be used for
-caching RDDs. By default, Spark uses 66% of the configured memory (`SPARK_MEM`) to cache RDDs. This means that
- 33% of memory is available for any objects created during task execution.
+One important configuration parameter for GC is the amount of memory that should be used for caching RDDs.
+By default, Spark uses 66% of the configured executor memory (`spark.executor.memory` or `SPARK_MEM`) to
+cache RDDs. This means that 33% of memory is available for any objects created during task execution.
In case your tasks slow down and you find that your JVM is garbage-collecting frequently or running out of
memory, lowering this value will help reduce the memory consumption. To change this to say 50%, you can call