author     Matei Zaharia <matei@eecs.berkeley.edu>    2013-08-31 14:21:10 -0700
committer  Matei Zaharia <matei@eecs.berkeley.edu>    2013-08-31 14:21:10 -0700
commit     4819baa658a6c8a3e4c5c504af284ea6091e4c35 (patch)
tree       00eda629ac7292487ef14f858d19297c38a19607 /docs
parent     4293533032bd5c354bb011f8d508b99615c6e0f0 (diff)
More updates, describing changes to recommended use of environment vars
and new Python stuff
Diffstat (limited to 'docs')
-rwxr-xr-x  docs/_layouts/global.html         |  2
-rw-r--r--  docs/configuration.md             | 85
-rw-r--r--  docs/index.md                     |  2
-rw-r--r--  docs/mllib-guide.md               |  6
-rw-r--r--  docs/python-programming-guide.md  |  4
-rw-r--r--  docs/running-on-mesos.md          |  3
-rw-r--r--  docs/spark-standalone.md          | 52
7 files changed, 74 insertions, 80 deletions
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 91a4a2eaee..84749fda4e 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -66,7 +66,7 @@
<li><a href="python-programming-guide.html">Spark in Python</a></li>
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
- <li><a href="mllib-programming-guide.html">MLlib (Machine Learning)</a></li>
+ <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
</ul>
</li>
diff --git a/docs/configuration.md b/docs/configuration.md
index b125eeb03c..1c0492efb3 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -5,50 +5,14 @@ title: Spark Configuration
Spark provides three main locations to configure the system:
-* [Environment variables](#environment-variables) for launching Spark workers, which can
- be set either in your driver program or in the `conf/spark-env.sh` script.
-* [Java system properties](#system-properties), which control internal configuration parameters and can be set either
- programmatically (by calling `System.setProperty` *before* creating a `SparkContext`) or through the
- `SPARK_JAVA_OPTS` environment variable in `spark-env.sh`.
+* [Java system properties](#system-properties), which control internal configuration parameters and can be set
+ either programmatically (by calling `System.setProperty` *before* creating a `SparkContext`) or through
+ JVM arguments.
+* [Environment variables](#environment-variables) for configuring per-machine settings such as the IP address,
+ which can be set in the `conf/spark-env.sh` script.
* [Logging configuration](#configuring-logging), which is done through `log4j.properties`.
-# Environment Variables
-
-Spark determines how to initialize the JVM on worker nodes, or even on the local node when you run `spark-shell`,
-by running the `conf/spark-env.sh` script in the directory where it is installed. This script does not exist by default
-in the Git repository, but but you can create it by copying `conf/spark-env.sh.template`. Make sure that you make
-the copy executable.
-
-Inside `spark-env.sh`, you *must* set at least the following two variables:
-
-* `SCALA_HOME`, to point to your Scala installation, or `SCALA_LIBRARY_PATH` to point to the directory for Scala
- library JARs (if you install Scala as a Debian or RPM package, there is no `SCALA_HOME`, but these libraries
- are in a separate path, typically /usr/share/java; look for `scala-library.jar`).
-* `MESOS_NATIVE_LIBRARY`, if you are [running on a Mesos cluster](running-on-mesos.html).
-
-In addition, there are four other variables that control execution. These should be set *in the environment that
-launches the job's driver program* instead of `spark-env.sh`, because they will be automatically propagated to
-workers. Setting these per-job instead of in `spark-env.sh` ensures that different jobs can have different settings
-for these variables.
-
-* `SPARK_JAVA_OPTS`, to add JVM options. This includes any system properties that you'd like to pass with `-D`.
-* `SPARK_CLASSPATH`, to add elements to Spark's classpath.
-* `SPARK_LIBRARY_PATH`, to add search directories for native libraries.
-* `SPARK_MEM`, to set the amount of memory used per node. This should be in the same format as the
- JVM's -Xmx option, e.g. `300m` or `1g`. Note that this option will soon be deprecated in favor of
- the `spark.executor.memory` system property, so we recommend using that in new code.
-
-Beware that if you do set these variables in `spark-env.sh`, they will override the values set by user programs,
-which is undesirable; if you prefer, you can choose to have `spark-env.sh` set them only if the user program
-hasn't, as follows:
-
-{% highlight bash %}
-if [ -z "$SPARK_JAVA_OPTS" ] ; then
- SPARK_JAVA_OPTS="-verbose:gc"
-fi
-{% endhighlight %}
-
# System Properties
To set a system property for configuring Spark, you need to either pass it with a -D flag to the JVM (for example `java -Dspark.cores.max=5 MyProgram`) or call `System.setProperty` in your code *before* creating your Spark context, as follows:
@@ -67,7 +31,7 @@ there are at least five properties that you will commonly want to control:
<td>spark.executor.memory</td>
<td>512m</td>
<td>
- Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. `512m`, `2g`).
+ Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. <code>512m</code>, <code>2g</code>).
</td>
</tr>
<tr>
@@ -78,7 +42,7 @@ there are at least five properties that you will commonly want to control:
in serialized form. The default of Java serialization works with any Serializable Java object but is
quite slow, so we recommend <a href="tuning.html">using <code>spark.KryoSerializer</code>
and configuring Kryo serialization</a> when speed is necessary. Can be any subclass of
- <a href="api/core/index.html#spark.Serializer"><code>spark.Serializer</code></a>).
+ <a href="api/core/index.html#spark.Serializer"><code>spark.Serializer</code></a>.
</td>
</tr>
<tr>
@@ -87,7 +51,7 @@ there are at least five properties that you will commonly want to control:
<td>
If you use Kryo serialization, set this class to register your custom classes with Kryo.
You need to set it to a class that extends
- <a href="api/core/index.html#spark.KryoRegistrator"><code>spark.KryoRegistrator</code></a>).
+ <a href="api/core/index.html#spark.KryoRegistrator"><code>spark.KryoRegistrator</code></a>.
See the <a href="tuning.html#data-serialization">tuning guide</a> for more details.
</td>
</tr>
@@ -97,7 +61,7 @@ there are at least five properties that you will commonly want to control:
<td>
Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored
on disk. This should be on a fast, local disk in your system. It can also be a comma-separated
- list of multiple directories.
+ list of multiple directories on different disks.
</td>
</tr>
<tr>
@@ -106,7 +70,8 @@ there are at least five properties that you will commonly want to control:
<td>
When running on a <a href="spark-standalone.html">standalone deploy cluster</a> or a
<a href="running-on-mesos.html#mesos-run-modes">Mesos cluster in "coarse-grained"
- sharing mode</a>, how many CPU cores to request at most. The default will use all available cores.
+ sharing mode</a>, how many CPU cores to request at most. The default will use all available cores
+ offered by the cluster manager.
</td>
</tr>
</table>
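For illustration, these properties can also be supplied as `-D` JVM arguments when launching the driver program; a minimal sketch, assuming a hypothetical `MyApp` class packaged in `my-app.jar` with Spark's jars already on the classpath:
{% highlight bash %}
# Hypothetical application name and jar; each -D flag sets one of the
# system properties described in the table above.
java -Dspark.executor.memory=1g \
     -Dspark.serializer=spark.KryoSerializer \
     -Dspark.local.dir=/mnt/disk1,/mnt/disk2 \
     -Dspark.cores.max=10 \
     -cp my-app.jar MyApp
{% endhighlight %}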
@@ -321,7 +286,7 @@ Apart from these, the following properties are also available, and may be useful
</tr>
<tr>
<td>spark.cleaner.ttl</td>
- <td>(disable)</td>
+ <td>(infinite)</td>
<td>
Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.).
  Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is
@@ -347,6 +312,32 @@ Apart from these, the following properties are also available, and may be useful
</table>
+# Environment Variables
+
+Certain Spark settings can also be configured through environment variables, which are read from the `conf/spark-env.sh`
+script in the directory where Spark is installed. These variables are meant for machine-specific settings, such
+as library search paths. While Java system properties can also be set here, we recommend setting application-level
+properties within the application itself rather than in `spark-env.sh`, so that different applications can use
+different settings.
+
+Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can copy
+`conf/spark-env.sh.template` to create it. Make sure you make the copy executable.
+
+The following variables can be set in `spark-env.sh`:
+
+* `SPARK_LOCAL_IP`, to configure which IP address of the machine to bind to.
+* `SPARK_LIBRARY_PATH`, to add search directories for native libraries.
+* `SPARK_CLASSPATH`, to add elements to Spark's classpath that you want to be present for _all_ applications.
+ Note that applications can also add dependencies for themselves through `SparkContext.addJar` -- we recommend
+ doing that when possible.
+* `SPARK_JAVA_OPTS`, to add JVM options. This includes Java options like garbage collector settings and any system
+ properties that you'd like to pass with `-D` (e.g., `-Dspark.local.dir=/disk1,/disk2`).
+* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as the number of cores
+  to use on each machine and the maximum memory.
+
+Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might
+compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.
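For example, a `conf/spark-env.sh` along these lines would set the per-machine variables above; a sketch only, assuming a Linux machine whose `eth0` interface and old-style `ifconfig` output provide the address to bind to:
{% highlight bash %}
#!/usr/bin/env bash
# Sketch of a conf/spark-env.sh; the interface name and paths are illustrative.

# Bind to the address of a specific interface (assumes classic "inet addr:" ifconfig output).
SPARK_LOCAL_IP=$(/sbin/ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{print $1}')
export SPARK_LOCAL_IP

# Search directories for native libraries, e.g. libmesos.
export SPARK_LIBRARY_PATH=/usr/local/lib

# JVM options, including system properties passed with -D.
export SPARK_JAVA_OPTS="-verbose:gc -Dspark.local.dir=/mnt/disk1,/mnt/disk2"
{% endhighlight %}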
+
# Configuring Logging
Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties`
diff --git a/docs/index.md b/docs/index.md
index cb51d4cadc..bcd7dad6ae 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -58,7 +58,7 @@ set `SPARK_YARN`:
* [Java Programming Guide](java-programming-guide.html): using Spark from Java
* [Python Programming Guide](python-programming-guide.html): using Spark from Python
* [Spark Streaming](streaming-programming-guide.html): using the alpha release of Spark Streaming
-* [MLlib (Machine Learning)](mllib-programming-guide.html): Spark's built-in machine learning library
+* [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
**API Docs:**
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
new file mode 100644
index 0000000000..c897f8b36c
--- /dev/null
+++ b/docs/mllib-guide.md
@@ -0,0 +1,6 @@
+---
+layout: global
+title: Machine Learning Library (MLlib)
+---
+
+Coming soon.
diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md
index 15d3ebfcae..27e0d10080 100644
--- a/docs/python-programming-guide.md
+++ b/docs/python-programming-guide.md
@@ -17,8 +17,8 @@ There are a few key differences between the Python and Scala APIs:
* Python is dynamically typed, so RDDs can hold objects of different types.
* PySpark does not currently support the following Spark features:
- - Special functions on RDDs of doubles, such as `mean` and `stdev`
- - `lookup`, `sample` and `sort`
+ - `lookup`
+ - `sort`
- `persist` at storage levels other than `MEMORY_ONLY`
- Execution on Windows -- this is slated for a future release
diff --git a/docs/running-on-mesos.md b/docs/running-on-mesos.md
index f4a3eb667c..b31f78e8bf 100644
--- a/docs/running-on-mesos.md
+++ b/docs/running-on-mesos.md
@@ -9,9 +9,8 @@ Spark can run on private clusters managed by the [Apache Mesos](http://incubator
2. Download Mesos {{site.MESOS_VERSION}} from a [mirror](http://www.apache.org/dyn/closer.cgi/incubator/mesos/mesos-{{site.MESOS_VERSION}}/).
3. Configure Mesos using the `configure` script, passing the location of your `JAVA_HOME` using `--with-java-home`. Mesos comes with "template" configure scripts for different platforms, such as `configure.macosx`, that you can run. See the README file in Mesos for other options. **Note:** If you want to run Mesos without installing it into the default paths on your system (e.g. if you don't have administrative privileges to install it), you should also pass the `--prefix` option to `configure` to tell it where to install. For example, pass `--prefix=/home/user/mesos`. By default the prefix is `/usr/local`.
4. Build Mesos using `make`, and then install it using `make install`.
-5. Create a file called `spark-env.sh` in Spark's `conf` directory, by copying `conf/spark-env.sh.template`, and add the following lines in it:
+5. Create a file called `spark-env.sh` in Spark's `conf` directory, by copying `conf/spark-env.sh.template`, and add the following lines to it:
* `export MESOS_NATIVE_LIBRARY=<path to libmesos.so>`. This path is usually `<prefix>/lib/libmesos.so` (where the prefix is `/usr/local` by default). Also, on Mac OS X, the library is called `libmesos.dylib` instead of `.so`.
- * `export SCALA_HOME=<path to Scala directory>`.
6. Copy Spark and Mesos to the _same_ paths on all the nodes in the cluster (or, for Mesos, `make install` on every node).
7. Configure Mesos for deployment:
* On your master node, edit `<prefix>/var/mesos/deploy/masters` to list your master and `<prefix>/var/mesos/deploy/slaves` to list the slaves, where `<prefix>` is the prefix where you installed Mesos (`/usr/local` by default).
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index bb8be276c5..9ab6ba0830 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -3,18 +3,7 @@ layout: global
title: Spark Standalone Mode
---
-{% comment %}
-TODO(andyk):
- - Add a table of contents
- - Move configuration towards the end so that it doesn't come first
- - Say the scripts will guess the resource amounts (i.e. # cores) automatically
-{% endcomment %}
-
-In addition to running on top of [Mesos](https://github.com/mesos/mesos), Spark also supports a standalone mode, consisting of one Spark master and several Spark worker processes. You can run the Spark standalone mode either locally (for testing) or on a cluster. If you wish to run on a cluster, we have provided [a set of deploy scripts](#cluster-launch-scripts) to launch a whole cluster.
-
-# Getting Started
-
-Compile Spark with `sbt package` as described in the [Getting Started Guide](index.html). You do not need to install Mesos on your machine if you are using the standalone mode.
+In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or by using our provided [deploy scripts](#cluster-launch-scripts). It is also possible to run these daemons on a single machine for testing.
# Starting a Cluster Manually
@@ -22,8 +11,8 @@ You can start a standalone master server by executing:
./spark-class spark.deploy.master.Master
-Once started, the master will print out a `spark://IP:PORT` URL for itself, which you can use to connect workers to it,
-or pass as the "master" argument to `SparkContext` to connect a job to the cluster. You can also find this URL on
+Once started, the master will print out a `spark://HOST:PORT` URL for itself, which you can use to connect workers to it,
+or pass as the "master" argument to `SparkContext`. You can also find this URL on
the master's web UI, which is [http://localhost:8080](http://localhost:8080) by default.
Similarly, you can start one or more workers and connect them to the master via:
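By analogy with the master command shown above, the worker launch is presumably of the form below (the exact worker class name is an assumption here, mirroring `spark.deploy.master.Master`):
{% highlight bash %}
# Assumed worker class name; the spark://HOST:PORT URL is the one printed by the master.
./spark-class spark.deploy.worker.Worker spark://HOST:PORT
{% endhighlight %}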
@@ -68,7 +57,7 @@ Finally, the following configuration options can be passed to the master and wor
To launch a Spark standalone cluster with the deploy scripts, you need to create a file called `conf/slaves` in your Spark directory, which should contain the hostnames of all the machines where you would like to start Spark workers, one per line. The master machine must be able to access each of the slave machines via password-less `ssh` (using a private key). For testing, you can just put `localhost` in this file.
-Once you've set up this fine, you can launch or stop your cluster with the following shell scripts, based on Hadoop's deploy scripts, and available in `SPARK_HOME/bin`:
+Once you've set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop's deploy scripts, and available in `SPARK_HOME/bin`:
- `bin/start-master.sh` - Starts a master instance on the machine the script is executed on.
- `bin/start-slaves.sh` - Starts a slave instance on each machine specified in the `conf/slaves` file.
@@ -85,47 +74,56 @@ You can optionally configure the cluster further by setting environment variable
<tr><th style="width:21%">Environment Variable</th><th>Meaning</th></tr>
<tr>
<td><code>SPARK_MASTER_IP</code></td>
- <td>Bind the master to a specific IP address, for example a public one</td>
+ <td>Bind the master to a specific IP address, for example a public one.</td>
</tr>
<tr>
<td><code>SPARK_MASTER_PORT</code></td>
- <td>Start the master on a different port (default: 7077)</td>
+ <td>Start the master on a different port (default: 7077).</td>
</tr>
<tr>
<td><code>SPARK_MASTER_WEBUI_PORT</code></td>
- <td>Port for the master web UI (default: 8080)</td>
+ <td>Port for the master web UI (default: 8080).</td>
</tr>
<tr>
<td><code>SPARK_WORKER_PORT</code></td>
- <td>Start the Spark worker on a specific port (default: random)</td>
+ <td>Start the Spark worker on a specific port (default: random).</td>
</tr>
<tr>
<td><code>SPARK_WORKER_DIR</code></td>
- <td>Directory to run jobs in, which will include both logs and scratch space (default: SPARK_HOME/work)</td>
+ <td>Directory to run jobs in, which will include both logs and scratch space (default: SPARK_HOME/work).</td>
</tr>
<tr>
<td><code>SPARK_WORKER_CORES</code></td>
- <td>Total number of cores to allow Spark jobs to use on the machine (default: all available cores)</td>
+ <td>Total number of cores to allow Spark jobs to use on the machine (default: all available cores).</td>
</tr>
<tr>
<td><code>SPARK_WORKER_MEMORY</code></td>
- <td>Total amount of memory to allow Spark jobs to use on the machine, e.g. 1000M, 2G (default: total memory minus 1 GB); note that each job's <i>individual</i> memory is configured using <code>SPARK_MEM</code></td>
+ <td>Total amount of memory to allow Spark jobs to use on the machine, e.g. <code>1000m</code>, <code>2g</code> (default: total memory minus 1 GB); note that each job's <i>individual</i> memory is configured using its <code>spark.executor.memory</code> property.</td>
</tr>
<tr>
<td><code>SPARK_WORKER_WEBUI_PORT</code></td>
- <td>Port for the worker web UI (default: 8081)</td>
+ <td>Port for the worker web UI (default: 8081).</td>
+ </tr>
+ <tr>
+ <td><code>SPARK_WORKER_INSTANCES</code></td>
+ <td>
+ Number of worker instances to run on each machine (default: 1). You can make this more than 1 if
+ you have very large machines and would like multiple Spark worker processes. If you do set
+ this, make sure to also set <code>SPARK_WORKER_CORES</code> explicitly to limit the cores per worker,
+ or else each worker will try to use all the cores.
+ </td>
</tr>
<tr>
<td><code>SPARK_DAEMON_MEMORY</code></td>
- <td>Memory to allocate to the Spark master and worker daemons themselves (default: 512m)</td>
+ <td>Memory to allocate to the Spark master and worker daemons themselves (default: 512m).</td>
</tr>
<tr>
<td><code>SPARK_DAEMON_JAVA_OPTS</code></td>
- <td>JVM options for the Spark master and worker daemons themselves (default: none)</td>
+ <td>JVM options for the Spark master and worker daemons themselves (default: none).</td>
</tr>
</table>
-
+**Note:** The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand.
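For instance, a worker machine's `conf/spark-env.sh` for use with these scripts might look like the following sketch (all values are illustrative):
{% highlight bash %}
# Illustrative values only.
export SPARK_WORKER_CORES=8        # cores each worker process may hand out
export SPARK_WORKER_MEMORY=14g     # memory each worker process may hand out
export SPARK_WORKER_INSTANCES=2    # run two worker processes per machine; with
                                   # SPARK_WORKER_CORES set, they will not each
                                   # try to claim all of the machine's cores
{% endhighlight %}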
# Connecting a Job to the Cluster
@@ -155,5 +153,5 @@ In addition, detailed log output for each job is also written to the work direct
# Running Alongside Hadoop
-You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the machines. To access Hadoop data from Spark, just use a hdfs:// URL (typically `hdfs://<namenode>:9000/path`, but you can find the right URL on your Hadoop Namenode's web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on).
+You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines. To access Hadoop data from Spark, just use an hdfs:// URL (typically `hdfs://<namenode>:9000/path`, but you can find the right URL on your Hadoop NameNode's web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on).
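As a rough sketch (the hostnames and path are placeholders), you might point an interactive shell at the standalone master and read the co-located HDFS data directly:
{% highlight bash %}
# Hostnames and path are placeholders.
MASTER=spark://master-host:7077 ./spark-shell
# Then, inside the shell, read from HDFS with an hdfs:// URL, e.g.:
#   sc.textFile("hdfs://namenode-host:9000/path/to/data").count()
{% endhighlight %}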