 conf/spark-defaults.conf.template        |  3
 conf/spark-env.sh.template               |  4
 docs/building-with-maven.md              |  7
 docs/cluster-overview.md                 | 73
 docs/configuration.md                    | 64
 docs/hadoop-third-party-distributions.md | 14
 docs/index.md                            | 34
 docs/java-programming-guide.md           |  5
 docs/python-programming-guide.md         |  2
 docs/quick-start.md                      |  4
 docs/running-on-yarn.md                  | 15
 docs/scala-programming-guide.md          | 13
 docs/spark-standalone.md                 | 71
 13 files changed, 184 insertions(+), 125 deletions(-)
diff --git a/conf/spark-defaults.conf.template b/conf/spark-defaults.conf.template
index f840ff681d..2779342769 100644
--- a/conf/spark-defaults.conf.template
+++ b/conf/spark-defaults.conf.template
@@ -2,6 +2,7 @@
# This is useful for setting default environmental settings.
# Example:
-# spark.master spark://master:7077
+# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
+# spark.serializer org.apache.spark.serializer.KryoSerializer
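For context, a minimal sketch of what these example settings look like once uncommented and appended to the defaults file (the host names are the placeholders from the template itself):

    # Sketch only: enable the template's example values in conf/spark-defaults.conf.
    cat >> conf/spark-defaults.conf <<'EOF'
    spark.master            spark://master:7077
    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs://namenode:8021/directory
    spark.serializer        org.apache.spark.serializer.KryoSerializer
    EOF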
diff --git a/conf/spark-env.sh.template b/conf/spark-env.sh.template
index 4479e1e34c..f8ffbf6427 100755
--- a/conf/spark-env.sh.template
+++ b/conf/spark-env.sh.template
@@ -30,11 +30,11 @@
# Options for the daemons used in the standalone deploy mode:
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
-# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
+# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
-# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT
+# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
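A sketch of how the port variables described above might be used; the worker port is illustrative (the default is random), while 8080 and 8081 are the documented web UI defaults:

    # Hypothetical conf/spark-env.sh excerpt; adjust ports to your environment.
    export SPARK_MASTER_PORT=7077          # master service port
    export SPARK_MASTER_WEBUI_PORT=8080    # master web UI
    export SPARK_WORKER_PORT=7078          # worker service port (default: random)
    export SPARK_WORKER_WEBUI_PORT=8081    # worker web UI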
diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
index b6dd553bbe..8b44535d82 100644
--- a/docs/building-with-maven.md
+++ b/docs/building-with-maven.md
@@ -129,6 +129,13 @@ Java 8 tests are run when -Pjava8-tests profile is enabled, they will run in spi
For these tests to run your system must have a JDK 8 installation.
If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests.
+## Building for PySpark on YARN ##
+
+PySpark on YARN is only supported if the jar is built with maven. Further, there is a known problem
+with building this assembly jar on Red Hat based operating systems (see SPARK-1753). If you wish to
+run PySpark on a YARN cluster with Red Hat installed, we recommend that you build the jar elsewhere,
+then ship it over to the cluster. We are investigating the exact cause for this.
+
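For reference, one plausible Maven invocation for building the YARN-enabled assembly; the profile and version flags are assumptions, so match them to your Hadoop version as described in the sections below:

    # Sketch: build the assembly with YARN support against Hadoop 2.2.0.
    mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package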
## Packaging without Hadoop dependencies for deployment on YARN ##
The assembly jar produced by "mvn package" will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with yarn.application.classpath. The "hadoop-provided" profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index 162c415b58..f05a755de7 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -66,62 +66,76 @@ script as shown here while passing your jar.
For Python, you can use the `pyFiles` argument of SparkContext
or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed.
-### Launching Applications with ./bin/spark-submit
+### Launching Applications with Spark submit
Once a user application is bundled, it can be launched using the `spark-submit` script located in
the bin directory. This script takes care of setting up the classpath with Spark and its
-dependencies, and can support different cluster managers and deploy modes that Spark supports.
-It's usage is
+dependencies, and can support different cluster managers and deploy modes that Spark supports:
- ./bin/spark-submit --class path.to.your.Class [options] <app jar> [app options]
+ ./bin/spark-submit \
+ --class <main-class> \
+ --master <master-url> \
+ --deploy-mode <deploy-mode> \
+ ... # other options
+ <application-jar>
+ [application-arguments]
-When calling `spark-submit`, `[app options]` will be passed along to your application's
-main class. To enumerate all options available to `spark-submit` run it with
-the `--help` flag. Here are a few examples of common options:
+ main-class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
+ master-url: The URL of the master node (e.g. spark://23.195.26.187:7077)
+ deploy-mode: Whether to deploy this application within the cluster or from an external client (e.g. client)
+ application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
+ application-arguments: Space delimited arguments passed to the main method of <main-class>, if any
+
+To enumerate all options available to `spark-submit` run it with the `--help` flag. Here are a few
+examples of common options:
{% highlight bash %}
# Run application locally
./bin/spark-submit \
- --class my.main.ClassName
+ --class org.apache.spark.examples.SparkPi \
--master local[8] \
- my-app.jar
+ /path/to/examples.jar \
+ 100
# Run on a Spark standalone cluster
./bin/spark-submit \
- --class my.main.ClassName
- --master spark://mycluster:7077 \
+ --class org.apache.spark.examples.SparkPi \
+ --master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
- my-app.jar
+ /path/to/examples.jar \
+ 1000
# Run on a YARN cluster
-HADOOP_CONF_DIR=XX /bin/spark-submit \
- --class my.main.ClassName
+HADOOP_CONF_DIR=XX ./bin/spark-submit \
+ --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \ # can also be `yarn-client` for client mode
--executor-memory 20G \
--num-executors 50 \
- my-app.jar
+ /path/to/examples.jar \
+ 1000
{% endhighlight %}
### Loading Configurations from a File
-The `spark-submit` script can load default `SparkConf` values from a properties file and pass them
-onto your application. By default it will read configuration options from
-`conf/spark-defaults.conf`. Any values specified in the file will be passed on to the
-application when run. They can obviate the need for certain flags to `spark-submit`: for
-instance, if `spark.master` property is set, you can safely omit the
+The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
+properties file and pass them on to your application. By default it will read configuration options
+from `conf/spark-defaults.conf`. For more detail, see the section on
+[loading default configurations](configuration.html#loading-default-configurations).
+
+Loading default Spark configurations this way can obviate the need for certain flags to
+`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
-`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values
-in the defaults file.
+`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
+defaults file.
-If you are ever unclear where configuration options are coming from. fine-grained debugging
-information can be printed by adding the `--verbose` option to `./spark-submit`.
+If you are ever unclear where configuration options are coming from, you can print out fine-grained
+debugging information by running `spark-submit` with the `--verbose` option.
### Advanced Dependency Management
-When using `./bin/spark-submit` the app jar along with any jars included with the `--jars` option
-will be automatically transferred to the cluster. `--jars` can also be used to distribute .egg and .zip
-libraries for Python to executors. Spark uses the following URL scheme to allow different
-strategies for disseminating jars:
+When using `spark-submit`, the application jar along with any jars included with the `--jars` option
+will be automatically transferred to the cluster. Spark uses the following URL scheme to allow
+different strategies for disseminating jars:
- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
every executor pulls the file from the driver HTTP server.
@@ -135,6 +149,9 @@ This can use up a significant amount of space over time and will need to be clea
is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
`spark.worker.cleanup.appDataTtl` property.
+For Python, the equivalent `--py-files` option can be used to distribute .egg and .zip libraries
+to executors.
+
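A hedged illustration of the `--py-files` option; the script, the archive names, and the master URL below are placeholders:

    # Sketch only: ship Python dependencies alongside a PySpark application.
    ./bin/spark-submit \
      --master spark://207.184.161.138:7077 \
      --py-files deps.zip,extra-lib.egg \
      my_script.py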
# Monitoring
Each driver program has a web UI, typically on port 4040, that displays information about running
diff --git a/docs/configuration.md b/docs/configuration.md
index 5b034e3cb3..2eed96f704 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -5,9 +5,9 @@ title: Spark Configuration
Spark provides three locations to configure the system:
-* [Spark properties](#spark-properties) control most application parameters and can be set by passing
- a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java
- system properties.
+* [Spark properties](#spark-properties) control most application parameters and can be set by
+ passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext,
+ or through the `conf/spark-defaults.conf` properties file.
* [Environment variables](#environment-variables) can be used to set per-machine settings, such as
the IP address, through the `conf/spark-env.sh` script on each node.
* [Logging](#configuring-logging) can be configured through `log4j.properties`.
@@ -15,25 +15,41 @@ Spark provides three locations to configure the system:
# Spark Properties
-Spark properties control most application settings and are configured separately for each application.
-The preferred way to set them is by passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
-class to your SparkContext constructor.
-Alternatively, Spark will also load them from Java system properties, for compatibility with old versions
-of Spark.
-
-SparkConf lets you configure most of the common properties to initialize a cluster (e.g., master URL and
-application name), as well as arbitrary key-value pairs through the `set()` method. For example, we could
-initialize an application as follows:
+Spark properties control most application settings and are configured separately for each
+application. The preferred way is to set them through a
+[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object and pass it as an argument to your
+SparkContext. SparkConf allows you to configure most of the common properties to initialize a
+cluster (e.g. master URL and application name), as well as arbitrary key-value pairs through the
+`set()` method. For example, we could initialize an application as follows:
{% highlight scala %}
-val conf = new SparkConf().
- setMaster("local").
- setAppName("My application").
- set("spark.executor.memory", "1g")
+val conf = new SparkConf()
+ .setMaster("local")
+ .setAppName("CountingSheep")
+ .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
{% endhighlight %}
-Most of the properties control internal settings that have reasonable default values. However,
+## Loading Default Configurations
+
+In the case of `spark-shell`, a SparkContext has already been created for you, so you cannot control
+the configuration properties through SparkConf. However, you can still set configuration properties
+through a default configuration file. By default, `spark-shell` (and more generally `spark-submit`)
+will read configuration options from `conf/spark-defaults.conf`, in which each line consists of a
+key and a value separated by whitespace. For example,
+
+ spark.master spark://5.6.7.8:7077
+ spark.executor.memory 512m
+ spark.eventLog.enabled true
+ spark.serializer org.apache.spark.serializer.KryoSerializer
+
+Any values specified in the file will be passed on to the application, and merged with those
+specified through SparkConf. If the same configuration property exists in both `spark-defaults.conf`
+and SparkConf, then the latter will take precedence as it is the most application-specific.
+
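To make the precedence concrete, a small sketch: assume `conf/spark-defaults.conf` contains the lines above; an explicit command-line flag then overrides the corresponding entry from the file.

    # spark.executor.memory and spark.serializer still come from conf/spark-defaults.conf,
    # but the explicit --master flag takes precedence over the file's spark.master entry.
    ./bin/spark-shell --master local[4]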
+## All Configuration Properties
+
+Most of the properties that control internal settings have reasonable default values. However,
there are at least five properties that you will commonly want to control:
<table class="table">
@@ -101,9 +117,9 @@ Apart from these, the following properties are also available, and may be useful
<td>spark.default.parallelism</td>
<td>
<ul>
+ <li>Local mode: number of cores on the local machine</li>
<li>Mesos fine grained mode: 8</li>
- <li>Local mode: core number of the local machine</li>
- <li>Others: total core number of all executor nodes or 2, whichever is larger</li>
+ <li>Others: total number of cores on all executor nodes or 2, whichever is larger</li>
</ul>
</td>
<td>
@@ -187,7 +203,7 @@ Apart from these, the following properties are also available, and may be useful
Comma separated list of filter class names to apply to the Spark web ui. The filter should be a
standard javax servlet Filter. Parameters to each filter can also be specified by setting a
java system property of spark.&lt;class name of filter&gt;.params='param1=value1,param2=value2'
- (e.g.-Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')
+ (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')
</td>
</tr>
<tr>
@@ -696,7 +712,9 @@ Apart from these, the following properties are also available, and may be useful
## Viewing Spark Properties
The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab.
-This is a useful place to check to make sure that your properties have been set correctly.
+This is a useful place to check to make sure that your properties have been set correctly. Note
+that only values explicitly specified through either `spark-defaults.conf` or SparkConf will
+appear. For all other configuration properties, you can assume the default value is used.
# Environment Variables
@@ -714,8 +732,8 @@ The following variables can be set in `spark-env.sh`:
* `PYSPARK_PYTHON`, the Python binary to use for PySpark
* `SPARK_LOCAL_IP`, to configure which IP address of the machine to bind to.
* `SPARK_PUBLIC_DNS`, the hostname your Spark program will advertise to other machines.
-* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores
- to use on each machine and maximum memory.
+* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts),
+ such as number of cores to use on each machine and maximum memory.
Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might
compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.
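A rough sketch of such a computation; the interface name `eth0` and the output parsing are assumptions about your system:

    # Hypothetical conf/spark-env.sh line: bind to the address of a specific interface.
    export SPARK_LOCAL_IP="$(/sbin/ifconfig eth0 | awk '/inet / {sub(/addr:/, "", $2); print $2; exit}')"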
diff --git a/docs/hadoop-third-party-distributions.md b/docs/hadoop-third-party-distributions.md
index 454877a7fa..a0aeab5727 100644
--- a/docs/hadoop-third-party-distributions.md
+++ b/docs/hadoop-third-party-distributions.md
@@ -9,12 +9,14 @@ with these distributions:
# Compile-time Hadoop Version
-When compiling Spark, you'll need to
-[set the SPARK_HADOOP_VERSION flag](index.html#a-note-about-hadoop-versions):
+When compiling Spark, you'll need to specify the Hadoop version by defining the `hadoop.version`
+property. For certain versions, you will need to specify additional profiles. For more detail,
+see the guide on [building with maven](building-with-maven.html#specifying-the-hadoop-version):
- SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly
+ mvn -Dhadoop.version=1.0.4 -DskipTests clean package
+ mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
-The table below lists the corresponding `SPARK_HADOOP_VERSION` code for each CDH/HDP release. Note that
+The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that
some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
distribution may "just work" without you needing to compile. That said, we recommend compiling with
the _exact_ Hadoop version you are running to avoid any compatibility errors.
@@ -46,6 +48,10 @@ the _exact_ Hadoop version you are running to avoid any compatibility errors.
</tr>
</table>
+In SBT, the equivalent can be achieved by setting the SPARK_HADOOP_VERSION flag:
+
+ SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly
+
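For example, a CDH 4 (MRv1) build might look like the following; the exact version string is illustrative, so take the value for your release from the table above:

    # Illustrative only; substitute the hadoop.version value for your CDH/HDP release.
    mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package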
# Linking Applications to the Hadoop Version
In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that
diff --git a/docs/index.md b/docs/index.md
index a2f1a84371..48182a27d2 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -24,21 +24,31 @@ right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/
# Running the Examples and Shell
-Spark comes with several sample programs. Scala, Java and Python examples are in the `examples/src/main` directory.
-To run one of the Java or Scala sample programs, use `./bin/run-example <class> <params>` in the top-level Spark directory
-(the `bin/run-example` script sets up the appropriate paths and launches that program).
-For example, try `./bin/run-example org.apache.spark.examples.SparkPi local`.
-To run a Python sample program, use `./bin/pyspark <sample-program> <params>`. For example, try `./bin/pyspark ./examples/src/main/python/pi.py local`.
+Spark comes with several sample programs. Scala, Java and Python examples are in the
+`examples/src/main` directory. To run one of the Java or Scala sample programs, use
+`bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this
+invokes the more general
+[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) for
+launching applications). For example,
-Each example prints usage help when run with no parameters.
+ ./bin/run-example SparkPi 10
-Note that all of the sample programs take a `<master>` parameter specifying the cluster URL
-to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls),
-or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using
-`local` for testing.
+You can also run Spark interactively through a modified version of the Scala shell. This is a
+great way to learn the framework.
-Finally, you can run Spark interactively through modified versions of the Scala shell (`./bin/spark-shell`) or
-Python interpreter (`./bin/pyspark`). These are a great way to learn the framework.
+ ./bin/spark-shell --master local[2]
+
+The `--master` option specifies the
+[master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run
+locally with one thread, or `local[N]` to run locally with N threads. You should start by using
+`local` for testing. For a full list of options, run Spark shell with the `--help` option.
+
+Spark also provides a Python interface. To run an example Spark application written in Python, use
+`bin/pyspark <program> [params]`. For example,
+
+ ./bin/pyspark examples/src/main/python/pi.py local[2] 10
+
+or simply `bin/pyspark` without any arguments to run Spark interactively in a Python interpreter.
# Launching on a Cluster
diff --git a/docs/java-programming-guide.md b/docs/java-programming-guide.md
index c34eb28fc0..943fdd9d01 100644
--- a/docs/java-programming-guide.md
+++ b/docs/java-programming-guide.md
@@ -215,7 +215,4 @@ Spark includes several sample programs using the Java API in
[`examples/src/main/java`](https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples). You can run them by passing the class name to the
`bin/run-example` script included in Spark; for example:
- ./bin/run-example org.apache.spark.examples.JavaWordCount
-
-Each example program prints usage help when run
-without any arguments.
+ ./bin/run-example JavaWordCount README.md
diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md
index 39fb5f0c99..2ce2c346d7 100644
--- a/docs/python-programming-guide.md
+++ b/docs/python-programming-guide.md
@@ -164,6 +164,6 @@ some example applications.
PySpark also includes several sample programs in the [`examples/src/main/python` folder](https://github.com/apache/spark/tree/master/examples/src/main/python).
You can run them by passing the files to `pyspark`; e.g.:
- ./bin/spark-submit examples/src/main/python/wordcount.py
+ ./bin/spark-submit examples/src/main/python/wordcount.py local[2] README.md
Each program prints usage help when run without arguments.
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 478b790f92..a4d01487bb 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -18,7 +18,9 @@ you can download a package for any version of Hadoop.
## Basics
Spark's interactive shell provides a simple way to learn the API, as well as a powerful tool to analyze datasets interactively.
-Start the shell by running `./bin/spark-shell` in the Spark directory.
+Start the shell by running the following in the Spark directory.
+
+ ./bin/spark-shell
Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's make a new RDD from the text of the README file in the Spark source directory:
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index c563594296..66c330fdee 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -54,13 +54,13 @@ For example:
--executor-memory 2g \
--executor-cores 1
lib/spark-examples*.jar \
- yarn-cluster 5
+ 10
The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the "Viewing Logs" section below for how to see driver and executor logs.
To launch a Spark application in yarn-client mode, do the same, but replace "yarn-cluster" with "yarn-client". To run spark-shell:
- $ MASTER=yarn-client ./bin/spark-shell
+ $ ./bin/spark-shell --master yarn-client
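A hedged sketch of the same SparkPi submission in client mode; the memory, core, and jar settings mirror the cluster-mode example above and are illustrative:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn-client \
      --executor-memory 2g \
      --executor-cores 1 \
      lib/spark-examples*.jar \
      10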
## Adding additional jars
@@ -70,9 +70,9 @@ In yarn-cluster mode, the driver runs on a different machine than the client, so
--master yarn-cluster \
--jars my-other-jar.jar,my-other-other-jar.jar
my-main-jar.jar
- yarn-cluster 5
+ app_arg1 app_arg2
-# Viewing logs
+# Debugging your Application
In YARN terminology, executors and application masters run inside "containers". YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the "yarn logs" command.
@@ -82,6 +82,13 @@ will print out the contents of all log files from all containers from the given
When log aggregation isn't turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID.
+To review the per-container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a
+large value (e.g. 36000), and then access the application cache through yarn.nodemanager.local-dirs
+on the nodes on which containers are launched. This directory contains the launch script, jars, and
+all environment variables used for launching each container. This process is useful for debugging
+classpath problems in particular. (Note that enabling this requires admin privileges to modify cluster
+settings and a restart of all node managers, so it is not applicable to hosted clusters.)
+
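As a rough sketch, the retained launch directories can then be inspected on a node manager host; the local-dirs root and application ID below are placeholders, and the exact layout depends on your YARN configuration:

    # Hypothetical paths: list a container's retained launch script and environment.
    ls /path/to/yarn/local-dirs/usercache/$USER/appcache/application_XXXXXXXXXXXXX_XXXX/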
# Important notes
- Before Hadoop 2.2, YARN does not support cores in container resource requests. Thus, when running against an earlier version, the numbers of cores given via command line arguments cannot be passed to YARN. Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index f25e9cca88..3ed86e460c 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -56,7 +56,7 @@ The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster UR
to connect to, or a special "local" string to run in local mode, as described below. `appName` is
a name for your application, which will be shown in the cluster web UI. It's also possible to set
these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
-which avoids hard-coding the master name in your application.
+which avoids hard-coding the master URL in your application.
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the
variable called `sc`. Making your own SparkContext will not work. You can set which master the
@@ -74,6 +74,11 @@ Or, to also add `code.jar` to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar
{% endhighlight %}
+For a complete list of options, run Spark shell with the `--help` option. Behind the scenes,
+Spark shell invokes the more general [Spark submit script](cluster-overview.html#launching-applications-with-spark-submit)
+used for launching applications, and passes on all of its parameters. As a result, these two scripts
+share the same parameters.
+
### Master URLs
The master URL passed to Spark can be in one of the following formats:
@@ -98,7 +103,7 @@ cluster mode. The cluster location will be inferred based on the local Hadoop co
</td></tr>
</table>
-If no master URL is specified, the spark shell defaults to "local[*]".
+If no master URL is specified, the spark shell defaults to `local[*]`.
# Resilient Distributed Datasets (RDDs)
@@ -432,9 +437,7 @@ res2: Int = 10
You can see some [example Spark programs](http://spark.apache.org/examples.html) on the Spark website.
In addition, Spark includes several samples in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them by passing the class name to the `bin/run-example` script included in Spark; for example:
- ./bin/run-example org.apache.spark.examples.SparkPi
-
-Each example program prints usage help when run without any arguments.
+ ./bin/run-example SparkPi
For help on optimizing your program, the [configuration](configuration.html) and
[tuning](tuning.html) guides provide information on best practices. They are especially important for
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index dc7f206e03..eb3211b6b0 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -70,7 +70,7 @@ Once you've set up this file, you can launch or stop your cluster with the follo
- `sbin/start-slaves.sh` - Starts a slave instance on each machine specified in the `conf/slaves` file.
- `sbin/start-all.sh` - Starts both a master and a number of slaves as described above.
- `sbin/stop-master.sh` - Stops the master that was started via the `bin/start-master.sh` script.
-- `sbin/stop-slaves.sh` - Stops the slave instances that were started via `bin/start-slaves.sh`.
+- `sbin/stop-slaves.sh` - Stops all slave instances on the machines specified in the `conf/slaves` file.
- `sbin/stop-all.sh` - Stops both the master and the slaves as described above.
Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
@@ -92,12 +92,8 @@ You can optionally configure the cluster further by setting environment variable
<td>Port for the master web UI (default: 8080).</td>
</tr>
<tr>
- <td><code>SPARK_WORKER_PORT</code></td>
- <td>Start the Spark worker on a specific port (default: random).</td>
- </tr>
- <tr>
- <td><code>SPARK_WORKER_DIR</code></td>
- <td>Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).</td>
+ <td><code>SPARK_MASTER_OPTS</code></td>
+ <td>Configuration properties that apply only to the master in the form "-Dx=y" (default: none).</td>
</tr>
<tr>
<td><code>SPARK_WORKER_CORES</code></td>
@@ -108,6 +104,10 @@ You can optionally configure the cluster further by setting environment variable
<td>Total amount of memory to allow Spark applications to use on the machine, e.g. <code>1000m</code>, <code>2g</code> (default: total memory minus 1 GB); note that each application's <i>individual</i> memory is configured using its <code>spark.executor.memory</code> property.</td>
</tr>
<tr>
+ <td><code>SPARK_WORKER_PORT</code></td>
+ <td>Start the Spark worker on a specific port (default: random).</td>
+ </tr>
+ <tr>
<td><code>SPARK_WORKER_WEBUI_PORT</code></td>
<td>Port for the worker web UI (default: 8081).</td>
</tr>
@@ -121,12 +121,24 @@ You can optionally configure the cluster further by setting environment variable
</td>
</tr>
<tr>
+ <td><code>SPARK_WORKER_DIR</code></td>
+ <td>Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).</td>
+ </tr>
+ <tr>
+ <td><code>SPARK_WORKER_OPTS</code></td>
+ <td>Configuration properties that apply only to the worker in the form "-Dx=y" (default: none).</td>
+ </tr>
+ <tr>
<td><code>SPARK_DAEMON_MEMORY</code></td>
<td>Memory to allocate to the Spark master and worker daemons themselves (default: 512m).</td>
</tr>
<tr>
<td><code>SPARK_DAEMON_JAVA_OPTS</code></td>
- <td>JVM options for the Spark master and worker daemons themselves (default: none).</td>
+ <td>JVM options for the Spark master and worker daemons themselves in the form "-Dx=y" (default: none).</td>
+ </tr>
+ <tr>
+ <td><code>SPARK_PUBLIC_DNS</code></td>
+ <td>The public DNS name of the Spark master and workers (default: none).</td>
</tr>
</table>
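A sketch of how the newly documented variables might appear in `conf/spark-env.sh`; the property values and the DNS name are placeholders, and `spark.worker.cleanup.appDataTtl` is the cleanup property mentioned in the cluster overview:

    # Hypothetical conf/spark-env.sh lines using the variables from the table above.
    export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"
    export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.appDataTtl=604800"
    export SPARK_PUBLIC_DNS="spark-master.example.com"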
@@ -148,38 +160,17 @@ You can also pass an option `--cores <numCores>` to control the number of cores
# Launching Compiled Spark Applications
-Spark supports two deploy modes. Spark applications may run with the driver inside the client process or entirely inside the cluster.
-
-The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For info on the lower-level invocations used to launch an app inside the cluster, read ahead.
-
-## Launching Applications Inside the Cluster
-
- ./bin/spark-class org.apache.spark.deploy.Client launch
- [client-options] \
- <cluster-url> <application-jar-url> <main-class> \
- [application-options]
-
- cluster-url: The URL of the master node.
- application-jar-url: Path to a bundled jar including your application and all dependencies. Currently, the URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
- main-class: The entry point for your application.
-
- Client Options:
- --memory <count> (amount of memory, in MB, allocated for your driver program)
- --cores <count> (number of cores allocated for your driver program)
- --supervise (whether to automatically restart your driver on application or node failure)
- --verbose (prints increased logging output)
-
-Keep in mind that your driver program will be executed on a remote worker machine. You can control the execution environment in the following ways:
-
- * _Environment variables_: These will be captured from the environment in which you launch the client and applied when launching the driver program.
- * _Java options_: You can add java options by setting `SPARK_JAVA_OPTS` in the environment in which you launch the submission client.
- * _Dependencies_: You'll still need to call `sc.addJar` inside of your program to make your bundled application jar visible on all worker nodes.
-
-Once you submit a driver program, it will appear in the cluster management UI at port 8080 and
-be assigned an identifier. If you'd like to prematurely terminate the program, you can do so using
-the same client:
+Spark supports two deploy modes: applications may run with the driver inside the client process or
+entirely inside the cluster. The
+[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) provides the
+most straightforward way to submit a compiled Spark application to the cluster in either deploy
+mode.
- ./bin/spark-class org.apache.spark.deploy.Client kill <driverId>
+If your application is launched through Spark submit, then the application jar is automatically
+distributed to all worker nodes. For any additional jars that your application depends on, you
+should specify them through the `--jars` flag using comma as a delimiter (e.g. `--jars jar1,jar2`).
+To control the application's configuration or execution environment, see
+[Spark Configuration](configuration.html).
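As an illustration of the `--jars` flag against a standalone master (the class name, jar names, and master URL are placeholders):

    ./bin/spark-submit \
      --class com.example.MyApp \
      --master spark://207.184.161.138:7077 \
      --jars extra-lib1.jar,extra-lib2.jar \
      my-app.jar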
# Resource Scheduling
@@ -203,7 +194,7 @@ default for applications that don't set `spark.cores.max` to something less than
Do this by adding the following to `conf/spark-env.sh`:
{% highlight bash %}
-export SPARK_JAVA_OPTS="-Dspark.deploy.defaultCores=<value>"
+export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=<value>"
{% endhighlight %}
This is useful on shared clusters where users might not have configured a maximum number of cores