path: root/docs/cluster-overview.md
author    Andrew Or <andrewor14@gmail.com>    2014-05-12 19:44:14 -0700
committer Patrick Wendell <pwendell@gmail.com>    2014-05-12 19:44:14 -0700
commit    2ffd1eafd28635dcecc0ac738d4a62c05d740925 (patch)
tree      0c2b30a97dfd24fc6268d4f429111fe6c7348bbe /docs/cluster-overview.md
parent    ba96bb3d591130075763706526f86fb2aaffa3ae (diff)
[SPARK-1753 / 1773 / 1814] Update outdated docs for spark-submit, YARN, standalone etc.
YARN
- SparkPi was updated to not take in master as an argument; we should update the docs to reflect that.
- The default YARN build guide should be in maven, not sbt.
- This PR also adds a paragraph on steps to debug a YARN application.

Standalone
- Emphasize spark-submit more. Right now it's one small paragraph preceding the legacy way of launching through `org.apache.spark.deploy.Client`.
- The way we set configurations / environment variables according to the old docs is outdated. This needs to reflect changes introduced by the Spark configuration changes we made.

In general, this PR also adds a little more documentation on the new spark-shell, spark-submit, spark-defaults.conf etc here and there.

Author: Andrew Or <andrewor14@gmail.com>

Closes #701 from andrewor14/yarn-docs and squashes the following commits:

e2c2312 [Andrew Or] Merge in changes in #752 (SPARK-1814)
25cfe7b [Andrew Or] Merge in the warning from SPARK-1753
a8c39c5 [Andrew Or] Minor changes
336bbd9 [Andrew Or] Tabs -> spaces
4d9d8f7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
041017a [Andrew Or] Abstract Spark submit documentation to cluster-overview.html
3cc0649 [Andrew Or] Detail how to set configurations + remove legacy instructions
5b7140a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
85a51fc [Andrew Or] Update run-example, spark-shell, configuration etc.
c10e8c7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
381fe32 [Andrew Or] Update docs for standalone mode
757c184 [Andrew Or] Add a note about the requirements for the debugging trick
f8ca990 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
924f04c [Andrew Or] Revert addition of --deploy-mode
d5fe17b [Andrew Or] Update the YARN docs
Diffstat (limited to 'docs/cluster-overview.md')
-rw-r--r--  docs/cluster-overview.md  73
1 files changed, 45 insertions, 28 deletions
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index 162c415b58..f05a755de7 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -66,62 +66,76 @@ script as shown here while passing your jar.
For Python, you can use the `pyFiles` argument of SparkContext
or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed.
-### Launching Applications with ./bin/spark-submit
+### Launching Applications with Spark submit
Once a user application is bundled, it can be launched using the `spark-submit` script located in
the bin directory. This script takes care of setting up the classpath with Spark and its
-dependencies, and can support different cluster managers and deploy modes that Spark supports.
-It's usage is
+dependencies, and can support different cluster managers and deploy modes that Spark supports:
- ./bin/spark-submit --class path.to.your.Class [options] <app jar> [app options]
+ ./bin/spark-submit \
+ --class <main-class> \
+ --master <master-url> \
+ --deploy-mode <deploy-mode> \
+ ... # other options
+ <application-jar> \
+ [application-arguments]
-When calling `spark-submit`, `[app options]` will be passed along to your application's
-main class. To enumerate all options available to `spark-submit` run it with
-the `--help` flag. Here are a few examples of common options:
+ main-class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
+ master-url: The URL of the master node (e.g. spark://23.195.26.187:7077)
+ deploy-mode: Whether to deploy this application within the cluster or from an external client (e.g. client)
+ application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
+ application-arguments: Space delimited arguments passed to the main method of <main-class>, if any
+
+To enumerate all options available to `spark-submit` run it with the `--help` flag. Here are a few
+examples of common options:
{% highlight bash %}
# Run application locally
./bin/spark-submit \
- --class my.main.ClassName
+ --class org.apache.spark.examples.SparkPi \
--master local[8] \
- my-app.jar
+ /path/to/examples.jar \
+ 100
# Run on a Spark standalone cluster
./bin/spark-submit \
- --class my.main.ClassName
- --master spark://mycluster:7077 \
+ --class org.apache.spark.examples.SparkPi \
+ --master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
- my-app.jar
+ /path/to/examples.jar \
+ 1000
# Run on a YARN cluster
-HADOOP_CONF_DIR=XX /bin/spark-submit \
- --class my.main.ClassName
+HADOOP_CONF_DIR=XX ./bin/spark-submit \
+ --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \ # can also be `yarn-client` for client mode
--executor-memory 20G \
--num-executors 50 \
- my-app.jar
+ /path/to/examples.jar \
+ 1000
{% endhighlight %}
### Loading Configurations from a File
-The `spark-submit` script can load default `SparkConf` values from a properties file and pass them
-onto your application. By default it will read configuration options from
-`conf/spark-defaults.conf`. Any values specified in the file will be passed on to the
-application when run. They can obviate the need for certain flags to `spark-submit`: for
-instance, if `spark.master` property is set, you can safely omit the
+The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
+properties file and pass them on to your application. By default it will read configuration options
+from `conf/spark-defaults.conf`. For more detail, see the section on
+[loading default configurations](configuration.html#loading-default-configurations).
+
+Loading default Spark configurations this way can obviate the need for certain flags to
+`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
-`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values
-in the defaults file.
+`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
+defaults file.
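For illustration, a minimal sketch of this mechanism (the property names and flags below are standard Spark settings, but the values and paths are made up):

{% highlight bash %}
# Hypothetical contents of conf/spark-defaults.conf:
#   spark.master            spark://207.184.161.138:7077
#   spark.executor.memory   2g

# With spark.master set in the defaults file, the --master flag can be omitted:
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  /path/to/examples.jar \
  1000
{% endhighlight %}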
-If you are ever unclear where configuration options are coming from. fine-grained debugging
-information can be printed by adding the `--verbose` option to `./spark-submit`.
+If you are ever unclear where configuration options are coming from, you can print out fine-grained
+debugging information by running `spark-submit` with the `--verbose` option.
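As a quick sketch, reusing the example jar path from above:

{% highlight bash %}
# Print fine-grained information about where each configuration value comes from
./bin/spark-submit --verbose \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100
{% endhighlight %}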
### Advanced Dependency Management
-When using `./bin/spark-submit` the app jar along with any jars included with the `--jars` option
-will be automatically transferred to the cluster. `--jars` can also be used to distribute .egg and .zip
-libraries for Python to executors. Spark uses the following URL scheme to allow different
-strategies for disseminating jars:
+When using `spark-submit`, the application jar along with any jars included with the `--jars` option
+will be automatically transferred to the cluster. Spark uses the following URL scheme to allow
+different strategies for disseminating jars:
- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
every executor pulls the file from the driver HTTP server.
@@ -135,6 +149,9 @@ This can use up a significant amount of space over time and will need to be clea
is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
`spark.worker.cleanup.appDataTtl` property.
+For Python, the equivalent `--py-files` option can be used to distribute .egg and .zip libraries
+to executors.
+
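As a rough sketch of both options (all dependency paths below are hypothetical, and the second
command assumes a PySpark application script):

{% highlight bash %}
# Ship two extra jars to the cluster along with the application jar (hypothetical paths)
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --jars /path/to/dep-one.jar,hdfs://namenode:8020/libs/dep-two.jar \
  /path/to/examples.jar \
  1000

# For a Python application, distribute .zip/.egg dependencies with --py-files (hypothetical files)
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  --py-files deps.zip,helpers.egg \
  my_script.py
{% endhighlight %}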
# Monitoring
Each driver program has a web UI, typically on port 4040, that displays information about running