author | Tathagata Das <tathagata.das1565@gmail.com> | 2014-12-11 06:21:23 -0800
---|---|---
committer | Tathagata Das <tathagata.das1565@gmail.com> | 2014-12-11 06:21:23 -0800
commit | b004150adb503ddbb54d5cd544e39ad974497c41 (patch) |
tree | d278b4cd3c2311cef7394d1c65d530c5530d3c2b /docs/submitting-applications.md |
parent | 2a5b5fd4ccf28fab5b7e32a54170be92d5d23ba6 (diff) |
[SPARK-4806] Streaming doc update for 1.2
Important updates to the streaming programming guide:
- Make the fault-tolerance properties easier to understand, with information about write ahead logs (a configuration sketch follows this list)
- Update the information about deploying a Spark Streaming application, with details on Driver HA
- Update the receiver guide to discuss reliable vs. unreliable receivers.
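For reference, the write ahead log discussed above is switched on with a single configuration flag. The sketch below is illustrative and not part of this commit: it assumes the configuration key `spark.streaming.receiver.writeAheadLog.enable` (check the configuration page of your Spark version for the exact spelling) and uses a hypothetical application class, jar path, and checkpoint directory.

{% highlight bash %}
# Illustrative sketch (not from this commit): enable the receiver write
# ahead log when submitting a streaming app. The class name, jar path,
# and checkpoint directory below are hypothetical placeholders.
./bin/spark-submit \
  --class com.example.MyStreamingApp \
  --master spark://207.184.161.138:7077 \
  --conf spark.streaming.receiver.writeAheadLog.enable=true \
  /path/to/streaming-app.jar
# The WAL only helps if received data can be logged durably, so the app
# must also set a fault-tolerant checkpoint directory, e.g.
#   ssc.checkpoint("hdfs://namenode:8020/checkpoint")
{% endhighlight %}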
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Closes #3653 from tdas/streaming-doc-update-1.2 and squashes the following commits:
f53154a [Tathagata Das] Addressed Josh's comments.
ce299e4 [Tathagata Das] Minor update.
ca19078 [Tathagata Das] Minor change
f746951 [Tathagata Das] Mentioned performance problem with WAL
7787209 [Tathagata Das] Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into streaming-doc-update-1.2
2184729 [Tathagata Das] Updated Kafka and Flume guides with reliability information.
2f3178c [Tathagata Das] Added more information about writing reliable receivers in the custom receiver guide.
91aa5aa [Tathagata Das] Improved API Docs menu
5707581 [Tathagata Das] Added Python API badge
b9c8c24 [Tathagata Das] Merge pull request #26 from JoshRosen/streaming-programming-guide
b8c8382 [Josh Rosen] minor fixes
a4ef126 [Josh Rosen] Restructure parts of the fault-tolerance section to read a bit nicer when skipping over the headings
65f66cd [Josh Rosen] Fix broken link to fault-tolerance semantics section.
f015397 [Josh Rosen] Minor grammar / pluralization fixes.
3019f3a [Josh Rosen] Fix minor Markdown formatting issues
aa8bb87 [Tathagata Das] Small update.
195852c [Tathagata Das] Updated based on Josh's comments, updated receiver reliability and deploying section, and also updated configuration.
17b99fb [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-doc-update-1.2
a0217c0 [Tathagata Das] Changed Deploying menu layout
67fcffc [Tathagata Das] Added cluster mode + supervise example to submitting application guide.
e45453b [Tathagata Das] Update streaming guide, added deploying section.
192c7a7 [Tathagata Das] Added more info about Python API, and rewrote the checkpointing section.
Diffstat (limited to 'docs/submitting-applications.md')
-rw-r--r-- | docs/submitting-applications.md | 36
1 file changed, 26 insertions, 10 deletions
diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md
index 45b70b1a54..2581c9f69f 100644
--- a/docs/submitting-applications.md
+++ b/docs/submitting-applications.md
@@ -43,17 +43,18 @@ Some of the commonly used options are:
 * `--class`: The entry point for your application (e.g. `org.apache.spark.examples.SparkPi`)
 * `--master`: The [master URL](#master-urls) for the cluster (e.g. `spark://23.195.26.187:7077`)
-* `--deploy-mode`: Whether to deploy your driver on the worker nodes (`cluster`) or locally as an external client (`client`) (default: `client`)*
+* `--deploy-mode`: Whether to deploy your driver on the worker nodes (`cluster`) or locally as an external client (`client`) (default: `client`) <b> † </b>
 * `--conf`: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap "key=value" in quotes (as shown).
 * `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
 * `application-arguments`: Arguments passed to the main method of your main class, if any

-*A common deployment strategy is to submit your application from a gateway machine that is
+<b>†</b> A common deployment strategy is to submit your application from a gateway machine
+that is
 physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster).
 In this setup, `client` mode is appropriate. In `client` mode, the driver is launched directly
-within the client `spark-submit` process, with the input and output of the application attached
-to the console. Thus, this mode is especially suitable for applications that involve the REPL
-(e.g. Spark shell).
+within the `spark-submit` process which acts as a *client* to the cluster. The input and
+output of the application is attached to the console. Thus, this mode is especially suitable
+for applications that involve the REPL (e.g. Spark shell).

 Alternatively, if your application is submitted from a machine far from the worker machines
 (e.g. locally on your laptop), it is common to use `cluster` mode to minimize network latency between
@@ -63,8 +64,12 @@ clusters, Mesos clusters, or python applications.

 For Python applications, simply pass a `.py` file in the place of `<application-jar>`
 instead of a JAR, and add Python `.zip`, `.egg` or `.py` files to the search path with `--py-files`.

-To enumerate all options available to `spark-submit` run it with `--help`. Here are a few
-examples of common options:
+There are a few options available that are specific to the
+[cluster manager](cluster-overview.html#cluster-manager-types) that is being used.
+For example, with a [Spark Standalone](spark-standalone.html) cluster with `cluster` deploy mode,
+you can also specify `--supervise` to make sure that the driver is automatically restarted if it
+fails with non-zero exit code. To enumerate all such options available to `spark-submit`,
+run it with `--help`. Here are a few examples of common options:

 {% highlight bash %}
 # Run application locally on 8 cores
@@ -74,7 +79,7 @@ examples of common options:
   /path/to/examples.jar \
   100

-# Run on a Spark standalone cluster
+# Run on a Spark Standalone cluster in client deploy mode
 ./bin/spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master spark://207.184.161.138:7077 \
@@ -83,6 +88,17 @@ examples of common options:
   /path/to/examples.jar \
   1000

+# Run on a Spark Standalone cluster in cluster deploy mode with supervise
+./bin/spark-submit \
+  --class org.apache.spark.examples.SparkPi \
+  --master spark://207.184.161.138:7077 \
+  --deploy-mode cluster \
+  --supervise \
+  --executor-memory 20G \
+  --total-executor-cores 100 \
+  /path/to/examples.jar \
+  1000
+
 # Run on a YARN cluster
 export HADOOP_CONF_DIR=XXX
 ./bin/spark-submit \
@@ -93,7 +109,7 @@ export HADOOP_CONF_DIR=XXX
   /path/to/examples.jar \
   1000

-# Run a Python application on a cluster
+# Run a Python application on a Spark Standalone cluster
 ./bin/spark-submit \
   --master spark://207.184.161.138:7077 \
   examples/src/main/python/pi.py \
   1000
@@ -163,5 +179,5 @@ to executors.

 # More Information

-Once you have deployed your application, the [cluster mode overview](cluster-overview.html) describes
+Once you have deployed your application, the [cluster mode overview](cluster-overview.html) describes
 the components involved in distributed execution, and how to monitor and debug applications.
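The diff above mentions shipping Python dependencies with `--py-files` but never shows it in the examples. A minimal sketch, assuming hypothetical files `app.py` (the entry script) and `deps.zip` (a zip of its helper modules):

{% highlight bash %}
# Hypothetical files: app.py is the entry script, deps.zip bundles its
# helper modules. --py-files adds the archive to the Python search path
# on the driver and executors before app.py runs.
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  --py-files deps.zip \
  app.py \
  1000
{% endhighlight %}

The flag takes a comma-separated list of `.zip`, `.egg`, or `.py` files, matching the prose in the second hunk above.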