From b004150adb503ddbb54d5cd544e39ad974497c41 Mon Sep 17 00:00:00 2001
From: Tathagata Das
Date: Thu, 11 Dec 2014 06:21:23 -0800
Subject: [SPARK-4806] Streaming doc update for 1.2

Important updates to the streaming programming guide:

- Make the fault-tolerance properties easier to understand, with information about write ahead logs
- Update the information about deploying the Spark Streaming app with information about Driver HA
- Update the Receiver guide to discuss reliable vs. unreliable receivers.

Author: Tathagata Das
Author: Josh Rosen
Author: Josh Rosen

Closes #3653 from tdas/streaming-doc-update-1.2 and squashes the following commits:

f53154a [Tathagata Das] Addressed Josh's comments.
ce299e4 [Tathagata Das] Minor update.
ca19078 [Tathagata Das] Minor change
f746951 [Tathagata Das] Mentioned performance problem with WAL
7787209 [Tathagata Das] Merge branch 'streaming-doc-update-1.2' of github.com:tdas/spark into streaming-doc-update-1.2
2184729 [Tathagata Das] Updated Kafka and Flume guides with reliability information.
2f3178c [Tathagata Das] Added more information about writing reliable receivers in the custom receiver guide.
91aa5aa [Tathagata Das] Improved API Docs menu
5707581 [Tathagata Das] Added Pythn API badge
b9c8c24 [Tathagata Das] Merge pull request #26 from JoshRosen/streaming-programming-guide
b8c8382 [Josh Rosen] minor fixes
a4ef126 [Josh Rosen] Restructure parts of the fault-tolerance section to read a bit nicer when skipping over the headings
65f66cd [Josh Rosen] Fix broken link to fault-tolerance semantics section.
f015397 [Josh Rosen] Minor grammar / pluralization fixes.
3019f3a [Josh Rosen] Fix minor Markdown formatting issues
aa8bb87 [Tathagata Das] Small update.
195852c [Tathagata Das] Updated based on Josh's comments, updated receiver reliability and deploying section, and also updated configuration.
17b99fb [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-doc-update-1.2
a0217c0 [Tathagata Das] Changed Deploying menu layout
67fcffc [Tathagata Das] Added cluster mode + supervise example to submitting application guide.
e45453b [Tathagata Das] Update streaming guide, added deploying section.
192c7a7 [Tathagata Das] Added more info about Python API, and rewrote the checkpointing section.
---
 docs/submitting-applications.md | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

(limited to 'docs/submitting-applications.md')

diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md
index 45b70b1a54..2581c9f69f 100644
--- a/docs/submitting-applications.md
+++ b/docs/submitting-applications.md
@@ -43,17 +43,18 @@ Some of the commonly used options are:
 * `--class`: The entry point for your application (e.g. `org.apache.spark.examples.SparkPi`)
 * `--master`: The [master URL](#master-urls) for the cluster (e.g. `spark://23.195.26.187:7077`)
-* `--deploy-mode`: Whether to deploy your driver on the worker nodes (`cluster`) or locally as an external client (`client`) (default: `client`)*
+* `--deploy-mode`: Whether to deploy your driver on the worker nodes (`cluster`) or locally as an external client (`client`) (default: `client`)
 * `--conf`: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap "key=value" in quotes (as shown).
 * `application-jar`: Path to a bundled jar including your application and all dependencies.
   The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
 * `application-arguments`: Arguments passed to the main method of your main class, if any
 
-*A common deployment strategy is to submit your application from a gateway machine that is
+A common deployment strategy is to submit your application from a gateway machine
+that is
 physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster).
 In this setup, `client` mode is appropriate. In `client` mode, the driver is launched directly
-within the client `spark-submit` process, with the input and output of the application attached
-to the console. Thus, this mode is especially suitable for applications that involve the REPL
-(e.g. Spark shell).
+within the `spark-submit` process, which acts as a *client* to the cluster. The input and
+output of the application are attached to the console. Thus, this mode is especially suitable
+for applications that involve the REPL (e.g. Spark shell).
 
 Alternatively, if your application is submitted from a machine far from the worker machines
 (e.g. locally on your laptop), it is common to use `cluster` mode to minimize network latency between
@@ -63,8 +64,12 @@ clusters, Mesos clusters, or python applications.
 For Python applications, simply pass a `.py` file in the place of `<application-jar>` instead of a JAR,
 and add Python `.zip`, `.egg` or `.py` files to the search path with `--py-files`.
 
-To enumerate all options available to `spark-submit` run it with `--help`. Here are a few
-examples of common options:
+There are a few options available that are specific to the
+[cluster manager](cluster-overview.html#cluster-manager-types) that is being used.
+For example, on a [Spark Standalone](spark-standalone.html) cluster in `cluster` deploy mode,
+you can also specify `--supervise` to make sure that the driver is automatically restarted if it
+fails with a non-zero exit code. To enumerate all such options available to `spark-submit`,
+run it with `--help`. Here are a few examples of common options:
 
 {% highlight bash %}
 # Run application locally on 8 cores
@@ -74,7 +79,7 @@ examples of common options:
   /path/to/examples.jar \
   100
 
-# Run on a Spark standalone cluster
+# Run on a Spark Standalone cluster in client deploy mode
 ./bin/spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master spark://207.184.161.138:7077 \
@@ -83,6 +88,17 @@ examples of common options:
   /path/to/examples.jar \
   1000
 
+# Run on a Spark Standalone cluster in cluster deploy mode with supervise
+./bin/spark-submit \
+  --class org.apache.spark.examples.SparkPi \
+  --master spark://207.184.161.138:7077 \
+  --deploy-mode cluster \
+  --supervise \
+  --executor-memory 20G \
+  --total-executor-cores 100 \
+  /path/to/examples.jar \
+  1000
+
 # Run on a YARN cluster
 export HADOOP_CONF_DIR=XXX
 ./bin/spark-submit \
@@ -93,7 +109,7 @@ export HADOOP_CONF_DIR=XXX
   /path/to/examples.jar \
   1000
 
-# Run a Python application on a cluster
+# Run a Python application on a Spark Standalone cluster
 ./bin/spark-submit \
   --master spark://207.184.161.138:7077 \
   examples/src/main/python/pi.py \
@@ -163,5 +179,5 @@ to executors.
 
 # More Information
 
-Once you have deployed your application, the [cluster mode overview](cluster-overview.html) describes 
+Once you have deployed your application, the [cluster mode overview](cluster-overview.html) describes
 the components involved in distributed execution, and how to monitor and debug applications.
--
cgit v1.2.3
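
Editor's note: the `--conf` bullet in the hunk above tells readers to wrap a `key=value` pair in quotes when the value contains spaces ("as shown"), but no quoted example survives in this file selection of the patch. A minimal sketch of that usage, not part of the patch itself, reusing the standalone master and example jar from the diff; `spark.eventLog.enabled` and `spark.executor.extraJavaOptions` are standard Spark configuration properties chosen here purely for illustration:

{% highlight bash %}
# Two --conf flags: a plain key=value, and one whose value contains
# spaces and therefore must be wrapped in quotes (illustrative sketch)
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  /path/to/examples.jar \
  100
{% endhighlight %}

The quotes keep the whole `key=value` pair a single argument to `spark-submit`; without them the shell would split the value at the first space.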