author    Sandy Ryza <sandy@cloudera.com>      2014-03-29 14:41:36 -0700
committer Patrick Wendell <pwendell@gmail.com> 2014-03-29 14:41:36 -0700
commit    1617816090e7b20124a512a43860a21232ebf511
tree      cb6e45d21cb59edd81ab3bc29b9e00ab034bb90d
parent    3738f24421d6f3bd10e5ef9ebfc10f702a5cb7ac
SPARK-1126. spark-app preliminary
This is a starting version of the spark-app script for running compiled binaries against Spark.
It still needs tests and some polish. The only testing I've done so far has been using it to
launch jobs in yarn-standalone mode against a pseudo-distributed cluster.

This leaves out the changes required for launching python scripts. I think it might be best to
save those for another JIRA/PR (while keeping to the design so that they won't require
backwards-incompatible changes).

Author: Sandy Ryza <sandy@cloudera.com>

Closes #86 from sryza/sandy-spark-1126 and squashes the following commits:

d428d85 [Sandy Ryza] Commenting, doc, and import fixes from Patrick's comments
e7315c6 [Sandy Ryza] Fix failing tests
34de899 [Sandy Ryza] Change --more-jars to --jars and fix docs
299ddca [Sandy Ryza] Fix scalastyle
a94c627 [Sandy Ryza] Add newline at end of SparkSubmit
04bc4e2 [Sandy Ryza] SPARK-1126. spark-submit script
Diffstat (limited to 'docs')
 docs/cluster-overview.md | 50 ++++++++++++++++++++++++++++++++++++++++++++++
 docs/running-on-yarn.md  |  6 ++++--
 docs/spark-standalone.md |  7 +++++--
 3 files changed, 59 insertions(+), 4 deletions(-)
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index a555a7b502..b69e3416fb 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -50,6 +50,50 @@ The system currently supports three cluster managers:
In addition, Spark's [EC2 launch scripts](ec2-scripts.html) make it easy to launch a standalone
cluster on Amazon EC2.
+# Launching Applications
+
+The recommended way to launch a compiled Spark application is through the spark-submit script (located in
+the bin directory), which takes care of setting up the classpath with Spark and its dependencies and
+provides a layer over the different cluster managers and deploy modes that Spark supports. Its usage is
+
+    spark-submit <jar> <options>
+
+Where options are any of:
+
+- **\--class** - The main class to run.
+- **\--master** - The URL of the cluster manager master, e.g. spark://host:port, mesos://host:port, yarn,
+  or local.
+- **\--deploy-mode** - "client" to run the driver in the client process or "cluster" to run the driver in
+  a process on the cluster. For Mesos, only "client" is supported.
+- **\--executor-memory** - Memory per executor (e.g. 1000M, 2G).
+- **\--executor-cores** - Number of cores per executor. (Default: 2)
+- **\--driver-memory** - Memory for driver (e.g. 1000M, 2G)
+- **\--name** - Name of the application.
+- **\--arg** - Argument to be passed to the application's main class. This option can be specified
+  multiple times to pass multiple arguments.
+- **\--jars** - A comma-separated list of local jars to include on the driver's classpath and to make
+  available to SparkContext.addJar. This option does not work in the standalone cluster deploy mode.
+
+The following options currently work only for Spark standalone with cluster deploy mode:
+
+- **\--driver-cores** - Cores for driver (Default: 1).
+- **\--supervise** - If given, restarts the driver on failure.
+
+The following option works only for Spark standalone and Mesos:
+
+- **\--total-executor-cores** - Total cores for all executors.
+
+The following options currently work only for YARN:
+
+- **\--queue** - The YARN queue to place the application in.
+- **\--files** - Comma-separated list of files to be placed in the working directory of each executor.
+- **\--archives** - Comma-separated list of archives to be extracted into the working directory of each
+  executor.
+- **\--num-executors** - Number of executors (Default: 2).
+
+The master and deploy mode can also be set with the MASTER and DEPLOY_MODE environment variables.
+Values passed on the command line take precedence over these environment variables.
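+
+For example, a submission of a compiled application jar to a YARN cluster might look like the
+following (the jar name, class name, and argument are illustrative placeholders):
+
+    spark-submit my-app.jar \
+      --class com.example.MyApp \
+      --master yarn \
+      --deploy-mode cluster \
+      --executor-memory 2G \
+      --num-executors 4 \
+      --arg input.txt
+
+Equivalently, the master and deploy mode can come from the environment:
+
+    MASTER=yarn DEPLOY_MODE=cluster spark-submit my-app.jar --class com.example.MyApp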
+
# Shipping Code to the Cluster
The recommended way to ship your code to the cluster is to pass it through SparkContext's constructor,
@@ -103,6 +147,12 @@ The following table summarizes terms you'll see used to refer to cluster concept
<td>An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)</td>
</tr>
<tr>
+ <td>Deploy mode</td>
+ <td>Distinguishes where the driver process runs. In "cluster" mode, the framework launches
+ the driver inside of the cluster. In "client" mode, the submitter launches the driver
+ outside of the cluster.</td>
+  </tr>
+  <tr>
<td>Worker node</td>
<td>Any node that can run application code in the cluster</td>
</tr>
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 2e9dec4856..d8657c4bc7 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -48,10 +48,12 @@ System Properties:
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster.
These configs are used to connect to the cluster, write to the dfs, and connect to the YARN ResourceManager.
-There are two scheduler modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
+There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
Unlike in Spark standalone and Mesos mode, in which the master's address is specified in the "master" parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the master parameter is simply "yarn-client" or "yarn-cluster".
+The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to YARN in either deploy mode. For info on the lower-level invocations it uses, read ahead. For running spark-shell against YARN, skip down to the yarn-client section.
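+
+For example (jar and class names are illustrative placeholders), the two deploy modes map onto
+spark-submit as follows:
+
+    # Driver runs inside a YARN application master on the cluster:
+    spark-submit my-app.jar --class com.example.MyApp --master yarn --deploy-mode cluster
+
+    # Driver runs locally in the client process:
+    spark-submit my-app.jar --class com.example.MyApp --master yarn --deploy-mode client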
+
## Launching a Spark application with yarn-cluster mode.
The command to launch the Spark application on the cluster is as follows:
@@ -121,7 +123,7 @@ or
MASTER=yarn-client ./bin/spark-shell
-## Viewing logs
+# Viewing logs
In YARN terminology, executors and application masters run inside "containers". YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the "yarn logs" command.
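
For example, once an application has finished with log aggregation enabled, its logs can be
fetched from anywhere on the cluster with (the application ID is a placeholder):

    yarn logs -applicationId <app-id>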
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index 51fb3a4f7f..7e4eea323a 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -146,10 +146,13 @@ automatically set MASTER from the `SPARK_MASTER_IP` and `SPARK_MASTER_PORT` vari
You can also pass an option `-c <numCores>` to control the number of cores that spark-shell uses on the cluster.
-# Launching Applications Inside the Cluster
+# Launching Compiled Spark Applications
-You may also run your application entirely inside of the cluster by submitting your application driver using the submission client. The syntax for submitting applications is as follows:
+Spark supports two deploy modes: applications may run with the driver inside the client process or
+entirely inside the cluster.
+The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For info on the lower-level invocations used to launch an app inside the cluster, read ahead.
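+
+For example, a cluster-mode submission to a standalone master might look like the following (the
+jar, class, and host names are illustrative placeholders):
+
+    spark-submit my-app.jar \
+      --class com.example.MyApp \
+      --master spark://master-host:7077 \
+      --deploy-mode cluster \
+      --supervise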
+
+## Launching Applications Inside the Cluster
./bin/spark-class org.apache.spark.deploy.Client launch
[client-options] \