Diffstat (limited to 'docs/cluster-overview.md')
-rw-r--r-- | docs/cluster-overview.md | 108
1 file changed, 6 insertions, 102 deletions
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index f05a755de7..6a75d5c457 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -4,7 +4,8 @@ title: Cluster Mode Overview
 ---
 
 This document gives a short overview of how Spark runs on clusters, to make it easier to understand
-the components involved.
+the components involved. Read through the [application submission guide](submitting-applications.html)
+to submit applications to a cluster.
 
 # Components
 
@@ -50,107 +51,10 @@ The system currently supports three cluster managers:
 In addition, Spark's [EC2 launch scripts](ec2-scripts.html) make it easy to launch a
 standalone cluster on Amazon EC2.
 
-# Bundling and Launching Applications
-
-### Bundling Your Application's Dependencies
-If your code depends on other projects, you will need to package them alongside
-your application in order to distribute the code to a Spark cluster. To do this,
-create an assembly jar (or "uber" jar) containing your code and its dependencies. Both
-[sbt](https://github.com/sbt/sbt-assembly) and
-[Maven](http://maven.apache.org/plugins/maven-shade-plugin/)
-have assembly plugins. When creating assembly jars, list Spark and Hadoop
-as `provided` dependencies; these need not be bundled since they are provided by
-the cluster manager at runtime. Once you have an assembled jar you can call the `bin/spark-submit`
-script as shown here while passing your jar.
-
-For Python, you can use the `pyFiles` argument of `SparkContext`
-or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed.
-
-### Launching Applications with Spark submit
-
-Once a user application is bundled, it can be launched using the `spark-submit` script located in
-the `bin` directory. This script takes care of setting up the classpath with Spark and its
-dependencies, and supports the different cluster managers and deploy modes that Spark offers:
-
-    ./bin/spark-submit \
-      --class <main-class> \
-      --master <master-url> \
-      --deploy-mode <deploy-mode> \
-      ... # other options
-      <application-jar> \
-      [application-arguments]
-
-    main-class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
-    master-url: The URL of the master node (e.g. spark://23.195.26.187:7077)
-    deploy-mode: Whether to deploy this application within the cluster or from an external client (e.g. client)
-    application-jar: Path to a bundled jar including your application and all dependencies. The URL
-      must be globally visible inside your cluster, for instance, an `hdfs://` path or a `file://`
-      path that is present on all nodes.
-    application-arguments: Space-delimited arguments passed to the main method of <main-class>, if any
-
-To enumerate all options available to `spark-submit`, run it with the `--help` flag. Here are a few
-examples of common options:
-
-{% highlight bash %}
-# Run application locally
-./bin/spark-submit \
-  --class org.apache.spark.examples.SparkPi \
-  --master local[8] \
-  /path/to/examples.jar \
-  100
-
-# Run on a Spark standalone cluster
-./bin/spark-submit \
-  --class org.apache.spark.examples.SparkPi \
-  --master spark://207.184.161.138:7077 \
-  --executor-memory 20G \
-  --total-executor-cores 100 \
-  /path/to/examples.jar \
-  1000
-
-# Run on a YARN cluster (use `yarn-client` as the master for client mode)
-HADOOP_CONF_DIR=XX ./bin/spark-submit \
-  --class org.apache.spark.examples.SparkPi \
-  --master yarn-cluster \
-  --executor-memory 20G \
-  --num-executors 50 \
-  /path/to/examples.jar \
-  1000
-{% endhighlight %}
-
-### Loading Configurations from a File
-
-The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
-properties file and pass them on to your application. By default it will read configuration options
-from `conf/spark-defaults.conf`. For more detail, see the section on
-[loading default configurations](configuration.html#loading-default-configurations).
-
-Loading default Spark configurations this way can obviate the need for certain flags to
-`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
-`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
-`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
-defaults file.
-
-If you are ever unclear where configuration options are coming from, you can print out fine-grained
-debugging information by running `spark-submit` with the `--verbose` option.
-
-### Advanced Dependency Management
-When using `spark-submit`, the application jar along with any jars included with the `--jars` option
-will be automatically transferred to the cluster. Spark uses the following URL schemes to allow
-different strategies for disseminating jars:
-
-- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
-  every executor pulls the file from the driver HTTP server.
-- **hdfs:**, **http:**, **https:**, **ftp:** - these pull down files and JARs from the URI as expected
-- **local:** - a URI starting with `local:/` is expected to exist as a local file on each worker node.
-  This means that no network IO will be incurred, and works well for large files/JARs that are pushed
-  to each worker, or shared via NFS, GlusterFS, etc.
-
-Note that JARs and files are copied to the working directory for each `SparkContext` on the executor
-nodes. This can use up a significant amount of space over time and will need to be cleaned up. With
-YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured
-with the `spark.worker.cleanup.appDataTtl` property.
-
-For Python, the equivalent `--py-files` option can be used to distribute `.egg` and `.zip` libraries
-to executors.
+# Submitting Applications
+
+Applications can be submitted to a cluster of any type using the `spark-submit` script.
+The [application submission guide](submitting-applications.html) describes how to do this.
 
 # Monitoring
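
For quick reference, the `spark-submit` entry point itself is unchanged by this commit; only its documentation moves to the submission guide. A minimal sketch of an invocation, based on the local-mode example removed above (the SparkPi class ships with Spark's bundled examples; the jar path is a placeholder for wherever the examples jar lives in your build):

    # Run the bundled SparkPi example locally on 8 cores.
    # /path/to/examples.jar is a placeholder; substitute the examples jar from your Spark distribution.
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master local[8] \
      /path/to/examples.jar \
      100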