From 2173f6c7cac877a3b756d63aabf7bdd06a18e6d9 Mon Sep 17 00:00:00 2001
From: Matei Zaharia
Date: Mon, 21 Jan 2013 13:02:40 -0800
Subject: Clarify the documentation on env variables for standalone mode

---
 docs/spark-standalone.md | 43 +++++++++++++++++++++----------------------
 1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index e0ba7c35cb..bf296221b8 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -51,11 +51,11 @@ Finally, the following configuration options can be passed to the master and wor
   </tr>
   <tr>
     <td><code>-c CORES</code>, <code>--cores CORES</code></td>
-    <td>Number of CPU cores to use (default: all available); only on worker</td>
+    <td>Total CPU cores to allow Spark jobs to use on the machine (default: all available); only on worker</td>
   </tr>
   <tr>
     <td><code>-m MEM</code>, <code>--memory MEM</code></td>
-    <td>Amount of memory to use, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker</td>
+    <td>Total amount of memory to allow Spark jobs to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker</td>
   </tr>
   <tr>
     <td><code>-d DIR</code>, <code>--work-dir DIR</code></td>
@@ -66,9 +66,20 @@ Finally, the following configuration options can be passed to the master and wor
 
 # Cluster Launch Scripts
 
-To launch a Spark standalone cluster with the deploy scripts, you need to set up two files, `conf/spark-env.sh` and `conf/slaves`. The `conf/spark-env.sh` file lets you specify global settings for the master and slave instances, such as memory, or port numbers to bind to, while `conf/slaves` is a list of slave nodes. The system requires that all the slave machines have the same configuration files, so *copy these files to each machine*.
+To launch a Spark standalone cluster with the deploy scripts, you need to create a file called `conf/slaves` in your Spark directory, which should contain the hostnames of all the machines where you would like to start Spark workers, one per line. The master machine must be able to access each of the slave machines via password-less `ssh` (using a private key). For testing, you can just put `localhost` in this file.
 
-In `conf/spark-env.sh`, you can set the following parameters, in addition to the [standard Spark configuration settings](configuration.html):
+Once you've set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop's deploy scripts, and available in `SPARK_HOME/bin`:
+
+- `bin/start-master.sh` - Starts a master instance on the machine the script is executed on.
+- `bin/start-slaves.sh` - Starts a slave instance on each machine specified in the `conf/slaves` file.
+- `bin/start-all.sh` - Starts both a master and a number of slaves as described above.
+- `bin/stop-master.sh` - Stops the master that was started via the `bin/start-master.sh` script.
+- `bin/stop-slaves.sh` - Stops the slave instances that were started via `bin/start-slaves.sh`.
+- `bin/stop-all.sh` - Stops both the master and the slaves as described above.
+
+Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
+
+You can optionally configure the cluster further by setting environment variables in `conf/spark-env.sh`. Create this file by starting with the `conf/spark-env.sh.template`, and _copy it to all your worker machines_ for the settings to take effect. The following settings are available:
 
 <table class="table">
   <tr><th>Environment Variable</th><th>Meaning</th></tr>
@@ -88,36 +99,24 @@ In `conf/spark-env.sh`, you can set the following parameters, in addition to the
     <td><code>SPARK_WORKER_PORT</code></td>
     <td>Start the Spark worker on a specific port (default: random)</td>
   </tr>
+  <tr>
+    <td><code>SPARK_WORKER_DIR</code></td>
+    <td>Directory to run jobs in, which will include both logs and scratch space (default: SPARK_HOME/work)</td>
+  </tr>
   <tr>
     <td><code>SPARK_WORKER_CORES</code></td>
-    <td>Number of cores to use (default: all available cores)</td>
+    <td>Total number of cores to allow Spark jobs to use on the machine (default: all available cores)</td>
   </tr>
   <tr>
     <td><code>SPARK_WORKER_MEMORY</code></td>
-    <td>How much memory to use, e.g. 1000M, 2G (default: total memory minus 1 GB)</td>
+    <td>Total amount of memory to allow Spark jobs to use on the machine, e.g. 1000M, 2G (default: total memory minus 1 GB); note that each job's individual memory is configured using <code>SPARK_MEM</code></td>
   </tr>
   <tr>
     <td><code>SPARK_WORKER_WEBUI_PORT</code></td>
     <td>Port for the worker web UI (default: 8081)</td>
   </tr>
-  <tr>
-    <td><code>SPARK_WORKER_DIR</code></td>
-    <td>Directory to run jobs in, which will include both logs and scratch space (default: SPARK_HOME/work)</td>
-  </tr>
 </table>
 
-In `conf/slaves`, include a list of all machines where you would like to start a Spark worker, one per line. The master machine must be able to access each of the slave machines via password-less `ssh` (using a private key). For testing purposes, you can have a single `localhost` entry in the slaves file.
-
-Once you've set up these configuration files, you can launch or stop your cluster with the following shell scripts, based on Hadoop's deploy scripts, and available in `SPARK_HOME/bin`:
-
-- `bin/start-master.sh` - Starts a master instance on the machine the script is executed on.
-- `bin/start-slaves.sh` - Starts a slave instance on each machine specified in the `conf/slaves` file.
-- `bin/start-all.sh` - Starts both a master and a number of slaves as described above.
-- `bin/stop-master.sh` - Stops the master that was started via the `bin/start-master.sh` script.
-- `bin/stop-slaves.sh` - Stops the slave instances that were started via `bin/start-slaves.sh`.
-- `bin/stop-all.sh` - Stops both the master and the slaves as described above.
-
-Note that the scripts must be executed on the machine you want to run the Spark master on, not your local machine.
 
 # Connecting a Job to the Cluster
 
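For readers trying out the workflow this patch documents, here is a minimal sketch of the `conf/slaves` file and the launch scripts described in the new text. The worker hostnames are hypothetical placeholders; the script names and the password-less `ssh` requirement come from the patch itself.

```sh
# Hypothetical two-worker cluster: list one hostname per line in conf/slaves.
# Each host must be reachable from the master via password-less ssh;
# for a quick local test, a single "localhost" line is enough.
cat > conf/slaves <<'EOF'
worker1.example.com
worker2.example.com
EOF

# Run these on the machine that should host the Spark master, from the Spark directory:
bin/start-all.sh   # starts a master here plus one worker per line in conf/slaves
bin/stop-all.sh    # later: stops the master and all of the workers again
```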
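Likewise, a sketch of the optional `conf/spark-env.sh` that the new table documents. The variable names and their meanings are taken from the table above; the concrete values and the `/data/spark-work` path are made-up examples, and the `export` form assumes the file is sourced as an ordinary shell script.

```sh
# conf/spark-env.sh -- start from the template, then edit:
#   cp conf/spark-env.sh.template conf/spark-env.sh
# Copy the finished file to all your worker machines for the settings to take effect.

export SPARK_WORKER_CORES=4               # total cores Spark jobs may use on this machine (example value)
export SPARK_WORKER_MEMORY=4G             # total memory for Spark jobs, in a format like 1000M or 2G
export SPARK_WORKER_DIR=/data/spark-work  # scratch space and job logs (default: SPARK_HOME/work)
export SPARK_WORKER_WEBUI_PORT=8081       # worker web UI port (8081 is already the default)

# Each job's individual memory is still configured separately via SPARK_MEM.
```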
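Finally, the first hunk rewords the `-c`/`-m` options accepted on the worker's command line. If you launch a worker by hand rather than through the scripts, they would be passed roughly as below; the `./run spark.deploy.worker.Worker` invocation and the master URL are assumptions about how this page describes manual launches elsewhere, not something this patch shows.

```sh
# Assumed manual launch of one worker against an existing master (placeholders throughout):
./run spark.deploy.worker.Worker spark://master.example.com:7077 -c 8 -m 16G
```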