Diffstat (limited to 'docs/ec2-scripts.md')
-rw-r--r--  docs/ec2-scripts.md  159
1 files changed, 159 insertions, 0 deletions
diff --git a/docs/ec2-scripts.md b/docs/ec2-scripts.md
new file mode 100644
index 0000000000..6e1f7fd3b1
--- /dev/null
+++ b/docs/ec2-scripts.md
@@ -0,0 +1,159 @@
+---
+layout: global
+title: Running Spark on EC2
+---
+
+The `spark-ec2` script, located in Spark's `ec2` directory, allows you
+to launch, manage and shut down Spark clusters on Amazon EC2. It automatically sets up Mesos, Spark and HDFS
+on the cluster for you.
+This guide describes how to use `spark-ec2` to launch clusters, how to run jobs on them, and how to shut them down.
+It assumes you've already signed up for an EC2 account on the [Amazon Web Services site](http://aws.amazon.com/).
+
+`spark-ec2` is designed to manage multiple named clusters. You can
+launch a new cluster (telling the script its size and giving it a name),
+shutdown an existing cluster, or log into a cluster. Each cluster is
+identified by placing its machines into EC2 security groups whose names
+are derived from the name of the cluster. For example, a cluster named
+`test` will contain a master node in a security group called
+`test-master`, and a number of slave nodes in a security group called
+`test-slaves`. The `spark-ec2` script will create these security groups
+for you based on the cluster name you request. You can also use them to
+identify machines belonging to each cluster in the Amazon EC2 Console.
+
+
+# Before You Start
+
+- Create an Amazon EC2 key pair for yourself. This can be done by
+ logging into your Amazon Web Services account through the [AWS
+ console](http://aws.amazon.com/console/), clicking Key Pairs on the
+ left sidebar, and creating and downloading a key. Make sure that you
+ set the permissions for the private key file to `600` (i.e. only you
+ can read and write it) so that `ssh` will work.
+- Whenever you want to use the `spark-ec2` script, set the environment
+ variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to your
+ Amazon EC2 access key ID and secret access key. These can be
+ obtained from the [AWS homepage](http://aws.amazon.com/) by clicking
+ Account \> Security Credentials \> Access Credentials.
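+
+For example, a minimal setup might look like the following (the key file path
+and the credential values are placeholders for your own):
+
+```
+# Restrict the private key so that ssh will accept it
+chmod 600 ~/my-keypair.pem
+
+# Credentials read by the spark-ec2 script
+export AWS_ACCESS_KEY_ID=<your-access-key-id>
+export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
+```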
+
+# Launching a Cluster
+
+- Go into the `ec2` directory in the release of Spark you downloaded.
+- Run
+ `./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>`,
+ where `<keypair>` is the name of your EC2 key pair (that you gave it
+ when you created it), `<key-file>` is the private key file for your
+ key pair, `<num-slaves>` is the number of slave nodes to launch (try
+ 1 at first), and `<cluster-name>` is the name to give to your
+ cluster.
+- After everything launches, check that Mesos is up and sees all the
+ slaves by going to the Mesos Web UI link printed at the end of the
+ script (`http://<master-hostname>:8080`).
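+
+For example, assuming a key pair named `my-keypair` whose private key is saved
+as `~/my-keypair.pem`, the launch step above might look like this for a small
+test cluster:
+
+```
+# Launch a cluster named "test" with one slave node
+./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 1 launch test
+```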
+
+You can also run `./spark-ec2 --help` to see more usage options. The
+following options are worth pointing out:
+
+- `--instance-type=<INSTANCE_TYPE>` can be used to specify an EC2
+instance type to use. For now, the script only supports 64-bit instance
+types, and the default type is `m1.large` (which has 2 cores and 7.5 GB
+RAM). Refer to the Amazon pages about [EC2 instance
+types](http://aws.amazon.com/ec2/instance-types) and [EC2
+pricing](http://aws.amazon.com/ec2/#pricing) for information about other
+instance types.
+- `--zone=<EC2_ZONE>` can be used to specify an EC2 availability zone
+to launch instances in. Sometimes, you will get an error because there
+is not enough capacity in one zone, and you should try to launch in
+another. This happens mostly with the `m1.large` instance types;
+extra-large (both `m1.xlarge` and `c1.xlarge`) instances tend to be more
+available.
+- `--ebs-vol-size=GB` will attach an EBS volume with a given amount
+ of space to each node so that you can have a persistent HDFS cluster
+ on your nodes across cluster restarts (see below).
+- If one of your launches fails due to e.g. not having the right
+permissions on your private key file, you can run `launch` with the
+`--resume` option to restart the setup process on an existing cluster.
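+
+These options can be combined on the `launch` command line. As a sketch (the
+zone, volume size and instance type below are just example values), the
+following launches five extra-large slaves with 50 GB EBS volumes in a
+specific availability zone:
+
+```
+./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 5 \
+  --instance-type=m1.xlarge --zone=us-east-1b --ebs-vol-size=50 \
+  launch test
+```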
+
+# Running Jobs
+
+- Go into the `ec2` directory in the release of Spark you downloaded.
+- Run `./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>` to
+ SSH into the cluster, where `<keypair>` and `<key-file>` are as
+ above. (This is just for convenience; you could also use
+ the EC2 console.)
+- To deploy code or data within your cluster, you can log in and use the
+ provided script `~/mesos-ec2/copy-dir`, which,
+  given a directory path, rsyncs it to the same location on all the slaves.
+- If your job needs to access large datasets, the fastest way to do
+ that is to load them from Amazon S3 or an Amazon EBS device into an
+ instance of the Hadoop Distributed File System (HDFS) on your nodes.
+  The `spark-ec2` script already sets up an HDFS instance for you. It's
+ installed in `/root/ephemeral-hdfs`, and can be accessed using the
+ `bin/hadoop` script in that directory. Note that the data in this
+ HDFS goes away when you stop and restart a machine.
+- There is also a *persistent HDFS* instance in
+  `/root/persistent-hdfs` that will keep data across cluster restarts.
+  Typically each node has relatively little persistent storage space
+  (about 3 GB), but you can use the `--ebs-vol-size` option to
+ `spark-ec2` to attach a persistent EBS volume to each node for
+ storing the persistent HDFS.
+- Finally, if you get errors while running your jobs, look at the slave's logs
+ for that job using the Mesos web UI (`http://<master-hostname>:8080`).
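+
+As a concrete sketch of the steps above (the cluster name, directory and file
+names here are only examples):
+
+```
+# Log into the master node
+./spark-ec2 -k my-keypair -i ~/my-keypair.pem login test
+
+# On the master: replicate a directory of job code/data to all slaves
+~/mesos-ec2/copy-dir /root/my-job
+
+# Load a dataset into the ephemeral HDFS instance and check that it is there
+/root/ephemeral-hdfs/bin/hadoop fs -put /root/my-job/data.txt /data.txt
+/root/ephemeral-hdfs/bin/hadoop fs -ls /
+```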
+
+# Configuration
+
+You can edit `/root/spark/conf/spark-env.sh` on each machine to set Spark configuration options, such
+as JVM options and, most crucially, the amount of memory to use per machine (`SPARK_MEM`).
+This file needs to be copied to **every machine** to reflect the change. The easiest way to do this
+is to use a script we provide called `copy-dir`. First edit your `spark-env.sh` file on the master,
+then run `~/mesos-ec2/copy-dir /root/spark/conf` to rsync it to all the workers.
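+
+For example, to set the per-machine memory and push the change out (the `6g`
+value is only illustrative; pick something appropriate for your instance type):
+
+```
+# On the master: set the amount of memory Spark may use on each machine
+echo 'export SPARK_MEM=6g' >> /root/spark/conf/spark-env.sh
+
+# Copy the edited configuration to all the workers
+~/mesos-ec2/copy-dir /root/spark/conf
+```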
+
+The [configuration guide](configuration.html) describes the available configuration options.
+
+# Terminating a Cluster
+
+***Note that there is no way to recover data on EC2 nodes after shutting
+them down! Make sure you have copied everything important off the nodes
+before stopping them.***
+
+- Go into the `ec2` directory in the release of Spark you downloaded.
+- Run `./spark-ec2 destroy <cluster-name>`.
+
+# Pausing and Restarting Clusters
+
+The `spark-ec2` script also supports pausing a cluster. In this case,
+the VMs are stopped but not terminated, so they
+***lose all data on ephemeral disks*** but keep the data in their
+root partitions and their `persistent-hdfs`. Stopped machines will not
+cost you anything for EC2 instance time, but ***will*** continue to cost
+money for EBS storage.
+
+- To stop one of your clusters, go into the `ec2` directory and run
+`./spark-ec2 stop <cluster-name>`.
+- To restart it later, run
+`./spark-ec2 -i <key-file> start <cluster-name>`.
+- To ultimately destroy the cluster and stop consuming EBS space, run
+`./spark-ec2 destroy <cluster-name>` as described in the previous
+section.
+
+# Limitations
+
+- `spark-ec2` currently only launches machines in the US-East region of EC2.
+  It should not be hard to make it launch VMs in other regions, but you will need
+  to create your own AMIs in them.
+- Support for "cluster compute" nodes is limited -- there's no way to specify a
+ locality group. However, you can launch slave nodes in your
+ `<clusterName>-slaves` group manually and then use `spark-ec2 launch
+ --resume` to start a cluster with them.
+- Support for spot instances is limited.
+
+If you have a patch or suggestion for one of these limitations, feel free to
+[contribute](contributing-to-spark.html) it!
+
+# Using a Newer Spark Version
+
+The Spark EC2 machine images may not come with the latest version of Spark. To use a newer version, run `git pull` in `/root/spark` to pull in the latest version of Spark from Git, and build it using `sbt/sbt compile`. You will also need to copy it to all the other nodes in the cluster using `~/mesos-ec2/copy-dir /root/spark`.
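+
+On the master, that sequence might look roughly like:
+
+```
+cd /root/spark
+git pull                          # pull in the latest Spark source
+sbt/sbt compile                   # rebuild Spark
+~/mesos-ec2/copy-dir /root/spark  # push the new build to all the other nodes
+```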
+
+# Accessing Data in S3
+
+Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form `s3n://<id>:<secret>@<bucket>/path`, where `<id>` is your Amazon access key ID and `<secret>` is your Amazon secret access key. Note that you should escape any `/` characters in the secret key as `%2F`. Full instructions can be found on the [Hadoop S3 page](http://wiki.apache.org/hadoop/AmazonS3).
+
+In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.
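+
+One way to sanity-check the URI format is to list an S3 path with the Hadoop
+client that ships with the ephemeral HDFS install (the bucket and path below
+are placeholders, and any `/` in the secret key must be escaped as `%2F`):
+
+```
+/root/ephemeral-hdfs/bin/hadoop fs -ls 's3n://<id>:<secret>@my-bucket/my-data/'
+```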