-rw-r--r--  docs/ec2-scripts.md                                  2
-rw-r--r--  ec2/deploy.generic/root/spark-ec2/ec2-variables.sh   2
-rwxr-xr-x  ec2/spark_ec2.py                                    10
3 files changed, 13 insertions, 1 deletion
diff --git a/docs/ec2-scripts.md b/docs/ec2-scripts.md
index f5ac6d894e..b2ca6a9b48 100644
--- a/docs/ec2-scripts.md
+++ b/docs/ec2-scripts.md
@@ -156,6 +156,6 @@ If you have a patch or suggestion for one of these limitations, feel free to
# Accessing Data in S3
-Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form `s3n://<bucket>/path`. You will also need to set your Amazon security credentials, either by setting the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` before your program or through `SparkContext.hadoopConfiguration`. Full instructions on S3 access using the Hadoop input libraries can be found on the [Hadoop S3 page](http://wiki.apache.org/hadoop/AmazonS3).
+Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form `s3n://<bucket>/path`. To provide AWS credentials for S3 access, launch the Spark cluster with the option `--copy-aws-credentials`. Full instructions on S3 access using the Hadoop input libraries can be found on the [Hadoop S3 page](http://wiki.apache.org/hadoop/AmazonS3).
In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.
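A minimal sketch of the documented workflow, tying the new flag to an S3 read. The `s3n://` URI scheme and the `--copy-aws-credentials` option come from the diff above; the cluster name, key pair, bucket, and path below are hypothetical placeholders, and the launch command is shown only as a comment.

```python
# Launch the cluster with credential copying enabled (shell command shown
# as a comment; key pair and cluster name are hypothetical):
#   ./spark_ec2.py --copy-aws-credentials -k my-key -i my-key.pem launch my-cluster
#
# Once the credentials have been propagated to the Hadoop configuration,
# a job running on the cluster can read S3 input directly. The bucket and
# path are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="S3ReadExample")
lines = sc.textFile("s3n://my-bucket/path/to/input")
print(lines.count())
sc.stop()
```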
diff --git a/ec2/deploy.generic/root/spark-ec2/ec2-variables.sh b/ec2/deploy.generic/root/spark-ec2/ec2-variables.sh
index 3570891be8..740c267fd9 100644
--- a/ec2/deploy.generic/root/spark-ec2/ec2-variables.sh
+++ b/ec2/deploy.generic/root/spark-ec2/ec2-variables.sh
@@ -30,3 +30,5 @@ export HADOOP_MAJOR_VERSION="{{hadoop_major_version}}"
export SWAP_MB="{{swap}}"
export SPARK_WORKER_INSTANCES="{{spark_worker_instances}}"
export SPARK_MASTER_OPTS="{{spark_master_opts}}"
+export AWS_ACCESS_KEY_ID="{{aws_access_key_id}}"
+export AWS_SECRET_ACCESS_KEY="{{aws_secret_access_key}}"
\ No newline at end of file
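The two new exports make the credentials visible to the setup scripts that run on the master. Those scripts live outside this diff, so the following is only a sketch, under the assumption that they forward the values into Hadoop's `core-site.xml` via the standard `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` properties; the actual spark-ec2 setup scripts may do this differently.

```python
# Hypothetical consumer of ec2-variables.sh: read the exported credentials
# and render them as Hadoop core-site.xml property blocks. This is an
# illustration of the idea, not the real spark-ec2 setup logic.
import os

PROPERTY_TEMPLATE = """  <property>
    <name>{name}</name>
    <value>{value}</value>
  </property>"""

def s3_credential_properties():
    """Return core-site.xml property blocks for any credentials that were set."""
    mapping = {
        "fs.s3n.awsAccessKeyId": os.environ.get("AWS_ACCESS_KEY_ID", ""),
        "fs.s3n.awsSecretAccessKey": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
    }
    return [PROPERTY_TEMPLATE.format(name=k, value=v) for k, v in mapping.items() if v]

if __name__ == "__main__":
    print("\n".join(s3_credential_properties()))
```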
diff --git a/ec2/spark_ec2.py b/ec2/spark_ec2.py
index 5682e96aa8..abac71eaca 100755
--- a/ec2/spark_ec2.py
+++ b/ec2/spark_ec2.py
@@ -158,6 +158,9 @@ def parse_args():
parser.add_option(
"--additional-security-group", type="string", default="",
help="Additional security group to place the machines in")
+ parser.add_option(
+ "--copy-aws-credentials", action="store_true", default=False,
+ help="Add AWS credentials to hadoop configuration to allow Spark to access S3")
(opts, args) = parser.parse_args()
if len(args) != 2:
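The new option is a plain boolean flag: with `action="store_true"` and `default=False`, optparse leaves `opts.copy_aws_credentials` as `False` unless the flag appears on the command line. A standalone sketch of that behaviour, using a hypothetical argument list:

```python
# Illustration of the new flag in isolation; the usage string and the
# argument list passed to parse_args() are hypothetical.
from optparse import OptionParser

parser = OptionParser(usage="spark_ec2.py [options] <action> <cluster_name>")
parser.add_option(
    "--copy-aws-credentials", action="store_true", default=False,
    help="Add AWS credentials to hadoop configuration to allow Spark to access S3")

opts, args = parser.parse_args(["--copy-aws-credentials", "launch", "my-cluster"])
assert opts.copy_aws_credentials is True
assert args == ["launch", "my-cluster"]
```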
@@ -714,6 +717,13 @@ def deploy_files(conn, root_dir, opts, master_nodes, slave_nodes, modules):
"spark_master_opts": opts.master_opts
}
+ if opts.copy_aws_credentials:
+ template_vars["aws_access_key_id"] = conn.aws_access_key_id
+ template_vars["aws_secret_access_key"] = conn.aws_secret_access_key
+ else:
+ template_vars["aws_access_key_id"] = ""
+ template_vars["aws_secret_access_key"] = ""
+
# Create a temp directory in which we will place all the files to be
# deployed after we substitute template parameters in them
tmp_dir = tempfile.mkdtemp()
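Here `template_vars` feeds the `{{...}}` placeholders in the deploy.generic templates, which is how the new `aws_access_key_id` and `aws_secret_access_key` entries end up in ec2-variables.sh on the cluster. A minimal sketch of that substitution mechanism follows; the real implementation in spark_ec2.py may differ in details, and the credential values shown are placeholders.

```python
# Minimal sketch of the {{placeholder}} substitution that deploy_files
# performs on templates such as ec2-variables.sh. Shown only to clarify
# the mechanism; not the actual spark_ec2.py code.
def fill_template(text, template_vars):
    """Replace every {{key}} in text with the corresponding value."""
    for key, value in template_vars.items():
        text = text.replace("{{" + key + "}}", value)
    return text

template = ('export AWS_ACCESS_KEY_ID="{{aws_access_key_id}}"\n'
            'export AWS_SECRET_ACCESS_KEY="{{aws_secret_access_key}}"')
rendered = fill_template(template, {
    "aws_access_key_id": "AKIA-PLACEHOLDER",      # placeholder value
    "aws_secret_access_key": "SECRET-PLACEHOLDER",  # placeholder value
})
print(rendered)
```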