path: root/ec2
Commit message | Author | Date | Files | Lines
* [SPARK-12107][EC2] Update spark-ec2 versions | Nicholas Chammas | 2015-12-03 | 1 | -3/+9
I haven't created a JIRA. If we absolutely need one I'll do it, but I'm fine with not getting mentioned in the release notes if that's the only purpose it'll serve. cc marmbrus - We should include this in 1.6-RC2 if there is one. I can open a second PR against branch-1.6 if necessary. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #10109 from nchammas/spark-ec2-versions.
* [SPARK-11991] fixes | Jeremy Derr | 2015-11-26 | 1 | -0/+4
If `--private-ips` is required but not provided, spark_ec2.py may behave inappropriately, including attempting to ssh to localhost when verifying ssh connectivity to the cluster. This fixes that behavior by raising a `UsageError` exception if `get_dns_name` is unable to determine a hostname. Author: Jeremy Derr <jcderr@radius.com> Closes #9975 from jcderr/SPARK-11991/ec_spark.py_hostname_check.
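A minimal sketch of the guard described above, assuming a boto-style instance object; the names and message text are illustrative rather than the exact patch:

```python
class UsageError(Exception):
    pass

def get_dns_name(instance, private_ips=False):
    # Prefer the private IP when --private-ips was given; otherwise the
    # public DNS name. Fail loudly instead of falling back to localhost.
    dns = instance.private_ip_address if private_ips else instance.public_dns_name
    if not dns:
        raise UsageError("Failed to determine hostname of {0}.\n"
                         "Please check that you provided --private-ips if "
                         "necessary".format(instance))
    return dns
```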
* [SPARK-11837][EC2] python3 compatibility for launching ec2 m3 instances | Mortada Mehyar | 2015-11-23 | 1 | -1/+1
This currently breaks on Python 3 because the `string` module no longer has `letters`; `ascii_letters` should be used instead. Author: Mortada Mehyar <mortada.mehyar@gmail.com> Closes #9797 from mortada/python3_fix.
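For illustration, the Python 2/3-portable form of this kind of device-letter lookup; the helper name is hypothetical and the exact call site in spark_ec2.py may differ:

```python
import string

# Python 2's string.letters no longer exists in Python 3; ascii_letters is
# available in both, so the device-name suffix works either way.
def ebs_device_name(index):
    return "/dev/sd" + string.ascii_letters[index + 1]

print(ebs_device_name(0))  # /dev/sdb
```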
* [SPARK-10532][EC2] Added --profile option to specify the name of profile | teramonagi | 2015-10-29 | 1 | -1/+8
AWS "profiles" let you specify which set of credentials to use when initializing a connection to AWS. You can keep multiple sets of credentials in the same credentials file under different profile names; the aws CLI tool, for example, supports this via its own --profile option. http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html Author: teramonagi <teramonagi@gmail.com> Closes #8696 from teramonagi/SPARK-10532.
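A hedged sketch of how a profile name could be passed through to boto; this assumes a boto 2 release new enough to accept `profile_name`, and the real option plumbing in spark_ec2.py may differ:

```python
import boto.ec2

def connect(region, profile=None):
    # profile (e.g. from a --profile option) selects a named section in
    # ~/.aws/credentials; omit it to keep boto's default lookup behaviour.
    if profile:
        return boto.ec2.connect_to_region(region, profile_name=profile)
    return boto.ec2.connect_to_region(region)
```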
* Add 1.5 to master branch EC2 scripts | Shivaram Venkataraman | 2015-09-10 | 1 | -2/+6
This change brings it to par with `branch-1.5` (and the 1.5.0 release). Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #8704 from shivaram/ec2-1.5-update.
* [SPARK-9562] Change reference to amplab/spark-ec2 from mesos/ | Shivaram Venkataraman | 2015-08-04 | 1 | -1/+1
cc srowen pwendell nchammas Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #7899 from shivaram/spark-ec2-move and squashes the following commits: 7cc22c9 [Shivaram Venkataraman] Change reference to amplab/spark-ec2 from mesos/
* [EC2] Cosmetic fix for usage of spark-ec2 --ebs-vol-num option | Kenichi Maehashi | 2015-07-28 | 1 | -1/+1
The last line of the usage seems ugly.

```
$ spark-ec2 --help
<snip>
  --ebs-vol-num=EBS_VOL_NUM
                        Number of EBS volumes to attach to each node as
                        /vol[x]. The volumes will be deleted when the
                        instances terminate. Only possible on EBS-backed
                        AMIs. EBS volumes are only attached if
                        --ebs-vol-size > 0.Only support up to 8 EBS volumes.
```

After applying this patch:

```
$ spark-ec2 --help
<snip>
  --ebs-vol-num=EBS_VOL_NUM
                        Number of EBS volumes to attach to each node as
                        /vol[x]. The volumes will be deleted when the
                        instances terminate. Only possible on EBS-backed
                        AMIs. EBS volumes are only attached if
                        --ebs-vol-size > 0. Only support up to 8 EBS volumes.
```

As this is a trivial thing I didn't create JIRA for this.

Author: Kenichi Maehashi <webmaster@kenichimaehashi.com> Closes #7632 from kmaehashi/spark-ec2-cosmetic-fix and squashes the following commits: 526c118 [Kenichi Maehashi] cosmetic fix for spark-ec2 --ebs-vol-num option usage
* [SPARK-8596] Add module for rstudio link to spark | Vincent D. Warmerdam | 2015-07-13 | 1 | -1/+1
shivaram, added module for rstudio install. Author: Vincent D. Warmerdam <vincentwarmerdam@gmail.com> Closes #7366 from koaning/rstudio-install and squashes the following commits: e47c2da [Vincent D. Warmerdam] added rstudio module
* [SPARK-8863] [EC2] Check aws access key from aws credentials if there is no boto config | JPark | 2015-07-09 | 1 | -8/+10
'spark_ec2.py' uses boto to control EC2, and boto can read '~/.aws/credentials', which is the AWS CLI's default configuration file. From the boto documentation: "A boto config file is a text file formatted like an .ini configuration file that specifies values for options that control the behavior of the boto library. In Unix/Linux systems, on startup, the boto library looks for configuration files in the following locations and in the following order:
- /etc/boto.cfg - for site-wide settings that all users on this machine will use
- (if profile is given) ~/.aws/credentials - for credentials shared between SDKs
- (if profile is given) ~/.boto - for user-specific settings
- ~/.aws/credentials - for credentials shared between SDKs
- ~/.boto - for user-specific settings"
ref of boto: http://boto.readthedocs.org/en/latest/boto_config_tut.html ref of aws cli: http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html However, 'spark_ec2.py' only checked the boto config and environment variables and terminated even when '~/.aws/credentials' was present, so it now also checks '~/.aws/credentials'. cc rxin Jira: https://issues.apache.org/jira/browse/SPARK-8863 Author: JPark <JPark@JPark.me> Closes #7252 from JuhongPark/master and squashes the following commits: 23c5792 [JPark] Check aws access key from aws credentials if there is no boto config
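A rough sketch of the fallback idea, not the exact patch, using the standard library config parser to read the shared credentials file:

```python
import os

try:
    from configparser import ConfigParser                       # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser   # Python 2

def keys_from_aws_credentials(profile="default"):
    # Returns (access_key, secret_key) from ~/.aws/credentials, or None if
    # the file or the profile section is missing.
    path = os.path.expanduser("~/.aws/credentials")
    if not os.path.isfile(path):
        return None
    parser = ConfigParser()
    parser.read(path)
    if not parser.has_section(profile):
        return None
    return (parser.get(profile, "aws_access_key_id"),
            parser.get(profile, "aws_secret_access_key"))
```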
* [SPARK-8902] Correctly print hostname in error | Daniel Darabos | 2015-07-09 | 1 | -2/+2
With "+" the strings are separate expressions, and format() is called only on the last string before concatenation, so the substitution does not happen. Without "+" the string literals are merged first by the parser, so format() is called on the complete string. Should I make a JIRA for this? Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #7288 from darabos/patch-2 and squashes the following commits: be0d3b7 [Daniel Darabos] Correctly print hostname in error
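A small self-contained illustration of the pitfall; the message text is made up, not the exact string from spark_ec2.py:

```python
host = "ec2-54-0-0-1.compute-1.amazonaws.com"

# Buggy: "+" splits this into two expressions, so .format(host) binds only to
# the second literal and the {0} in the first one is printed verbatim.
print("SSH connection error on host {0}.\n" +
      "Check your --identity-file and network settings.".format(host))

# Fixed: adjacent string literals are merged by the parser into one string,
# so .format(host) substitutes {0} as intended.
print("SSH connection error on host {0}.\n"
      "Check your --identity-file and network settings.".format(host))
```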
* [SPARK-8821] [EC2] Switched to binary mode for file reading | Simon Hafner | 2015-07-07 | 1 | -1/+1
Otherwise the script will crash with:

    - Downloading boto...
    Traceback (most recent call last):
      File "ec2/spark_ec2.py", line 148, in <module>
        setup_external_libs(external_libs)
      File "ec2/spark_ec2.py", line 128, in setup_external_libs
        if hashlib.md5(tar.read()).hexdigest() != lib["md5"]:
      File "/usr/lib/python3.4/codecs.py", line 319, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

in case of a UTF-8 env setting. Author: Simon Hafner <hafnersimon@gmail.com> Closes #7215 from reactormonk/branch-1.4 and squashes the following commits: e86957a [Simon Hafner] [SPARK-8821] [EC2] Switched to binary mode (cherry picked from commit 83a621a5a8f8a2991c4cfa687279589e5c623d46) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
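A minimal sketch of the binary-mode fix around the md5 check; the helper name is illustrative:

```python
import hashlib

def md5_matches(path, expected_md5):
    # "rb" keeps Python 3 from decoding the tarball as UTF-8 text before
    # hashing, which is what triggered the UnicodeDecodeError above.
    with open(path, "rb") as tar:
        return hashlib.md5(tar.read()).hexdigest() == expected_md5
```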
* [SPARK-8596] [EC2] Added port for Rstudio | Vincent D. Warmerdam | 2015-06-28 | 1 | -0/+2
This would otherwise need to be set manually by R users in AWS. https://issues.apache.org/jira/browse/SPARK-8596 Author: Vincent D. Warmerdam <vincentwarmerdam@gmail.com> Author: vincent <vincentwarmerdam@gmail.com> Closes #7068 from koaning/rstudio-port-number and squashes the following commits: ac8100d [vincent] Update spark_ec2.py ce6ad88 [Vincent D. Warmerdam] added port number for rstudio
* [SPARK-8576] Add spark-ec2 options to set IAM roles and instance-initiated shutdown behavior | Nicholas Chammas | 2015-06-24 | 1 | -21/+35
Both of these options are useful when spark-ec2 is being used as part of an automated pipeline and the engineers want to minimize the need to pass around AWS keys for access to things like S3 (keys are replaced by the IAM role) and to be able to launch a cluster that can terminate itself cleanly. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #6962 from nchammas/additional-ec2-options and squashes the following commits: fcf252e [Nicholas Chammas] PEP8 fixes efba9ee [Nicholas Chammas] add help for --instance-initiated-shutdown-behavior 598aecf [Nicholas Chammas] option to launch instances into IAM role 2743632 [Nicholas Chammas] add option for instance initiated shutdown
* [SPARK-8482] Added M4 instances to the list. | Pradeep Chhetri | 2015-06-22 | 1 | -2/+14
AWS recently added M4 instances (https://aws.amazon.com/blogs/aws/the-new-m4-instance-type-bonus-price-reduction-on-m3-c4/). Author: Pradeep Chhetri <pradeep.chhetri89@gmail.com> Closes #6899 from pradeepchhetri/master and squashes the following commits: 4f4ea79 [Pradeep Chhetri] Added t2.large instance 3d2bb6c [Pradeep Chhetri] Added M4 instances to the list
* [SPARK-8429] [EC2] Add ability to set additional tags | Stefano Parmesan | 2015-06-22 | 1 | -8/+20
Add the `--additional-tags` parameter, which allows setting additional tags on all the created instances (masters and slaves). The user can specify multiple tags by separating them with a comma (`,`), while each tag name and value are separated by a colon (`:`); for example, `Task:MySparkProject,Env:production` would add two tags, `Task` and `Env`, with the given values. Author: Stefano Parmesan <s.parmesan@gmail.com> Closes #6857 from armisael/patch-1 and squashes the following commits: c5ac92c [Stefano Parmesan] python style (pep8) 8e614f1 [Stefano Parmesan] Set multiple tags in a single request bfc56af [Stefano Parmesan] Address SPARK-7900 by inceasing sleep time daf8615 [Stefano Parmesan] Add ability to set additional tags
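A small sketch of parsing the documented `Name:Value,Name:Value` format; the actual option handling in spark_ec2.py may differ in details:

```python
def parse_additional_tags(spec):
    # "Task:MySparkProject,Env:production" ->
    #     {"Task": "MySparkProject", "Env": "production"}
    tags = {}
    if spec:
        for pair in spec.split(","):
            name, value = pair.split(":", 1)
            tags[name.strip()] = value.strip()
    return tags

print(parse_additional_tags("Task:MySparkProject,Env:production"))
```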
* [SPARK-8322] [EC2] Added spark 1.4.0 into the VALID_SPARK_VERSIONS and SPARK_TACHYON_MAP | Mark Smith | 2015-06-12 | 1 | -0/+2
This contribution is my original work and I license the work to the project under the project's open source license. Author: Mark Smith <mark.smith@bronto.com> Closes #6776 from markmsmith/SPARK-8322 and squashes the following commits: d744244 [Mark Smith] [SPARK-8322][EC2] Fixed tachyon mapp entry to point to 0.6.4 e4f14d3 [Mark Smith] [SPARK-8322][EC2] Added spark 1.4.0 into the VALID_SPARK_VERSIONS and SPARK_TACHYON_MAP
* [SPARK-8310] [EC2] Updates the master branch EC2 versions | Shivaram Venkataraman | 2015-06-11 | 1 | -2/+2
Will send another PR for `branch-1.4`. Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6764 from shivaram/SPARK-8310 and squashes the following commits: d8cd3b3 [Shivaram Venkataraman] This updates the master branch EC2 versions
* [SPARK-3674] [EC2] Clear SPARK_WORKER_INSTANCES when using YARN | Shivaram Venkataraman | 2015-06-03 | 1 | -3/+10
cc andrewor14 Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6424 from shivaram/spark-worker-instances-yarn-ec2 and squashes the following commits: db244ae [Shivaram Venkataraman] Make Python Lint happy 0593d1b [Shivaram Venkataraman] Clear SPARK_WORKER_INSTANCES when using YARN
* [SPARK-3674] YARN support in Spark EC2 | Shivaram Venkataraman | 2015-05-26 | 1 | -0/+2
This corresponds to https://github.com/mesos/spark-ec2/pull/116 in the spark-ec2 repo. The only change required in the spark_ec2.py script is to open the RM port. cc andrewor14 Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6376 from shivaram/spark-ec2-yarn and squashes the following commits: 961504a [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into spark-ec2-yarn 152c94c [Shivaram Venkataraman] Open 8088 for YARN in EC2
* [SPARK-7806][EC2] Fixes that allow the spark_ec2.py tool to run with Python3 | meawoppl | 2015-05-26 | 1 | -5/+9
I have used this script to launch, destroy, start, and stop clusters successfully. Author: meawoppl <meawoppl@gmail.com> Closes #6336 from meawoppl/py3ec2spark and squashes the following commits: 2e87046 [meawoppl] Py3 compat fixes.
* [SPARK-6246] [EC2] fixed support for more than 100 nodes | alyaxey | 2015-05-19 | 1 | -1/+5
This is a small fix, but it is important for Amazon users because, as the ticket states, "spark-ec2 can't handle clusters with > 100 nodes" now. Author: alyaxey <oleksii.sliusarenko@grammarly.com> Closes #6267 from alyaxey/ec2_100_nodes_fix and squashes the following commits: 1e0d747 [alyaxey] [SPARK-6246] fixed support for more than 100 nodes
* [MINOR] Add 1.3, 1.3.1 to master branch EC2 scripts | Shivaram Venkataraman | 2015-05-17 | 1 | -1/+5
cc pwendell P.S: I can't believe this was outdated all along? Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #6215 from shivaram/update-ec2-map and squashes the following commits: ae3937a [Shivaram Venkataraman] Add 1.3, 1.3.1 to master branch EC2 scripts
* updated ec2 instance types | Brendan Collins | 2015-05-08 | 1 | -23/+47
I needed to run some d2 instances, so I updated the spark_ec2.py accordingly. Author: Brendan Collins <bcollins@blueraster.com> Closes #6014 from brendancol/ec2-instance-types-update and squashes the following commits: d7b4191 [Brendan Collins] Merge branch 'ec2-instance-types-update' of github.com:brendancol/spark into ec2-instance-types-update 6366c45 [Brendan Collins] added back cc1.4xlarge fc2931f [Brendan Collins] updated ec2 instance types 80c2aa6 [Brendan Collins] vertically aligned whitespace 85c6236 [Brendan Collins] vertically aligned whitespace 1657c26 [Brendan Collins] updated ec2 instance types
* [SPARK-4897] [PySpark] Python 3 support | Davies Liu | 2015-04-16 | 1 | -131/+131
This PR updates PySpark to support Python 3 (tested with 3.4). Known issue: unpickle array from Pyrolite is broken in Python 3, so those tests are skipped. TODO: ec2/spark-ec2.py is not fully tested with python3. Author: Davies Liu <davies@databricks.com> Author: twneale <twneale@gmail.com> Author: Josh Rosen <joshrosen@databricks.com> Closes #5173 from davies/python3 and squashes the following commits: d7d6323 [Davies Liu] fix tests 6c52a98 [Davies Liu] fix mllib test 99e334f [Davies Liu] update timeout b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 cafd5ec [Davies Liu] adddress comments from @mengxr bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 179fc8d [Davies Liu] tuning flaky tests 8c8b957 [Davies Liu] fix ResourceWarning in Python 3 5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 4006829 [Davies Liu] fix test 2fc0066 [Davies Liu] add python3 path 71535e9 [Davies Liu] fix xrange and divide 5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 ed498c8 [Davies Liu] fix compatibility with python 3 820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 ad7c374 [Davies Liu] fix mllib test and warning ef1fc2f [Davies Liu] fix tests 4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 59bb492 [Davies Liu] fix tests 1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 ca0fdd3 [Davies Liu] fix code style 9563a15 [Davies Liu] add imap back for python 2 0b1ec04 [Davies Liu] make python examples work with Python 3 d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 a716d34 [Davies Liu] test with python 3.4 f1700e8 [Davies Liu] fix test in python3 671b1db [Davies Liu] fix test in python3 692ff47 [Davies Liu] fix flaky test 7b9699f [Davies Liu] invalidate import cache for Python 3.3+ 9c58497 [Davies Liu] fix kill worker 309bfbf [Davies Liu] keep compatibility 5707476 [Davies Liu] cleanup, fix hash of string in 3.3+ 8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3 f53e1f0 [Davies Liu] fix tests 70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3 a39167e [Davies Liu] support customize class in __main__ 814c77b [Davies Liu] run unittests with python 3 7f4476e [Davies Liu] mllib tests passed d737924 [Davies Liu] pass ml tests 375ea17 [Davies Liu] SQL tests pass 6cc42a9 [Davies Liu] rename 431a8de [Davies Liu] streaming tests pass 78901a7 [Davies Liu] fix hash of serializer in Python 3 24b2f2e [Davies Liu] pass all RDD tests 35f48fe [Davies Liu] run future again 1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py 6e3c21d [Davies Liu] make cloudpickle work with Python3 2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run 1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out 7354371 [twneale] buffer --> memoryview I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work. b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?). f40d925 [twneale] xrange --> range e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206 79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper 2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3 854be27 [Josh Rosen] Run `futurize` on Python code: 7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
* [SPARK-5242]: Add --private-ips flag to EC2 script | Michelangelo D'Agostino | 2015-04-08 | 1 | -17/+47
The `spark_ec2.py` script currently references the `ip_address` and `public_dns_name` attributes of an instance. On private networks, these fields aren't set, so we have problems. This PR introduces a `--private-ips` flag that instead refers to the `private_ip_address` attribute in both cases. Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com> Closes #5244 from mdagost/ec2_private_nets and squashes the following commits: b684c67 [Michelangelo D'Agostino] STY: A few python lint changes. a4a2eac [Michelangelo D'Agostino] ENH: Fix IP's typo and refactor conditional logic into functions. c004604 [Michelangelo D'Agostino] ENH: Add --private-ips flag.
* [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py | Matt Aasted | 2015-04-06 | 1 | -1/+1
The spark_ec2.py script uses public_dns_name everywhere in the script except for testing ssh availability, which is done using the public ip address of the instances. This breaks the script for users who are deploying the cluster with a private-network-only security group. The fix is to use public_dns_name in the remaining place. Author: Matt Aasted <aasted@twitch.tv> Closes #5302 from aasted/master and squashes the following commits: 60cf6ee [Matt Aasted] [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
* [EC2] [SPARK-6600] Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway | Florian Verhein | 2015-04-01 | 1 | -0/+7
Authorizes incoming access to the master on the ports required to use the Hadoop HDFS NFS gateway from outside the cluster. Author: Florian Verhein <florian.verhein@gmail.com> Closes #5257 from florianverhein/master and squashes the following commits: 72a586a [Florian Verhein] [EC2] [SPARK-6600] initial impl
* [SPARK-6219] [Build] Check that Python code compiles | Nicholas Chammas | 2015-03-19 | 1 | -2/+2
This PR expands the Python lint checks so that they check for obvious compilation errors in our Python code. For example:

```
$ ./dev/lint-python
Python lint checks failed.
Compiling ./ec2/spark_ec2.py ...
  File "./ec2/spark_ec2.py", line 618
    return (master_nodes,, slave_nodes)
                          ^
SyntaxError: invalid syntax

./ec2/spark_ec2.py:618:25: E231 missing whitespace after ','
./ec2/spark_ec2.py:1117:101: E501 line too long (102 > 100 characters)
```

This PR also bumps up the version of `pep8`. It ignores new types of checks introduced by that version bump while fixing problems missed by the older version of `pep8` we were using.

Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4941 from nchammas/compile-spark-ec2 and squashes the following commits: 75e31d8 [Nicholas Chammas] upgrade pep8 + check compile b33651c [Nicholas Chammas] PEP8 line length
* [SPARK-6402][DOC] - Remove some references to shark in docs and ec2 | Pierre Borckmans | 2015-03-19 | 1 | -1/+0
The EC2 script and job scheduling documentation still referred to Shark. I removed these references. I also removed a remaining `SHARK_VERSION` variable from `ec2-variables.sh`. Author: Pierre Borckmans <pierre.borckmans@realimpactanalytics.com> Closes #5083 from pierre-borckmans/remove_refererences_to_shark_in_docs and squashes the following commits: 4e90ffc [Pierre Borckmans] Removed deprecated SHARK_VERSION caea407 [Pierre Borckmans] Remove shark reference from ec2 script doc 196c744 [Pierre Borckmans] Removed references to Shark
* [SPARK-6186] [EC2] Make Tachyon version configurable in EC2 deployment script | cheng chang | 2015-03-10 | 2 | -1/+21
This PR comes from the Tachyon community to solve the issue: https://tachyon.atlassian.net/browse/TACHYON-11 An accompanying PR is in mesos/spark-ec2: https://github.com/mesos/spark-ec2/pull/101 Author: cheng chang <myairia@gmail.com> Closes #4901 from uronce-cc/master and squashes the following commits: 313aa36 [cheng chang] minor re-wording fd2a48e [cheng chang] Remove Tachyon when deploying through git hash 1d53c5c [cheng chang] add default value to --tachyon-version 6f8887e [cheng chang] make tachyon version configurable
* [SPARK-6191] [EC2] Generalize ability to download libs | Nicholas Chammas | 2015-03-10 | 1 | -28/+54
Right now we have a method to specifically download boto. This PR generalizes it so it's easy to download additional libraries if we want. For example, adding new external libraries for spark-ec2 is now as simple as:

```python
external_libs = [
    {
        "name": "boto",
        "version": "2.34.0",
        "md5": "5556223d2d0cc4d06dd4829e671dcecd"
    },
    {
        "name": "PyYAML",
        "version": "3.11",
        "md5": "f50e08ef0fe55178479d3a618efe21db"
    },
    {
        "name": "argparse",
        "version": "1.3.0",
        "md5": "9bcf7f612190885c8c85e30ba41db3c7"
    }
]
```

Likely use cases:
* Downloading PyYAML to allow spark-ec2 configs to be persisted as a YAML file. ([SPARK-925](https://issues.apache.org/jira/browse/SPARK-925))
* Downloading argparse to clean up / modernize our option parsing.

First run output, with PyYAML and argparse added just for demonstration purposes:

```shell
$ ./spark-ec2 --version
Downloading external libraries that spark-ec2 needs from PyPI to /path/to/spark/ec2/lib...
This should be a one-time operation.
 - Downloading boto...
 - Finished downloading boto.
 - Downloading PyYAML...
 - Finished downloading PyYAML.
 - Downloading argparse...
 - Finished downloading argparse.
spark-ec2 1.2.1
```

Output thereafter:

```shell
$ ./spark-ec2 --version
spark-ec2 1.2.1
```

Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4919 from nchammas/setup-ec2-libs and squashes the following commits: a077955 [Nicholas Chammas] print default region c95fb7d [Nicholas Chammas] to docstring 5448845 [Nicholas Chammas] remove libs added for demo purposes 60d8c23 [Nicholas Chammas] generalize ability to download libs
* [EC2] [SPARK-6188] Instance types can be mislabeled when re-starting cluster with default arguments | Theodore Vasiloudis | 2015-03-09 | 1 | -0/+11
As described in https://issues.apache.org/jira/browse/SPARK-6188 and discovered in https://issues.apache.org/jira/browse/SPARK-5838. When re-starting a cluster, if the user does not provide the instance types, which is the recommended behavior in the docs currently, the instances will be assigned the default type m1.large. This then affects the setup of the machines. This solves that by getting the instance types from the existing instances and overwriting the default options. EDIT: Further clarification of the issue: in short, while the instances themselves are the same as launched, their setup is done assuming the default instance type, m1.large. This means that the machines are assumed to have 2 disks, and that leads to the problems described in issue [5838](https://issues.apache.org/jira/browse/SPARK-5838), where machines that have one disk end up having shuffle spills in the small (8GB) snapshot partitions that quickly fill up and result in failing jobs due to "No space left on device" errors. Other instance-specific settings made in the spark_ec2.py script are likely to be wrong as well. Author: Theodore Vasiloudis <thvasilo@users.noreply.github.com> Author: Theodore Vasiloudis <tvas@sics.se> Closes #4916 from thvasilo/SPARK-6188]-Instance-types-can-be-mislabeled-when-re-starting-cluster-with-default-arguments and squashes the following commits: 6705b98 [Theodore Vasiloudis] Added comment to clarify setting master instance type to the empty string. a3d29fe [Theodore Vasiloudis] More trailing whitespace 7b32429 [Theodore Vasiloudis] Removed trailing whitespace 3ebd52a [Theodore Vasiloudis] Make sure that the instance type is correct when relaunching a cluster.
* [SPARK-6193] [EC2] Push group filter up to EC2 | Nicholas Chammas | 2015-03-08 | 1 | -37/+41
When looking for a cluster, spark-ec2 currently pulls down [info for all instances](https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620) and filters locally. When working on an AWS account with hundreds of active instances, this step alone can take over 10 seconds. This PR improves how spark-ec2 searches for clusters by pushing the filter up to EC2. Basically, the problem (and solution) look like this:

```python
>>> timeit.timeit('blah = conn.get_all_reservations()', setup='from __main__ import conn', number=10)
116.96390509605408
>>> timeit.timeit('blah = conn.get_all_reservations(filters={"instance.group-name": ["my-cluster-master"]})', setup='from __main__ import conn', number=10)
4.629754066467285
```

Translated to a user-visible action, this looks like (against an AWS account with ~200 active instances):

```shell
# master
$ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
...
3 loops, best of 3: 9.83 sec per loop

# this PR
$ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
...
3 loops, best of 3: 1.47 sec per loop
```

This PR also refactors `get_existing_cluster()` to make it, I hope, simpler. Finally, this PR fixes some minor grammar issues related to printing status to the user. :tophat: :clap:

Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4922 from nchammas/get-existing-cluster-faster and squashes the following commits: 18802f1 [Nicholas Chammas] ignore shutting-down f2a5b9f [Nicholas Chammas] fix grammar d96a489 [Nicholas Chammas] push group filter up to EC2
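A hedged sketch of the server-side filtering idea with the boto 2 API; the `-master`/`-slaves` group-name convention shown here is illustrative:

```python
import boto.ec2

def find_cluster_instances(conn, cluster_name):
    # Ask EC2 to filter by security group name instead of fetching every
    # reservation in the account and filtering locally.
    reservations = conn.get_all_reservations(
        filters={"instance.group-name": [cluster_name + "-master",
                                         cluster_name + "-slaves"]})
    return [inst for res in reservations for inst in res.instances]

conn = boto.ec2.connect_to_region("us-west-2")
print(len(find_cluster_instances(conn, "my-cluster")))
```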
* [SPARK-5641] [EC2] Allow spark_ec2.py to copy arbitrary files to cluster | Florian Verhein | 2015-03-07 | 1 | -0/+42
Give users an easy way to rcp a directory structure to the master's / as part of the cluster launch, at a useful point in the workflow (before setup.sh is called on the master). This is an alternative approach to meeting requirements discussed in https://github.com/apache/spark/pull/4487 Author: Florian Verhein <florian.verhein@gmail.com> Closes #4583 from florianverhein/master and squashes the following commits: 49dee88 [Florian Verhein] removed addition of trailing / in rsync to give user this option, added documentation in help 7b8e3d8 [Florian Verhein] remove unused args 87d922c [Florian Verhein] [SPARK-5641] [EC2] implement --deploy-root-dir
* [EC2] Reorder print statements on termination | Nicholas Chammas | 2015-03-07 | 1 | -6/+8
The PR reorders some print statements slightly on cluster termination so that they read better. For example, from this:

```
Are you sure you want to destroy the cluster spark-cluster-test?
The following instances will be terminated:
Searching for existing cluster spark-cluster-test in region us-west-2...
Found 1 master(s), 2 slaves
> ...
ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster spark-cluster-test (y/N):
```

To this:

```
Searching for existing cluster spark-cluster-test in region us-west-2...
Found 1 master(s), 2 slaves
The following instances will be terminated:
> ...
ALL DATA ON ALL NODES WILL BE LOST!!
Are you sure you want to destroy the cluster spark-cluster-test? (y/N)
```

Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4932 from nchammas/termination-print-order and squashes the following commits: c23711d [Nicholas Chammas] reorder prints on termination
* [SPARK-5335] Fix deletion of security groups within a VPC | Vladimir Grigor | 2015-02-12 | 1 | -3/+4
Please see https://issues.apache.org/jira/browse/SPARK-5335. The fix itself is in commit e58a8b01a8bedcbfbbc6d04b1c1489255865cf87. Two earlier commits are fixes for another VPC-related bug waiting to be merged; I should have created the former bug fix in its own branch, then this fix would not have included them. :( This code is released under the project's license. Author: Vladimir Grigor <vladimir@kiosked.com> Author: Vladimir Grigor <vladimir@voukka.com> Closes #4122 from voukka/SPARK-5335_delete_sg_vpc and squashes the following commits: 090dca9 [Vladimir Grigor] fixes as per review: removed printing of group_id and added comment 730ec05 [Vladimir Grigor] fix for SPARK-5335: Destroying cluster in VPC with "--delete-groups" fails to remove security groups
* [EC2] Update default Spark version to 1.2.1 | Katsunori Kanda | 2015-02-12 | 1 | -1/+2
Author: Katsunori Kanda <potix2@gmail.com> Closes #4566 from potix2/ec2-update-version-1-2-1 and squashes the following commits: 77e7840 [Katsunori Kanda] [EC2] Update default Spark version to 1.2.1
* [SPARK-5668] Display region in spark_ec2.py get_existing_cluster() | Miguel Peralvo | 2015-02-10 | 1 | -4/+7
Show the region for the different messages displayed by get_existing_cluster(): the search, found and error messages. Author: Miguel Peralvo <miguel.peralvo@gmail.com> Closes #4457 from MiguelPeralvo/patch-2 and squashes the following commits: a5514c8 [Miguel Peralvo] Update spark_ec2.py 0a837b0 [Miguel Peralvo] Update spark_ec2.py 3923f36 [Miguel Peralvo] Update spark_ec2.py 4ecd9f9 [Miguel Peralvo] [SPARK-5668] Display region in spark_ec2.py get_existing_cluster()
* [SPARK-1805] [EC2] Validate instance types | Nicholas Chammas | 2015-02-10 | 1 | -51/+81
Addresses [SPARK-1805](https://issues.apache.org/jira/browse/SPARK-1805), though it doesn't resolve it completely. Error out quickly if the user asks for the master and slaves to have different AMI virtualization types, since we don't currently support that. In addition to that, we print warnings if the inputted instance types are not recognized, though I would prefer if we errored out. Elsewhere in the script it seems [we allow unrecognized instance types](https://github.com/apache/spark/blob/5de14cc2763a8211f77eeb55940dec025822eb78/ec2/spark_ec2.py#L331), though I think we should remove that. It's messy, but it should serve us until we enhance spark-ec2 to support clusters with mixed virtualization types. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4455 from nchammas/ec2-master-slave-different-virtualization and squashes the following commits: ce28609 [Nicholas Chammas] fix style b0adba0 [Nicholas Chammas] validate input instance types
* [SPARK-5611] [EC2] Allow spark-ec2 repo and branch to be set on CLI of spark_ec2.py, and by extension, the ami-list | Florian Verhein | 2015-02-09 | 1 | -5/+32
Useful for using alternate spark-ec2 repos or branches. Author: Florian Verhein <florian.verhein@gmail.com> Closes #4385 from florianverhein/master and squashes the following commits: 7e2b4be [Florian Verhein] [SPARK-5611] [EC2] typo 8b653dc [Florian Verhein] [SPARK-5611] [EC2] Enforce only supporting spark-ec2 forks from github, log improvement bc4b0ed [Florian Verhein] [SPARK-5611] allow spark-ec2 repos with different names 8b5c551 [Florian Verhein] improve option naming, fix logging, fix lint failing, add guard to enforce spark-ec2 7724308 [Florian Verhein] [SPARK-5611] [EC2] fixes b42b68c [Florian Verhein] [SPARK-5611] [EC2] Allow spark-ec2 repo and branch to be set on CLI of spark_ec2.py
* [SPARK-5473] [EC2] Expose SSH failures after status checks pass | Nicholas Chammas | 2015-02-09 | 1 | -12/+24
If there is some fatal problem with launching a cluster, `spark-ec2` just hangs without giving the user useful feedback on what the problem is. This PR exposes the output of the SSH calls to the user if the SSH test fails during cluster launch for any reason but the instance status checks are all green. It also removes the growing trail of dots while waiting in favor of a fixed 3 dots. For example:

```
$ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type m3.medium --slaves 1 --zone us-east-1c launch "spark-test"
Setting up security groups...
Searching for existing cluster spark-test...
Spark AMI: ami-35b1885c
Launching instances...
Launched 1 slaves in us-east-1c, regid = r-7dadd096
Launched master in us-east-1c, regid = r-fcadd017
Waiting for cluster to enter 'ssh-ready' state...

Warning: SSH connection error. (This could be temporary.)
Host: 127.0.0.1
SSH return code: 255
SSH output: Warning: Identity file /incorrect/path/identity.pem not accessible: No such file or directory. Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts. Permission denied (publickey).
```

This should give users enough information when some unrecoverable error occurs during launch so they can know to abort the launch. This will help avoid situations like the ones reported [here on Stack Overflow](http://stackoverflow.com/q/28002443/) and [here on the user list](http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3C1422323829398-21381.postn3.nabble.com%3E), where the users couldn't tell what the problem was because it was being hidden by `spark-ec2`. This is a usability improvement that should be backported to 1.2. Resolves [SPARK-5473](https://issues.apache.org/jira/browse/SPARK-5473).

Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4262 from nchammas/expose-ssh-failure and squashes the following commits: 8bda6ed [Nicholas Chammas] default to print SSH output 2b92534 [Nicholas Chammas] show SSH output after status check pass
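A rough sketch of an SSH probe that surfaces the return code and output on failure, in the spirit of the change above; the command options and message format are illustrative, not the exact spark-ec2 strings:

```python
import subprocess

def check_ssh(host, identity_file, user="root"):
    # Run a trivial command over ssh and report both the return code and the
    # combined stdout/stderr when it fails, instead of silently retrying.
    cmd = ["ssh", "-i", identity_file,
           "-o", "StrictHostKeyChecking=no",
           "-o", "ConnectTimeout=3",
           "{0}@{1}".format(user, host), "true"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    if proc.returncode != 0:
        print("Warning: SSH connection error. (This could be temporary.)")
        print("Host: " + host)
        print("SSH return code: " + str(proc.returncode))
        print("SSH output: " + out.decode("utf-8", "replace").strip())
    return proc.returncode == 0
```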
* [SPARK-5366][EC2] Check the mode of private key | liuchang0812 | 2015-02-08 | 1 | -0/+15
Check the mode of the private key file. Author: liuchang0812 <liuchang0812@gmail.com> Closes #4162 from Liuchang0812/ec2-script and squashes the following commits: fc37355 [liuchang0812] quota file name 01ed464 [liuchang0812] more output ce2a207 [liuchang0812] pep8 f44efd2 [liuchang0812] move code to real_main 8475a54 [liuchang0812] fix bug cd61a1a [liuchang0812] import stat c106cb2 [liuchang0812] fix trivis bug 89c9953 [liuchang0812] more output about checking private key 1177a90 [liuchang0812] remove commet 41188ab [liuchang0812] check the mode of private key
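A minimal sketch of such a permission check, assuming the usual ssh requirement that the key not be readable by group or others; the wording of the error message is illustrative:

```python
import os
import stat
import sys

def check_private_key_mode(identity_file):
    # ssh refuses keys that are readable by group or others, so catch that
    # up front and tell the user how to fix it.
    mode = stat.S_IMODE(os.stat(identity_file).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        sys.stderr.write("ERROR: private key file %s is too open; "
                         "run: chmod 400 %s\n" % (identity_file, identity_file))
        sys.exit(1)
```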
* SPARK-5403: Ignore UserKnownHostsFile in SSH calls | Grzegorz Dubicki | 2015-02-06 | 1 | -0/+1
See https://issues.apache.org/jira/browse/SPARK-5403 Author: Grzegorz Dubicki <grzegorz.dubicki@gmail.com> Closes #4196 from grzegorz-dubicki/SPARK-5403 and squashes the following commits: a7d863f [Grzegorz Dubicki] Resolve start command hanging issue
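For illustration, the relevant ssh options; the helper wrapping them is hypothetical:

```python
def ssh_args(identity_file):
    # UserKnownHostsFile=/dev/null (together with StrictHostKeyChecking=no)
    # stops ssh from failing or prompting when a relaunched cluster presents
    # new host keys for previously seen addresses.
    return ["ssh",
            "-o", "StrictHostKeyChecking=no",
            "-o", "UserKnownHostsFile=/dev/null",
            "-i", identity_file]
```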
* [SPARK-4983] Insert waiting time before tagging EC2 instances | GenTang | 2015-02-06 | 1 | -0/+3
The boto API doesn't support tagging EC2 instances in the same call that launches them. We add a five-second wait so EC2 has enough time to propagate the information so that the tagging can succeed. Author: GenTang <gen.tang86@gmail.com> Author: Gen TANG <gen.tang86@gmail.com> Closes #3986 from GenTang/spark-4983 and squashes the following commits: 13e257d [Gen TANG] modification of comments 47f06755 [GenTang] print the information ab7a931 [GenTang] solve the issus spark-4983 by inserting waiting time 3179737 [GenTang] Revert "handling exceptions about adding tags to ec2" 6a8b53b [GenTang] Revert "the improvement of exception handling" 13e97a6 [GenTang] Revert "typo" 63fd360 [GenTang] typo 692fc2b [GenTang] the improvement of exception handling 6adcf6d [GenTang] handling exceptions about adding tags to ec2
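A small sketch of the workaround described above, using boto 2's `add_tag`; the tag names and helper are illustrative:

```python
import time

def tag_instances(instances, cluster_name):
    # Give EC2 a moment to register the freshly launched instances before
    # tagging them; tagging immediately can fail because the new instance IDs
    # are not yet visible to the tagging API.
    time.sleep(5)
    for i, inst in enumerate(instances):
        inst.add_tag(key="Name", value="{0}-node-{1}".format(cluster_name, i))
```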
* [SPARK-5628] Add version option to spark-ec2 | Nicholas Chammas | 2015-02-06 | 1 | -8/+8
Every proper command line tool should include a `--version` option or something similar. This PR adds this to `spark-ec2` using the standard functionality provided by `optparse`. One thing we don't do here is follow the Python convention of setting `__version__`, since it seems awkward given how `spark-ec2` is laid out. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4414 from nchammas/spark-ec2-show-version and squashes the following commits: 914cab5 [Nicholas Chammas] add version info
* [SPARK-5434] [EC2] Preserve spaces in EC2 path | Nicholas Chammas | 2015-01-28 | 1 | -1/+1
Fixes [SPARK-5434](https://issues.apache.org/jira/browse/SPARK-5434). Simple demonstration of the problem and the fix:

```
$ spacey_path="/path/with some/spaces"
$ dirname $spacey_path
usage: dirname path
$ echo $?
1
$ dirname "$spacey_path"
/path/with some
$ echo $?
0
```

Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4224 from nchammas/patch-1 and squashes the following commits: 960711a [Nicholas Chammas] [EC2] Preserve spaces in EC2 path
* [SPARK-5122] Remove Shark from spark-ec2 | Nicholas Chammas | 2015-01-08 | 1 | -34/+44
I moved the Spark-Shark version map [to the wiki](https://cwiki.apache.org/confluence/display/SPARK/Spark-Shark+version+mapping). This PR has a [matching PR in mesos/spark-ec2](https://github.com/mesos/spark-ec2/pull/89). Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #3939 from nchammas/remove-shark and squashes the following commits: 66e0841 [Nicholas Chammas] fix style ceeab85 [Nicholas Chammas] show default Spark GitHub repo 7270126 [Nicholas Chammas] validate Spark hashes db4935d [Nicholas Chammas] validate spark version upfront fc0d5b9 [Nicholas Chammas] remove Shark
* [EC2] Update mesos/spark-ec2 branch to branch-1.3 | Nicholas Chammas | 2014-12-25 | 1 | -1/+1
Going forward, we'll use matching branch names across the mesos/spark-ec2 and apache/spark repositories, per [the discussion here](https://github.com/mesos/spark-ec2/pull/85#issuecomment-68069589). Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #3804 from nchammas/patch-2 and squashes the following commits: cd2c0d4 [Nicholas Chammas] [EC2] Update mesos/spark-ec2 branch to branch-1.3
* [EC2] Update default Spark version to 1.2.0 | Nicholas Chammas | 2014-12-25 | 1 | -1/+4
Now that 1.2.0 is out, let's update the default Spark version. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #3793 from nchammas/patch-1 and squashes the following commits: 3255832 [Nicholas Chammas] add 1.2.0 version to Spark-Shark map ec0e904 [Nicholas Chammas] [EC2] Update default Spark version to 1.2.0
* [SPARK-4890] Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it | Josh Rosen | 2014-12-19 | 3 | -13/+38
This patch upgrades `spark-ec2`'s Boto version to 2.34.0, since this is blocking several features. Newer versions of Boto don't work properly when they're loaded from a zipfile since they try to read a JSON file from a path relative to the Boto library sources. Therefore, this patch also changes spark-ec2 to automatically download Boto from PyPi if it's not present in `SPARK_EC2_DIR/lib`, similar to what we do in the `sbt/sbt` script. This shouldn't be an issue for users since they already need to have an internet connection to launch an EC2 cluster. By performing the downloading in spark_ec2.py instead of the Bash script, this should also work for Windows users. I've tested this with Python 2.6, too. Author: Josh Rosen <joshrosen@databricks.com> Closes #3737 from JoshRosen/update-boto and squashes the following commits: 0aa43cc [Josh Rosen] Remove unused setup_standalone_cluster() method. f02935d [Josh Rosen] Enable Python deprecation warnings and fix one Boto warning: 587ae89 [Josh Rosen] [SPARK-4890] Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it