| Commit message | Author | Age | Files | Lines |
| |
This reverts commit adfed7086f10fa8db4eeac7996c84cf98f625e9a.
| |
Defer use of a log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework. The only change is to push one half of the check in the original `if` condition inside it. This is a trivial change that may or may not actually solve a problem, but I think it's all that makes sense to do for SPARK-4147.
Author: Sean Owen <sowen@cloudera.com>
Closes #4190 from srowen/SPARK-4147 and squashes the following commits:
4e99942 [Sean Owen] Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework.
(cherry picked from commit 54e7b456dd56c9e52132154e699abca87563465b)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
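The shape of the change can be pictured with a small sketch (illustrative only — the real code lives in Spark's `Logging` trait): consult the slf4j binding first, and only reference `org.apache.log4j` classes inside that branch, so the log4j classes are never loaded when slf4j is routed elsewhere.

```Scala
import org.slf4j.impl.StaticLoggerBinder

// Sketch: check which slf4j binding is active *before* touching any
// org.apache.log4j class, so callers that route slf4j to logback etc.
// never trigger log4j class loading.
def initializeIfUsingLog4j12(): Unit = {
  val binder = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
  if (binder == "org.slf4j.impl.Log4jLoggerFactory") {
    // Only here is it safe to reference org.apache.log4j.LogManager etc.
  }
}
```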
| |
j.u.c.ConcurrentHashMap is more battle tested.
cc rxin JoshRosen pwendell
Author: Davies Liu <davies@databricks.com>
Closes #4208 from davies/safe-conf and squashes the following commits:
c2182dc [Davies Liu] address comments, fix tests
3a1d821 [Davies Liu] fix test
da14ced [Davies Liu] Merge branch 'master' of github.com:apache/spark into safe-conf
ae4d305 [Davies Liu] change to j.u.c.ConcurrentMap
f8fa1cf [Davies Liu] change to TrieMap
a1d769a [Davies Liu] make SparkConf thread-safe
(cherry picked from commit 142093179a4c40bdd90744191034de7b94a963ff)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
file was renamed to completed file"
This reverts commit 8f55beeb51e6ea72e63af3f276497f61dd24d09b.
| |
renamed to completed file
`FsHistoryProvider` tries to update application status, but if `checkForLogs` is called before the `.inprogress` file is renamed to a completed file, the file is not recognized as completed.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #4132 from sarutak/SPARK-5344 and squashes the following commits:
9658008 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5344
d2c72b6 [Kousuke Saruta] Fixed update issue of FsHistoryProvider
(cherry picked from commit 8f5c827b01026bf45fc774ed7387f11a941abea8)
Signed-off-by: Andrew Or <andrew@databricks.com>
Conflicts:
core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
| |
also rename "slaveHostname" to "executorHostname"
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #4195 from ryan-williams/exec and squashes the following commits:
e60a7bb [Ryan Williams] log executor ID at executor-construction time
(cherry picked from commit aea25482c370fbcf712a464501605bc16ee4ed5d)
Signed-off-by: Andrew Or <andrew@databricks.com>
Conflicts:
core/src/main/scala/org/apache/spark/executor/Executor.scala
| |
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #4194 from ryan-williams/metrics and squashes the following commits:
7c5a33f [Ryan Williams] set executor ID before creating MetricsSystem
| |
This patch adds more helpful error messages for invalid programs that define nested RDDs, broadcast RDDs, perform actions inside of transformations (e.g. calling `count()` from inside of `map()`), and call certain methods on stopped SparkContexts. Currently, these invalid programs lead to confusing NullPointerExceptions at runtime and have been a major source of questions on the mailing list and StackOverflow.
In a few cases, I chose to log warnings instead of throwing exceptions in order to avoid any chance that this patch breaks programs that worked "by accident" in earlier Spark releases (e.g. programs that define nested RDDs but never run any jobs with them).
In SparkContext, the new `assertNotStopped()` method is used to check whether methods are being invoked on a stopped SparkContext. In some cases, user programs will not crash in spite of calling methods on stopped SparkContexts, so I've only added `assertNotStopped()` calls to methods that always throw exceptions when called on stopped contexts (e.g. by dereferencing a null `dagScheduler` pointer).
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3884 from JoshRosen/SPARK-5063 and squashes the following commits:
a38774b [Josh Rosen] Fix spelling typo
a943e00 [Josh Rosen] Convert two exceptions into warnings in order to avoid breaking user programs in some edge-cases.
2d0d7f7 [Josh Rosen] Fix test to reflect 1.2.1 compatibility
3f0ea0c [Josh Rosen] Revert two unintentional formatting changes
8e5da69 [Josh Rosen] Remove assertNotStopped() calls for methods that were sometimes safe to call on stopped SC's in Spark 1.2
8cff41a [Josh Rosen] IllegalStateException fix
6ef68d0 [Josh Rosen] Fix Python line length issues.
9f6a0b8 [Josh Rosen] Add improved error messages to PySpark.
13afd0f [Josh Rosen] SparkException -> IllegalStateException
8d404f3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5063
b39e041 [Josh Rosen] Fix BroadcastSuite test which broadcasted an RDD
99cc09f [Josh Rosen] Guard against calling methods on stopped SparkContexts.
34833e8 [Josh Rosen] Add more descriptive error message.
57cc8a1 [Josh Rosen] Add error message when directly broadcasting RDD.
15b2e6b [Josh Rosen] [SPARK-5063] Useful error messages for nested RDDs and actions inside of transformations
(cherry picked from commit cef1f092a628ac20709857b4388bb10e0b5143b0)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
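The guard described above can be sketched in miniature (hypothetical class, not Spark's actual `SparkContext`): a volatile flag set by `stop()`, checked at the top of methods that would otherwise fail with a confusing `NullPointerException`.

```Scala
// Minimal sketch of the assertNotStopped() pattern.
class MiniContext {
  @volatile private var stopped = false

  private def assertNotStopped(): Unit = {
    if (stopped) {
      throw new IllegalStateException("Cannot call methods on a stopped context")
    }
  }

  def stop(): Unit = { stopped = true }

  // Guarded method: fails fast with a clear message instead of an NPE
  // from dereferencing an already-nulled internal field.
  def parallelize(data: Seq[Int]): Seq[Int] = {
    assertNotStopped()
    data
  }
}
```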
| |
SparkConf is not thread-safe, but it is accessed by many threads. `getAll()` could return only part of the configs if another thread is modifying it concurrently.
This PR changes SparkConf.settings to a thread-safe TrieMap.
Author: Davies Liu <davies@databricks.com>
Closes #4143 from davies/safe-conf and squashes the following commits:
f8fa1cf [Davies Liu] change to TrieMap
a1d769a [Davies Liu] make SparkConf thread-safe
(cherry picked from commit 9bad062268676aaa66dcbddd1e0ab7f2d7742425)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
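The idea can be sketched as follows (hypothetical class, not Spark's actual `SparkConf`): `TrieMap` supports lock-free concurrent updates, and its iterators work on a consistent snapshot, so `getAll` never observes a half-written set of configs.

```Scala
import scala.collection.concurrent.TrieMap

// Sketch of a thread-safe config backed by a TrieMap.
class SafeConf {
  private val settings = TrieMap[String, String]()

  def set(key: String, value: String): this.type = {
    settings(key) = value  // lock-free concurrent update
    this
  }

  def get(key: String): Option[String] = settings.get(key)

  // Iteration over a TrieMap uses a snapshot, so this is a
  // consistent view even while other threads keep writing.
  def getAll: Array[(String, String)] = settings.toArray
}
```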
| |
Whenever a directory is created by the utility method, immediately restrict
its permissions so that only the owner has access to its contents.
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
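The "restrict to owner" step can be sketched with plain `java.io.File` permission setters (illustrative helper; the actual change is in Spark's utility code): clear each permission for everyone, then re-grant it to the owner only — the JDK equivalent of `chmod 700`.

```Scala
import java.io.File

// Sketch: create a directory readable/writable/executable only by its owner.
def createOwnerOnlyDir(parent: File, name: String): File = {
  val dir = new File(parent, name)
  dir.mkdirs()
  dir.setReadable(false, false)   // clear read for everyone...
  dir.setReadable(true, true)     // ...then re-grant to owner only
  dir.setWritable(false, false)
  dir.setWritable(true, true)
  dir.setExecutable(false, false)
  dir.setExecutable(true, true)
  dir
}
```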
| |
https://issues.apache.org/jira/browse/SPARK-5006
I think the issue is produced in https://github.com/apache/spark/pull/1777.
I haven't dug into the Mesos backend yet; the same logic may need to be added there as well.
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>
Closes #3841 from WangTaoTheTonic/SPARK-5006 and squashes the following commits:
8cdf96d [WangTao] indent thing
2d86d65 [WangTaoTheTonic] fix line length
7cdfd98 [WangTaoTheTonic] fit for new HttpServer constructor
61a370d [WangTaoTheTonic] some minor fixes
bc6e1ec [WangTaoTheTonic] rebase
67bcb46 [WangTaoTheTonic] put conf at 3rd position, modify suite class, add comments
f450cd1 [WangTaoTheTonic] startServiceOnPort will use a SparkConf arg
29b751b [WangTaoTheTonic] rebase as ExecutorRunnableUtil changed to ExecutorRunnable
396c226 [WangTaoTheTonic] make the grammar more like scala
191face [WangTaoTheTonic] invalid value name
62ec336 [WangTaoTheTonic] spark.port.maxRetries doesn't work
Conflicts:
external/mqtt/src/test/scala/org/apache/spark/streaming/mqtt/MQTTStreamSuite.scala
| |
Hi all - I've renamed the unhelpfully named variable and added a comment clarifying what's actually happening.
Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Closes #3666 from ilganeli/SPARK-4569B and squashes the following commits:
1810394 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator
e2d2092 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator
d7cefec [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator
5b3f39c [Ilya Ganelin] [SPARK-4569] Rename in Aggregator
| |
The driver hangs sometimes when we coalesce RDD partitions. See JIRA for more details and reproduction.
This is because our use of empty string as default preferred location in `CoalescedRDDPartition` causes the `TaskSetManager` to schedule the corresponding task on host `""` (empty string). The intended semantics here, however, is that the partition does not have a preferred location, and the TSM should schedule the corresponding task accordingly.
Author: Andrew Or <andrew@databricks.com>
Closes #3633 from andrewor14/coalesce-preferred-loc and squashes the following commits:
e520d6b [Andrew Or] Oops
3ebf8bd [Andrew Or] A few comments
f370a4e [Andrew Or] Fix tests
2f7dfb6 [Andrew Or] Avoid using empty string as default preferred location
(cherry picked from commit 4f93d0cabe5d1fc7c0fd0a33d992fd85df1fecb4)
Signed-off-by: Andrew Or <andrew@databricks.com>
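The underlying modeling point — "no preference" should be a distinct value, not a sentinel empty string — can be sketched like this (hypothetical types, not Spark's actual `CoalescedRDDPartition`):

```Scala
// Sketch: model "no preferred location" explicitly with Option
// instead of overloading "" as a sentinel the scheduler misreads.
case class PartitionLoc(preferredLocation: Option[String])

val noPref = PartitionLoc(None)            // scheduler: place anywhere
val onHost = PartitionLoc(Some("host-1"))  // scheduler: prefer host-1
```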
| |
... by Piotr Kolaczkowski)
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Closes #4113 from jacek-lewandowski/SPARK-4660-master and squashes the following commits:
a5e84ca [Jacek Lewandowski] SPARK-4660: Use correct class loader in JavaSerializer (copy of PR #3840 by Piotr Kolaczkowski)
(cherry picked from commit c93a57f0d6dc32b127aa68dbe4092ab0b22a9667)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
| |
method
There is an int overflow in the ParallelCollectionRDD.slice method. This was originally reported by SaintBacchus.
```
sc.makeRDD(1 to (Int.MaxValue)).count // result = 0
sc.makeRDD(1 to (Int.MaxValue - 1)).count // result = 2147483646 = Int.MaxValue - 1
sc.makeRDD(1 until (Int.MaxValue)).count // result = 2147483646 = Int.MaxValue - 1
```
see https://github.com/apache/spark/pull/2874 for more details.
This PR tries to fix the overflow. However, there's another issue it doesn't address.
```
val largeRange = Int.MinValue to Int.MaxValue
largeRange.length // throws java.lang.IllegalArgumentException: -2147483648 to 2147483647 by 1: seqs cannot contain more than Int.MaxValue elements.
```
So the range we feed to sc.makeRDD cannot contain more than Int.MaxValue elements. This is a limitation of Scala. We may want to support that kind of range eventually, but the fix is beyond this PR.
srowen andrewor14 would you mind taking a look at this PR?
Author: Ye Xianjin <advancedxy@gmail.com>
Closes #4002 from advancedxy/SPARk-5201 and squashes the following commits:
96265a1 [Ye Xianjin] Update slice method comment and some responding docs.
e143d7a [Ye Xianjin] Update inclusive range check for splitting inclusive range.
b3f5577 [Ye Xianjin] We can include the last element in the last slice in general for inclusive range, hence eliminate the need to check Int.MaxValue or Int.MinValue.
7d39b9e [Ye Xianjin] Convert the two cases pattern matching to one case.
651c959 [Ye Xianjin] rename sign to needsInclusiveRange. add some comments
196f8a8 [Ye Xianjin] Add test cases for ranges end with Int.MaxValue or Int.MinValue
e66e60a [Ye Xianjin] Deal with inclusive and exclusive ranges in one case. If the range is inclusive and the end of the range is (Int.MaxValue or Int.MinValue), we should use inclusive range instead of exclusive
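The usual way to dodge this class of overflow — and roughly what the fix does — is to compute slice boundaries in `Long` arithmetic rather than counting range elements with `Int`s (a sketch under that assumption, not the exact Spark code):

```Scala
// Sketch: overflow-safe slice boundaries. `i * length` promotes to Long,
// so boundaries stay correct even when length is close to Int.MaxValue.
def positions(length: Long, numSlices: Int): Iterator[(Long, Long)] =
  (0 until numSlices).iterator.map { i =>
    val start = (i * length) / numSlices
    val end = ((i + 1) * length) / numSlices
    (start, end)
  }
```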
| |
Currently Spark lets you set the IP address using SPARK_LOCAL_IP, but this is given to Akka only after a reverse DNS lookup. That makes it difficult to run Spark in Docker. You can already change the hostname that is used programmatically, but it would be nice to be able to do this with an environment variable as well.
Author: Michael Armbrust <michael@databricks.com>
Closes #3893 from marmbrus/localHostnameEnv and squashes the following commits:
85045b6 [Michael Armbrust] Optionally read from SPARK_LOCAL_HOSTNAME
(cherry picked from commit a3978f3e156e0ca67e978f1795b238ddd69ff9a6)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
| |
CompressedMapStatus and HighlyCompressedMapStatus need to be registered with Kryo, because they are subclasses of MapStatus.
Author: lianhuiwang <lianhuiwang09@gmail.com>
Closes #4007 from lianhuiwang/SPARK-5102 and squashes the following commits:
9d2238a [lianhuiwang] remove register of MapStatus
05a285d [lianhuiwang] subclass of MapStatus needs to be registered with Kryo
(cherry picked from commit ef9224e08010420b570c21a0b9208d22792a24fe)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
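The gotcha generalizes: Kryo registration is per concrete class, not per supertype, so registering a parent type does not cover its subclasses. A generic sketch (illustrative class names, not Spark's):

```Scala
import com.esotericsoftware.kryo.Kryo

sealed trait Status
class Compressed extends Status
class HighlyCompressed extends Status

// Registering Status alone would NOT cover its subclasses; each class
// that can actually appear on the wire must be registered itself.
def registerClasses(kryo: Kryo): Unit = {
  kryo.register(classOf[Compressed])
  kryo.register(classOf[HighlyCompressed])
}
```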
| |
A few changes to fix this issue:
1. Handle the case that receiving `SparkListenerTaskStart` before `SparkListenerBlockManagerAdded`.
2. Don't add `executorId` to `removeTimes` when the executor is busy.
3. Use `HashMap.retain` to safely traverse the HashMap and remove items.
4. Use the same lock in ExecutorAllocationManager and ExecutorAllocationListener to fix the race condition in `totalPendingTasks`.
5. Move the blocking codes out of the message processing code in YarnSchedulerActor.
Author: zsxwing <zsxwing@gmail.com>
Closes #3783 from zsxwing/SPARK-4951 and squashes the following commits:
d51fa0d [zsxwing] Add comments
2e365ce [zsxwing] Remove expired executors from 'removeTimes' and add idle executors back when a new executor joins
49f61a9 [zsxwing] Eliminate duplicate executor registered warnings
d4c4e9a [zsxwing] Minor fixes for the code style
05f6238 [zsxwing] Move the blocking codes out of the message processing code
105ba3a [zsxwing] Fix the race condition in totalPendingTasks
d5c615d [zsxwing] Fix the issue that a busy executor may be killed
(cherry picked from commit 6942b974adad396cba2799eac1fa90448cea4da7)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
remaining even if application finished when external shuffle is enabled
When the external shuffle service is enabled, the local directories of a client-mode driver remain even after the application has finished.
I think the driver's local directories should be deleted.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #3811 from sarutak/SPARK-4973 and squashes the following commits:
ad944ab [Kousuke Saruta] Fixed DiskBlockManager to cleanup local directory if it's the driver
43770da [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973
88feecd [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973
d99718e [Kousuke Saruta] Fixed SparkSubmit.scala and DiskBlockManager.scala in order to delete local directories of the driver of local-mode when external shuffle service is enabled
(cherry picked from commit a00af6bec57b8df8b286aaa5897232475aef441c)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
SPARK-5132:
stageInfoToJson: Stage Attempt Id
stageInfoFromJson: Attempt Id
Author: hushan[胡珊] <hushan@xiaomi.com>
Closes #3932 from suyanNone/json-stage and squashes the following commits:
41419ab [hushan[胡珊]] Correct stage Attempt Id key in stageInfofromJson
(cherry picked from commit d345ebebd554ac3faa4e870bd7800ed02e89da58)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
...nt at all.
- Narrowed the scope of runAsSparkUser from MesosExecutorDriver.run to MesosExecutorBackend.launchTask
- See the Jira Issue for more details.
Author: Jongyoul Lee <jongyoul@gmail.com>
Closes #3741 from jongyoul/SPARK-4465 and squashes the following commits:
46ad71e [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Removed unused import
3d6631f [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Removed comments and adjusted indentations
2343f13 [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - fixed a scope of runAsSparkUser from MesosExecutorDriver.run to MesosExecutorBackend.launchTask
(cherry picked from commit 1c0e7ce056c79e1db96f85b8c56a479b8b043970)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.
Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists. SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.
In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times. In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions. When output spec. validation is enabled, the second calls to these actions will fail due to existing output.
This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler. This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:
36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.
(cherry picked from commit 939ba1f8f6e32fef9026cc43fce55b36e4b9bfd1)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
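The propagation mechanism is worth a sketch (hypothetical object name; Spark's real flag lives elsewhere): `scala.util.DynamicVariable` gives a thread-local, dynamically scoped value, so the scheduler can disable validation for exactly the jobs it submits, without mutating SparkConf or any global state.

```Scala
import scala.util.DynamicVariable

// Sketch of dynamically scoped bypass of output-spec validation.
object OutputSpec {
  val disableValidation = new DynamicVariable[Boolean](false)

  def validationEnabled: Boolean = !disableValidation.value
}

// The streaming scheduler would wrap its job submission like this;
// code outside the withValue block still sees validation enabled.
def submitWithBypass(job: () => Unit): Unit =
  OutputSpec.disableValidation.withValue(true) {
    job()  // checkOutputSpecs is skipped for work submitted here
  }
```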
| |
Author: Dale <tigerquoll@outlook.com>
Closes #3809 from tigerquoll/SPARK-4787 and squashes the following commits:
5661e01 [Dale] [SPARK-4787] Ensure that call to stop() doesn't lose the exception by using a finally block.
2172578 [Dale] [SPARK-4787] Stop context properly if an exception occurs during DAGScheduler initialization.
(cherry picked from commit 3fddc9468fa50e7683caa973fec6c52e1132268d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
The job launched by DriverSuite should bind the web UI to an ephemeral port, since it looks like port contention in this test has caused a large number of Jenkins failures when many builds are started simultaneously. Our tests already disable the web UI, but this doesn't affect subprocesses launched by our tests. In this case, I've opted to bind to an ephemeral port instead of disabling the UI because disabling features in this test may mask its ability to catch certain bugs.
See also: e24d3a9
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3873 from JoshRosen/driversuite-webui-port and squashes the following commits:
48cd05c [Josh Rosen] [HOTFIX] Bind web UI to ephemeral port in DriverSuite.
(cherry picked from commit 012839807c3dc6e7c8c41ac6e956d52a550bb031)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
This should fix a major cause of build breaks when running many parallel tests.
| |
Resolves a bug where the `Main-Class` from a .jar file wasn't being read in properly. This was caused by the fact that the `primaryResource` object was a URI and needed to be normalized through a call to `.getPath` before it could be passed into the `JarFile` object.
Author: Brennon York <brennon.york@capitalone.com>
Closes #3561 from brennonyork/SPARK-4298 and squashes the following commits:
5e0fce1 [Brennon York] Use string interpolation for error messages, moved comment line from original code to above its necessary code segment
14daa20 [Brennon York] pushed mainClass assignment into match statement, removed spurious spaces, removed { } from case statements, removed return values
c6dad68 [Brennon York] Set case statement to support multiple jar URI's and enabled the 'file' URI to load the main-class
8d20936 [Brennon York] updated to reset the error message back to the default
a043039 [Brennon York] updated to split the uri and jar vals
8da7cbf [Brennon York] fixes SPARK-4298
(cherry picked from commit 8e14c5eb551ab06c94859c7f6d8c6b62b4d00d59)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
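The URI-vs-path distinction at the heart of the bug is easy to see in isolation (path shown is a made-up example):

```Scala
import java.net.URI

// JarFile wants a filesystem path, not a URI string.
val primaryResource = new URI("file:/opt/app/main.jar")
val asString = primaryResource.toString  // "file:/opt/app/main.jar" — JarFile can't open this
val asPath = primaryResource.getPath     // "/opt/app/main.jar" — this is what JarFile needs
```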
| |
Several of our tests call System.setProperty (or test code which implicitly sets system properties) and don't always reset/clear the modified properties, which can create ordering dependencies between tests and cause hard-to-diagnose failures.
This patch removes most uses of System.setProperty from our tests, since in most cases we can use SparkConf to set these configurations (there are a few exceptions, including the tests of SparkConf itself).
For the cases where we continue to use System.setProperty, this patch introduces a `ResetSystemProperties` ScalaTest mixin class which snapshots the system properties before individual tests and to automatically restores them on test completion / failure. See the block comment at the top of the ResetSystemProperties class for more details.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3739 from JoshRosen/cleanup-system-properties-in-tests and squashes the following commits:
0236d66 [Josh Rosen] Replace setProperty uses in two example programs / tools
3888fe3 [Josh Rosen] Remove setProperty use in LocalJavaStreamingContext
4f4031d [Josh Rosen] Add note on why SparkSubmitSuite needs ResetSystemProperties
4742a5b [Josh Rosen] Clarify ResetSystemProperties trait inheritance ordering.
0eaf0b6 [Josh Rosen] Remove setProperty call in TaskResultGetterSuite.
7a3d224 [Josh Rosen] Fix trait ordering
3fdb554 [Josh Rosen] Remove setProperty call in TaskSchedulerImplSuite
bee20df [Josh Rosen] Remove setProperty calls in SparkContextSchedulerCreationSuite
655587c [Josh Rosen] Remove setProperty calls in JobCancellationSuite
3f2f955 [Josh Rosen] Remove System.setProperty calls in DistributedSuite
cfe9cce [Josh Rosen] Remove use of system properties in SparkContextSuite
8783ab0 [Josh Rosen] Remove TestUtils.setSystemProperty, since it is subsumed by the ResetSystemProperties trait.
633a84a [Josh Rosen] Remove use of system properties in FileServerSuite
25bfce2 [Josh Rosen] Use ResetSystemProperties in UtilsSuite
1d1aa5a [Josh Rosen] Use ResetSystemProperties in SizeEstimatorSuite
dd9492b [Josh Rosen] Use ResetSystemProperties in AkkaUtilsSuite
b0daff2 [Josh Rosen] Use ResetSystemProperties in BlockManagerSuite
e9ded62 [Josh Rosen] Use ResetSystemProperties in TaskSchedulerImplSuite
5b3cb54 [Josh Rosen] Use ResetSystemProperties in SparkListenerSuite
0995c4b [Josh Rosen] Use ResetSystemProperties in SparkContextSchedulerCreationSuite
c83ded8 [Josh Rosen] Use ResetSystemProperties in SparkConfSuite
51aa870 [Josh Rosen] Use withSystemProperty in ShuffleSuite
60a63a1 [Josh Rosen] Use ResetSystemProperties in JobCancellationSuite
14a92e4 [Josh Rosen] Use withSystemProperty in FileServerSuite
628f46c [Josh Rosen] Use ResetSystemProperties in DistributedSuite
9e3e0dd [Josh Rosen] Add ResetSystemProperties test fixture mixin; use it in SparkSubmitSuite.
4dcea38 [Josh Rosen] Move withSystemProperty to TestUtils class.
(cherry picked from commit 352ed6bbe3c3b67e52e298e7c535ae414d96beca)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
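The snapshot/restore idea behind the mixin can be sketched without ScalaTest (names here are illustrative; the real trait hooks into `beforeEach`/`afterEach`):

```Scala
import java.util.Properties

// Sketch: snapshot system properties before a test, restore after,
// so no test can leak property changes into later tests.
trait ResetSystemProperties {
  private var saved: Properties = _

  def snapshotProperties(): Unit = {
    val copy = new Properties()
    copy.putAll(System.getProperties)  // clone — don't alias the live object
    saved = copy
  }

  def restoreProperties(): Unit = System.setProperties(saved)
}
```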
| |
KryoSerializer
This PR fixes an issue where PySpark broadcast variables caused NullPointerExceptions if KryoSerializer was used. The fix is to register PythonBroadcast with Kryo so that it's deserialized with a KryoJavaSerializer.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3831 from JoshRosen/SPARK-4882 and squashes the following commits:
0466c7a [Josh Rosen] Register PythonBroadcast with Kryo.
d5b409f [Josh Rosen] Enable registrationRequired, which would have caught this bug.
069d8a7 [Josh Rosen] Add failing test for SPARK-4882
(cherry picked from commit efa80a531ecd485f6cf0cdc24ffa42ba17eea46d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
Author: Zhang, Liye <liye.zhang@intel.com>
Closes #3769 from liyezhang556520/spark-4920_WebVersion and squashes the following commits:
3bb7e0d [Zhang, Liye] add version on master and worker page
(cherry picked from commit 9077e721cd36adfecd50cbd1fd7735d28e5be8b5)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
takeOrdered should skip the reduce step in case the mapped RDDs have no partitions. This prevents the exception below, hit at step 4 of the JIRA reproduction when running the query:
SELECT * FROM testTable WHERE market = 'market2' ORDER BY End_Time DESC LIMIT 100;
Error trace:
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:863)
at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1136)
Author: Yash Datta <Yash.Datta@guavus.com>
Closes #3830 from saucam/fix_takeorder and squashes the following commits:
5974d10 [Yash Datta] SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions
(cherry picked from commit 9bc0df6804f241aff24520d9c6ec54d9b11f5785)
Signed-off-by: Reynold Xin <rxin@databricks.com>
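In simplified form (plain collections standing in for RDD partitions — not the actual RDD code), the fix amounts to guarding the reduce against an empty input:

```Scala
// Sketch: per-partition top-N, then merge — but return empty instead of
// calling reduce on an empty collection (which throws
// UnsupportedOperationException: empty collection).
def takeOrdered[T: Ordering](partitions: Seq[Seq[T]], num: Int): Seq[T] = {
  val perPartition = partitions.map(_.sorted.take(num)).filter(_.nonEmpty)
  if (perPartition.isEmpty) Seq.empty
  else perPartition.reduce((a, b) => (a ++ b).sorted.take(num))
}
```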
| |
SparkEnv.environmentDetails
Author: GuoQiang Li <witgo@qq.com>
Closes #3788 from witgo/SPARK-4952 and squashes the following commits:
d903529 [GuoQiang Li] Handle ConcurrentModificationExceptions in SparkEnv.environmentDetails
(cherry picked from commit 080ceb771a1e6b9f844cfd4f1baa01133c106888)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
| |
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #3460 from vanzin/SPARK-4606 and squashes the following commits:
031207d [Marcelo Vanzin] [SPARK-4606] Send EOF to child JVM when there's no more data to read.
(cherry picked from commit 7e2deb71c4239564631b19c748e95c3d1aa1c77d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
See https://issues.apache.org/jira/browse/SPARK-4730.
Author: Andrew Or <andrew@databricks.com>
Closes #3590 from andrewor14/yarn-settings and squashes the following commits:
36e0753 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-settings
dcd1316 [Andrew Or] Warn against deprecated YARN settings
(cherry picked from commit 27c5399f4dd542e36ea579956b8cb0613de25c6d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
Commit 7aacb7bfa added support for sharing downloaded files among multiple
executors of the same app. That works great in Yarn, since the app's directory
is cleaned up after the app is done.
But Spark standalone mode didn't do that, so the lock/cache files created
by that change were left around and could eventually fill up the disk hosting
/tmp.
To solve that, create app-specific directories under the local dirs when
launching executors. Multiple executors launched by the same Worker will
use the same app directories, so they should be able to share the downloaded
files. When the application finishes, a new message is sent to all workers
telling them the application has finished; once that message has been received,
and all executors registered for the application shut down, then those
directories will be cleaned up by the Worker.
Note: Unit testing this is hard (if even possible), since local-cluster mode
doesn't seem to leave the Master/Worker daemons running long enough after
`sc.stop()` is called for the clean up protocol to take effect.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #3705 from vanzin/SPARK-4834 and squashes the following commits:
b430534 [Marcelo Vanzin] Remove seemingly unnecessary synchronization.
50eb4b9 [Marcelo Vanzin] Review feedback.
c0e5ea5 [Marcelo Vanzin] [SPARK-4834] [standalone] Clean up application files after app finishes.
(cherry picked from commit dd155369a04d7dfbf6a5745cbb243e22218367dc)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
In Scala, `map` and `flatMap` on an `Iterable` eagerly copy the contents into a new `Seq`. For example:
```Scala
val iterable = Seq(1, 2, 3).map(v => {
println(v)
v
})
println("Iterable map done")
val iterator = Seq(1, 2, 3).iterator.map(v => {
println(v)
v
})
println("Iterator map done")
```
outputs
```
1
2
3
Iterable map done
Iterator map done
```
So we should use `iterator` to reduce the memory consumed by join.
Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E
Author: zsxwing <zsxwing@gmail.com>
Closes #3671 from zsxwing/SPARK-4824 and squashes the following commits:
48ee7b9 [zsxwing] Remove the explicit types
95d59d6 [zsxwing] Add 'iterator' to reduce memory consumed by join
(cherry picked from commit c233ab3d8d75a33495298964fe73dbf7dd8fe305)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
It is not convenient to see the Spark version. We can keep the same style as the Spark website.
![spark_version](https://cloud.githubusercontent.com/assets/7402327/5527025/1c8c721c-8a35-11e4-8d6a-2734f3c6bdf8.jpg)
Author: genmao.ygm <genmao.ygm@alibaba-inc.com>
Closes #3763 from uncleGen/master-clean-141222 and squashes the following commits:
0dcb9a9 [genmao.ygm] [SPARK-4920][UI]:current spark version in UI is not striking.
(cherry picked from commit de9d7d2b5b6d80963505571700e83779fd98f850)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
backport #3740 for branch-1.2
Author: zsxwing <zsxwing@gmail.com>
Closes #3758 from zsxwing/SPARK-2075-branch-1.2 and squashes the following commits:
b57d440 [zsxwing] SPARK-2075 backport for branch-1.2
| |
and Hadoop 2.+
`NullWritable` is a `Comparable` rather than `Comparable[NullWritable]` in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. It will generate different anonymous classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+. Therefore, here we provide an Ordering for NullWritable so that the compiler will generate same codes.
I used the following commands to confirm the generated byte codes are the same.
```
mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt
diff ~/hadoop1.txt ~/hadoop2.txt
```
However, the compiler will generate different codes for the classes which call methods of `JobContext/TaskAttemptContext`. `JobContext/TaskAttemptContext` is a class in Hadoop 1.+, and calling its method will use `invokevirtual`, while it's an interface in Hadoop 2.+, and will use `invokeinterface`.
To fix it, we can use reflection to call `JobContext/TaskAttemptContext.getConfiguration`.
Author: zsxwing <zsxwing@gmail.com>
Closes #3740 from zsxwing/SPARK-2075 and squashes the following commits:
39d9df2 [zsxwing] Fix the code style
e4ad8b5 [zsxwing] Use null for the implicit Ordering
734bac9 [zsxwing] Explicitly set the implicit parameters
ca03559 [zsxwing] Use reflection to access JobContext/TaskAttemptContext.getConfiguration
fa40db0 [zsxwing] Add an Ordering for NullWritable to make the compiler generate same byte codes for RDD
(cherry picked from commit 6ee6aa70b7d52408cc66bd1434cbeae3212e3f01)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Mvn build failed: value defaultProperties not found. Maybe related to this PR:
https://github.com/apache/spark/commit/1d648123a77bbcd9b7a34cc0d66c14fa85edfecd
andrewor14 can you look at this problem?
Author: huangzhaowei <carlmartinmax@gmail.com>
Closes #3749 from SaintBacchus/Mvn-Build-Fail and squashes the following commits:
8e2917c [huangzhaowei] Build Failed: value defaultProperties not found
(cherry picked from commit a764960b3b6d842eef7fa4777c8fa99d3f60fa1e)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since we can set the spark executor memory and executor cores via the properties file, we should also be allowed to set the number of executor instances.
Author: Kanwaljit Singh <kanwaljit.singh@guavus.com>
Closes #1657 from kjsingh/branch-1.0 and squashes the following commits:
d8a5a12 [Kanwaljit Singh] SPARK-2641: Fixing how spark arguments are loaded from properties file for num executors
Conflicts:
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
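With this fix, the executor count can sit alongside the other executor settings in the properties file (e.g. `spark-defaults.conf`); the values below are illustrative:

```
spark.executor.memory     4g
spark.executor.cores      2
spark.executor.instances  8
```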
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #2848 from ryan-williams/fetch-file and squashes the following commits:
c14daff [Ryan Williams] Fix copy that was changed to a move inadvertently
8e39c16 [Ryan Williams] code review feedback
788ed41 [Ryan Williams] don’t redundantly overwrite executor JAR deps
(cherry picked from commit 7981f969762e77f1752ef8f86c546d4fc32a1a4f)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #3736 from ryan-williams/hist and squashes the following commits:
421d8ff [Ryan Williams] add another random typo fix
76d6a4c [Ryan Williams] remove hdfs example
a2d0f82 [Ryan Williams] code review feedback
9ca7629 [Ryan Williams] [SPARK-4889] update history server example cmds
(cherry picked from commit cdb2c645ab769a8678dd81cff44a809fcfa4420b)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Sandy Ryza <sandy@cloudera.com>
Closes #3684 from sryza/sandy-spark-3428 and squashes the following commits:
cb827fe [Sandy Ryza] SPARK-3428. TaskMetrics for running tasks is missing GC time metrics
(cherry picked from commit 283263ffaa941e7e9ba147cf0ad377d9202d3761)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is such that the `ExecutorAllocationManager` does not take the `SparkContext`, with all of its dependencies, as an argument. This prevents future developers from tying this class down further to the `SparkContext`, which has really become quite a monstrous object.
cc'ing pwendell who originally suggested this, and JoshRosen who may have thoughts about the trait mix-in style of `SparkContext`.
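The shape of the refactor, sketched with a stand-in context class (the real trait's method signatures may differ):

```scala
// The narrow interface the manager is allowed to see.
trait ExecutorAllocationClient {
  def requestExecutors(numAdditionalExecutors: Int): Boolean
  def killExecutors(executorIds: Seq[String]): Boolean
}

// Stand-in for SparkContext: it mixes the trait in and keeps its many
// other responsibilities out of the manager's reach.
class DemoSparkContext extends ExecutorAllocationClient {
  override def requestExecutors(numAdditionalExecutors: Int): Boolean = true
  override def killExecutors(executorIds: Seq[String]): Boolean = true
}

// The manager depends only on the trait, not on the whole context.
class ExecutorAllocationManager(client: ExecutorAllocationClient) {
  def addExecutors(n: Int): Boolean = client.requestExecutors(n)
}
```

Because the constructor accepts any `ExecutorAllocationClient`, the manager can also be unit-tested against a stub without standing up a real context.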
Author: Andrew Or <andrew@databricks.com>
Closes #3614 from andrewor14/dynamic-allocation-sc and squashes the following commits:
187070d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
59baf6c [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
347a348 [Andrew Or] Refactor SparkContext into ExecutorAllocationClient
(cherry picked from commit 9804a759b68f56eceb8a2f4ea90f76a92b5f9f67)
Signed-off-by: Andrew Or <andrew@databricks.com>
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is used in NioBlockTransferService here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/nio/NioBlockTransferService.scala#L66
Author: Aaron Davidson <aaron@databricks.com>
Closes #3688 from aarondav/SPARK-4837 and squashes the following commits:
ebd2007 [Aaron Davidson] [SPARK-4837] NettyBlockTransferService should use spark.blockManager.port config
(cherry picked from commit 105293a7d06b26e7b179a0447eb802074ee9c218)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Rewording was based on this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-td9804.html
This is the associated JIRA ticket: https://issues.apache.org/jira/browse/SPARK-4884
Author: Madhu Siddalingaiah <madhu@madhu.com>
Closes #3722 from msiddalingaiah/master and squashes the following commits:
79e679f [Madhu Siddalingaiah] [DOC]: improve documentation
51d14b9 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
38faca4 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again)
332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code>
cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions
(cherry picked from commit d5a596d4188bfa85ff49ee85039f54255c19a4de)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This looked like perhaps a simple and important one. `combineByKey` looks like it should clean its arguments' closures, and that in turn covers apparently all remaining functions in `PairRDDFunctions` which delegate to it.
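Why cleaning matters, in a self-contained sketch with no Spark types: `Holder` is a made-up class standing in for any non-serializable enclosing object, and the "clean" variant shows by hand what `ClosureCleaner` does automatically (dropping the unneeded reference to the outer object).

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// A function literal that reads a field captures `this`, so the closure
// drags the whole (non-serializable) enclosing object along with it.
class Holder { // note: not Serializable
  val bonus = 10
  val dirty: Int => Int = x => x + bonus                 // captures `this`
  val clean: Int => Int = { val b = bonus; x => x + b }  // captures only an Int
}

object ClosureDemo {
  // True iff the value survives Java serialization, which is what Spark
  // needs before shipping a closure to executors.
  def serializable(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(f)
      true
    } catch { case _: NotSerializableException => false }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    println(serializable(h.dirty))
    println(serializable(h.clean))
  }
}
```

Without cleaning, the `dirty` form fails at serialization time; with it, only the captured primitive is shipped.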
Author: Sean Owen <sowen@cloudera.com>
Closes #3690 from srowen/SPARK-785 and squashes the following commits:
8df68fe [Sean Owen] Clean context of most remaining functions in PairRDDFunctions, which ultimately call combineByKey
(cherry picked from commit 2a28bc61009a170af3853c78f7f36970898a6d56)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|