path: root/core
Commit message | Author | Date | Files | Lines (-/+)
* [SPARK-14505][CORE] Fix bug: after creating two SparkContext objects in the same JVM, the first one cannot run any task! (Allen, 2016-05-01; 2 files, -15/+16)
  After creating two SparkContext objects in the same JVM (the second one cannot be created successfully), using the first one to run a job throws an exception like the one in this screenshot: ![image](https://cloud.githubusercontent.com/assets/7162889/14402832/0c8da2a6-fe73-11e5-8aba-68ee3ddaf605.png)
  Author: Allen <yufan_1990@163.com>
  Closes #12273 from the-sea/context-create-bug.
* [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0 (Herman van Hovell, 2016-04-30; 1 file, -9/+0)
  #### What changes were proposed in this pull request?
  This PR removes three methods that were deprecated in 1.6.0:
  - `PortableDataStream.close()`
  - `LinearRegression.weights`
  - `LogisticRegression.weights`
  The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.
  #### How was this patch tested?
  Compilation succeeded.
  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #12732 from hvanhovell/SPARK-14952.
* [SPARK-15028][SQL] Remove HiveSessionState.setDefaultOverrideConfs (Reynold Xin, 2016-04-30; 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  This patch removes some code that is no longer relevant, mainly HiveSessionState.setDefaultOverrideConfs.
  ## How was this patch tested?
  N/A
  Author: Reynold Xin <rxin@databricks.com>
  Closes #12806 from rxin/SPARK-15028.
* [SPARK-15010][CORE] New accumulator should be tolerant of local RPC message delivery (Wenchen Fan, 2016-04-29; 1 file, -2/+7)
  ## What changes were proposed in this pull request?
  The RPC framework does not serialize and deserialize messages in local mode, so we should not call `acc.value` when receiving a heartbeat message: the serialization hook of the new accumulator may not have been triggered and the `atDriverSide` flag may not be set.
  ## How was this patch tested?
  Tested locally via spark shell.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #12795 from cloud-fan/bug.
* [SPARK-15003] Use ConcurrentHashMap in place of HashMap for NewAccumulator.originals (tedyu, 2016-04-30; 2 files, -13/+10)
  ## What changes were proposed in this pull request?
  This PR proposes to use ConcurrentHashMap in place of HashMap for NewAccumulator.originals. This should result in better performance.
  ## How was this patch tested?
  Existing unit test suite. cc cloud-fan
  Author: tedyu <yuzhihong@gmail.com>
  Closes #12776 from tedyu/master.
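For context, a minimal sketch of the data-structure swap this commit describes: replacing a lock-guarded HashMap with a ConcurrentHashMap in an accumulator registry. The object and method names here are illustrative, not the actual Spark code.

```scala
import java.util.concurrent.ConcurrentHashMap

object AccumulatorRegistrySketch {
  // Keyed by accumulator id. ConcurrentHashMap gives lock-free reads and
  // fine-grained locking on writes, instead of one lock guarding a plain HashMap.
  private val originals = new ConcurrentHashMap[Long, AnyRef]()

  def register(id: Long, acc: AnyRef): Unit = originals.putIfAbsent(id, acc)

  def get(id: Long): Option[AnyRef] = Option(originals.get(id))

  def remove(id: Long): Unit = originals.remove(id)
}
```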
* [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR (Sun Rui, 2016-04-29; 3 files, -5/+12)
  ## What changes were proposed in this pull request?
  dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame. The function signature is:
  dapply(df, function(localDF) {}, schema = NULL)
  R function input: local data.frame from the partition on the local node. R function output: local data.frame.
  Schema specifies the Row format of the resulting DataFrame. It must match the R function's output. If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such a resulting DataFrame can be processed by successive calls to dapply().
  ## How was this patch tested?
  SparkR unit tests.
  Author: Sun Rui <rui.sun@intel.com>
  Author: Sun Rui <sunrui2016@gmail.com>
  Closes #12493 from sun-rui/SPARK-12919.
* [HOTFIX][CORE] Fix a concurrency issue in NewAccumulator (Wenchen Fan, 2016-04-28; 3 files, -6/+12)
  ## What changes were proposed in this pull request?
  `AccumulatorContext` is not thread-safe, which is why all of its methods are synchronized. However, there is one exception: `AccumulatorContext.originals`. `NewAccumulator` uses it to check whether it is registered, which is wrong because that access is not synchronized. This PR marks `AccumulatorContext.originals` as `private`, so all access to `AccumulatorContext` now goes through synchronized methods.
  ## How was this patch tested?
  Verified locally. To be safe, we can let Jenkins run it many times to make sure the problem is gone.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #12773 from cloud-fan/debug.
* Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵Yin Huai2016-04-284-1/+79
| | | | | | spark-mllib-local" This reverts commit dae538a4d7c36191c1feb02ba87ffc624ab960dc.
* [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local (Pravin Gadakh, 2016-04-28; 4 files, -79/+1)
  ## What changes were proposed in this pull request?
  This PR adds `since` tags to the matrix and vector classes in spark-mllib-local.
  ## How was this patch tested?
  Scala-style checks passed.
  Author: Pravin Gadakh <prgadakh@in.ibm.com>
  Closes #12416 from pravingadakh/SPARK-14613.
* [SPARK-14935][CORE] DistributedSuite "local-cluster format" shouldn't actually launch clusters (Xin Ren, 2016-04-28; 1 file, -12/+15)
  https://issues.apache.org/jira/browse/SPARK-14935
  In DistributedSuite, the "local-cluster format" test actually launches a bunch of clusters, but this doesn't seem necessary for what should just be a unit test of a regex. We should clean up the code so that this is testable without actually launching a cluster, which should buy us about 20 seconds per build.
  Passed unit test on my local machine.
  Author: Xin Ren <iamshrek@126.com>
  Closes #12744 from keypointt/SPARK-14935.
* [SPARK-14576][WEB UI] Spark console should display Web UI url (Ergin Seyfe, 2016-04-28; 2 files, -6/+10)
  ## What changes were proposed in this pull request?
  This is a proposal to print the Spark Driver UI link when spark-shell is launched.
  ## How was this patch tested?
  Launched spark-shell in local mode and cluster mode. The spark-shell console output included the following line: "Spark context Web UI available at <Spark web url>"
  Author: Ergin Seyfe <eseyfe@fb.com>
  Closes #12341 from seyfe/spark_console_display_webui_link.
* [SPARK-14654][CORE] New accumulator API (Wenchen Fan, 2016-04-28; 42 files, -587/+905)
  ## What changes were proposed in this pull request?
  This PR introduces a new accumulator API which is much simpler than before:
  1. The type hierarchy is simplified; now we only have an `Accumulator` class.
  2. The `initialValue` and `zeroValue` concepts are combined into just one concept: `zeroValue`.
  3. There is only one `register` method; accumulator registration and cleanup registration are combined.
  4. The `id`, `name` and `countFailedValues` are combined into an `AccumulatorMetadata`, which is provided during registration.
  `SQLMetric` is a good example of the simplicity of this new API.
  What we break:
  1. No `setValue` anymore. In the new API the intermediate type can be different from the result type, so it is very hard to implement a general `setValue`.
  2. An accumulator can't be serialized before it is registered.
  Problems to be addressed in follow-ups:
  1. With this new API, `AccumulatorInfo` doesn't make a lot of sense: the partial output is not partial updates, so we need to expose the intermediate value.
  2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics; how about sending a heartbeat to update internal metrics after the failure event?
  3. The public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` doesn't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` uses it to retrieve SQL metrics.
  ## How was this patch tested?
  Existing tests.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #12612 from cloud-fan/acc.
* [SPARK-10001][CORE] Don't short-circuit actions in signal handlers (Jakob Odersky, 2016-04-27; 1 file, -3/+5)
  ## What changes were proposed in this pull request?
  The current signal handlers have a subtle bug: they stop evaluating registered actions as soon as one of them returns true, because `forall` is short-circuited. This PR adds a strict mapping stage before evaluating the combined result. There are no known occurrences of the bug and this is a preemptive fix.
  ## How was this patch tested?
  As with the original introduction of signal handlers, this was tested manually (unit testing with signals is not straightforward).
  Author: Jakob Odersky <jakob@odersky.com>
  Closes #12745 from jodersky/SPARK-10001-hotfix.
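The bug and the fix boil down to strict versus short-circuited evaluation. A small hedged sketch of the pattern (not the actual code in Spark's signal-handling utilities):

```scala
object SignalHandlerSketch {
  // Each registered action returns true if it handled the signal.
  def runActions(actions: Seq[() => Boolean]): Boolean = {
    // Buggy version: `forall` short-circuits as soon as one action returns true
    // (because `!action()` is then false), so the remaining actions never run:
    //   actions.forall(action => !action())

    // Fixed version: evaluate every action strictly first, then combine.
    val results = actions.map(action => action())
    // true means no action handled the signal, so escalate to the default handler.
    results.forall(handled => !handled)
  }
}
```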
* [SPARK-14966] SizeEstimator should ignore classes in the scala.reflect package (Josh Rosen, 2016-04-27; 1 file, -0/+3)
  In local profiling, I noticed SizeEstimator spending tons of time estimating the size of objects which contain TypeTag or ClassTag fields. The problem with these tags is that they reference global Scala reflection objects, which, in turn, reference many singletons, such as TestHive. This throws off the accuracy of the size estimation and wastes tons of time traversing a huge object graph. As a result, I think that SizeEstimator should ignore any classes in the `scala.reflect` package.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12741 from JoshRosen/ignore-scala-reflect-in-size-estimator.
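A rough illustration of the kind of guard described above; the helper names here are hypothetical, not the actual SizeEstimator code:

```scala
object SizeEstimatorSketch {
  // Skip classes whose object graphs reach into global Scala reflection state
  // (e.g. TypeTag/ClassTag fields), which distorts the estimate and makes the
  // traversal walk a huge singleton graph.
  def shouldSkip(cls: Class[_]): Boolean =
    cls.getName.startsWith("scala.reflect")

  // During traversal, a field of such a type would contribute nothing.
  def fieldSize(cls: Class[_], estimate: Class[_] => Long): Long =
    if (shouldSkip(cls)) 0L else estimate(cls)
}
```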
* [SPARK-14729][SCHEDULER] Refactored YARN scheduler creation code to use newly added ExternalClusterManager (Hemant Bhanawat, 2016-04-27; 2 files, -67/+1)
  ## What changes were proposed in this pull request?
  With the addition of the ExternalClusterManager (ECM) interface in PR #11723, any cluster manager can now be integrated with Spark. It was suggested in the ExternalClusterManager PR that one of the existing cluster managers should start using the new interface to ensure that the API is correct. Ideally, all the existing cluster managers should eventually use the ECM interface, but as a first step YARN will now use it. This PR refactors the YARN code from the SparkContext.createTaskScheduler function into a YarnClusterManager that implements the ECM interface.
  ## How was this patch tested?
  Since this is refactoring, no new tests have been added. Existing tests have been run. Basic manual testing with YARN was done too.
  Author: Hemant Bhanawat <hemant@snappydata.io>
  Closes #12641 from hbhanawat/yarnClusterMgr.
* Unintentional white spaces in kryo classes configuration parameters (Victor Chima, 2016-04-27; 2 files, -3/+4)
  ## What changes were proposed in this pull request?
  Trimmed off whitespace present in the user-provided comma-separated list of classes for **spark.kryo.classesToRegister** and **spark.kryo.registrator**.
  ## How was this patch tested?
  Manual tests.
  Author: Victor Chima <blazy2k9@gmail.com>
  Closes #12701 from blazy2k9/master.
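A minimal sketch of the trimming described above, using a hypothetical helper rather than Spark's actual SparkConf/Kryo configuration code:

```scala
object KryoConfSketch {
  // "com.example.A, com.example.B ,com.example.C" -> trimmed class names.
  def parseClassList(raw: String): Seq[String] =
    raw.split(',').map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // e.g. the value of spark.kryo.classesToRegister
    val classes = parseClassList(" com.example.A, com.example.B ")
    assert(classes == Seq("com.example.A", "com.example.B"))
  }
}
```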
* [MINOR][BUILD] Enable RAT checking on `LZ4BlockInputStream.java` (Dongjoon Hyun, 2016-04-27; 1 file, -2/+1)
  ## What changes were proposed in this pull request?
  Since `LZ4BlockInputStream.java` is not licensed to the Apache Software Foundation (ASF), the Apache License header of that file has not been monitored until now. This PR aims to enable RAT checking on `LZ4BlockInputStream.java` by removing it from `dev/.rat-excludes`. This will prevent accidental removal of the Apache License header from that file.
  ## How was this patch tested?
  Pass the Jenkins tests (specifically, the RAT check stage).
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12677 from dongjoon-hyun/minor_rat_exclusion_file.
* [SPARK-14911][CORE] Fix a potential data race in TaskMemoryManager (Liwei Lin, 2016-04-26; 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  [[SPARK-13210][SQL] catch OOM when allocate memory and expand array](https://github.com/apache/spark/commit/37bc203c8dd5022cb11d53b697c28a737ee85bcc) introduced an `acquiredButNotUsed` field, but it might not be correctly synchronized:
  - the write `acquiredButNotUsed += acquired` is guarded by `this` lock (see [here](https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271));
  - the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed, taskAttemptId, tungstenMemoryMode)` (see [here](https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400)) might not be correctly synchronized, and thus might not see `acquiredButNotUsed`'s most recent value.
  This patch makes `acquiredButNotUsed` volatile to fix this.
  ## How was this patch tested?
  This should be covered by existing suites.
  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #12681 from lw-lin/fix-acquiredButNotUsed.
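The race comes down to a field written under a lock but read without it; marking it volatile guarantees visibility of the latest write. A simplified Scala sketch (the real TaskMemoryManager is Java, and the class below is illustrative):

```scala
class TaskMemorySketch {
  // Without @volatile, a thread reading this field outside the lock may see a
  // stale value even though another thread updated it under `this` lock.
  @volatile private var acquiredButNotUsed: Long = 0L

  def recordAcquired(acquired: Long): Unit = synchronized {
    acquiredButNotUsed += acquired
  }

  // May run on a different thread that does not hold `this` lock.
  def releaseUnused(release: Long => Unit): Unit = {
    release(acquiredButNotUsed)
  }
}
```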
* [SPARK-14756][CORE] Use parseLong instead of valueOf (Azeem Jiva, 2016-04-26; 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  Use Long.parseLong, which returns a primitive. Also, using a series of append() calls avoids the creation of an extra StringBuilder.
  ## How was this patch tested?
  Unit tests.
  Author: Azeem Jiva <azeemj@gmail.com>
  Closes #12520 from javawithjiva/minor.
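The two micro-optimizations mentioned above, illustrated on made-up snippets rather than the actual Spark call sites:

```scala
object ParseLongSketch {
  def main(args: Array[String]): Unit = {
    val s = "12345"

    // Long.valueOf returns a boxed java.lang.Long; parseLong returns a primitive.
    val boxed: java.lang.Long = java.lang.Long.valueOf(s)
    val primitive: Long = java.lang.Long.parseLong(s)

    // Concatenating inside append() builds a temporary StringBuilder;
    // chaining append() calls avoids that extra allocation.
    val sb = new java.lang.StringBuilder()
    sb.append("id=" + s)           // extra temporary StringBuilder
    sb.append("id=").append(s)     // no extra allocation
    println(s"$boxed $primitive ${sb.length}")
  }
}
```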
* [SPARK-14889][SPARK CORE] scala.MatchError: NONE (of class scala.Enumeration) when spark.scheduler.mode=NONE (Subhobrata Dey, 2016-04-26; 2 files, -0/+4)
  ## What changes were proposed in this pull request?
  Handle the exception for the issue below:
  ```
  ➜ spark git:(master) ✗ ./bin/spark-shell -c spark.scheduler.mode=NONE
  16/04/25 09:15:00 ERROR SparkContext: Error initializing SparkContext.
  scala.MatchError: NONE (of class scala.Enumeration$Val)
    at org.apache.spark.scheduler.Pool.<init>(Pool.scala:53)
    at org.apache.spark.scheduler.TaskSchedulerImpl.initialize(TaskSchedulerImpl.scala:131)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2352)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:492)
  ```
  The exception now looks like:
  ```
  java.lang.RuntimeException: The scheduler mode NONE is not supported by Spark.
  ```
  ## How was this patch tested?
  Manual tests.
  Author: Subhobrata Dey <sbcd90@gmail.com>
  Closes #12666 from sbcd90/schedulerModeIssue.
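A sketch of turning the MatchError into a clear error message; the enumeration and builder names are illustrative, not the exact TaskSchedulerImpl code:

```scala
object SchedulerModeSketch {
  object SchedulingMode extends Enumeration {
    val FIFO, FAIR, NONE = Value
  }

  def builderFor(mode: SchedulingMode.Value): String = mode match {
    case SchedulingMode.FIFO => "FIFOSchedulableBuilder"
    case SchedulingMode.FAIR => "FairSchedulableBuilder"
    case other =>
      // Fail with a clear message instead of letting a scala.MatchError escape.
      throw new RuntimeException(s"The scheduler mode $other is not supported by Spark.")
  }
}
```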
* [SPARK-14731][SHUFFLE] Revert SPARK-12130 to make the 2.0 shuffle service compatible with 1.x (Lianhui Wang, 2016-04-25; 3 files, -6/+1)
  ## What changes were proposed in this pull request?
  SPARK-12130 made the 2.0 shuffle service incompatible with 1.x. Following the discussion at [http://apache-spark-developers-list.1001551.n3.nabble.com/YARN-Shuffle-service-and-its-compatibility-td17222.html](url), we should maintain compatibility between Spark 1.x and Spark 2.x's shuffle service. I put the string comparison into the executor's registration to avoid doing the string comparison in getBlockData every time.
  ## How was this patch tested?
  N/A
  Author: Lianhui Wang <lianhuiwang09@gmail.com>
  Closes #12568 from lianhuiwang/SPARK-14731.
* [SPARK-14636] Add minimum memory checks for drivers and executors (Peter Ableda, 2016-04-25; 1 file, -0/+16)
  ## What changes were proposed in this pull request?
  Implement the same memory size validations for the StaticMemoryManager (Legacy) as the UnifiedMemoryManager has.
  ## How was this patch tested?
  Manual tests were done in a CDH cluster.
  Test with small executor memory:
  `spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client --master yarn --executor-memory 15m --conf spark.memory.useLegacyMode=true /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples*.jar 10`
  Exception thrown:
  ```
  ERROR spark.SparkContext: Error initializing SparkContext.
  java.lang.IllegalArgumentException: Executor memory 15728640 must be at least 471859200. Please increase executor memory using the --executor-memory option or spark.executor.memory in Spark configuration.
    at org.apache.spark.memory.StaticMemoryManager$.org$apache$spark$memory$StaticMemoryManager$$getMaxExecutionMemory(StaticMemoryManager.scala:127)
    at org.apache.spark.memory.StaticMemoryManager.<init>(StaticMemoryManager.scala:46)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:352)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:289)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:462)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
  ```
  Author: Peter Ableda <peter.ableda@cloudera.com>
  Closes #12395 from peterableda/SPARK-14636.
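A simplified version of the kind of validation being added; the threshold and message below mirror the error shown above but are illustrative, not the actual StaticMemoryManager code:

```scala
object MemoryCheckSketch {
  // Reserve a floor of system memory and refuse to start below a minimum,
  // with an actionable error message. Values are assumptions for illustration.
  private val reservedSystemMemory: Long = 300L * 1024 * 1024
  private val minSystemMemory: Long = (reservedSystemMemory * 1.5).toLong

  def validateExecutorMemory(executorMemoryBytes: Long): Unit = {
    if (executorMemoryBytes < minSystemMemory) {
      throw new IllegalArgumentException(
        s"Executor memory $executorMemoryBytes must be at least $minSystemMemory. " +
          "Please increase executor memory using the --executor-memory option " +
          "or spark.executor.memory in Spark configuration.")
    }
  }
}
```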
* [SPARK-14881][PYTHON][SPARKR] pyspark and sparkR shell default log level should match spark-shell/Scala (felixcheung, 2016-04-24; 2 files, -0/+5)
  ## What changes were proposed in this pull request?
  Change the default logging level to WARN for the pyspark shell and sparkR shell, for a much cleaner environment.
  ## How was this patch tested?
  Manually running the pyspark and sparkR shells.
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #12648 from felixcheung/pylogging.
* [SPARK-14868][BUILD] Enable NewLineAtEofChecker in checkstyle and fix lint-java errors (Dongjoon Hyun, 2016-04-24; 14 files, -30/+29)
  ## What changes were proposed in this pull request?
  Spark enforces the `NewLineAtEofChecker` rule in Scala via ScalaStyle, and most Java code also complies with the rule. This PR enforces the same rule, `NewlineAtEndOfFile`, explicitly via CheckStyle. It also fixes the lint-java errors accumulated since SPARK-14465. The items are:
  - Adds a new line at the end of the files (19 files)
  - Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)
  ## How was this patch tested?
  After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins does not run lint-java.)
  ```bash
  $ dev/lint-java
  Using `mvn` from path: /usr/local/bin/mvn
  Checkstyle checks passed.
  ```
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12632 from dongjoon-hyun/SPARK-14868.
* [SPARK-14873][CORE] Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object (Sean Owen, 2016-04-23; 2 files, -9/+14)
  ## What changes were proposed in this pull request?
  Java `sampleByKey` methods should accept `Map` with `java.lang.Double` values.
  ## How was this patch tested?
  Existing (updated) Jenkins tests.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #12637 from srowen/SPARK-14873.
* [SPARK-14669][SQL] Fix some SQL metrics in codegen and added more (Davies Liu, 2016-04-22; 1 file, -1/+11)
  ## What changes were proposed in this pull request?
  1. Fix the "spill size" of TungstenAggregate and Sort.
  2. Rename "data size" to "peak memory" to match the actual meaning (also consistent with task metrics).
  3. Added "data size" for ShuffleExchange and BroadcastExchange.
  4. Added some timing for Sort, Aggregate and BroadcastExchange (this requires another patch to work).
  ## How was this patch tested?
  Existing tests.
  ![metrics](https://cloud.githubusercontent.com/assets/40902/14573908/21ad2f00-030d-11e6-9e2c-c544f30039ea.png)
  Author: Davies Liu <davies@databricks.com>
  Closes #12425 from davies/fix_metrics.
* [SPARK-10001] Consolidate Signaling and SignalLogger (Reynold Xin, 2016-04-22; 3 files, -74/+55)
  ## What changes were proposed in this pull request?
  This is a follow-up to #12557, with the following changes:
  1. Fixes some of the style issues.
  2. Merges Signaling and SignalLogger into a new class called SignalUtils. It was pretty confusing to have Signaling and Signal in one file, and it was also confusing to have two classes named Signaling and one called the other.
  3. Made logging registration idempotent.
  ## How was this patch tested?
  N/A.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #12605 from rxin/SPARK-10001.
* [SPARK-6429] Implement hashCode and equals together (Joan, 2016-04-22; 10 files, -12/+38)
  ## What changes were proposed in this pull request?
  Implement some `hashCode` and `equals` methods together in order to enable the corresponding scalastyle rule. This is a first batch; I will continue to implement them, but I wanted to know your thoughts.
  Author: Joan <joan@goyeau.com>
  Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
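The scalastyle rule referenced here flags classes that override only one of equals/hashCode; the contract is that equal objects must produce equal hash codes. A generic sketch with an invented class:

```scala
// An invented class keyed by `id`: overriding equals without hashCode (or vice
// versa) breaks hash-based collections, so the scalastyle rule requires both.
class KeyedRecord(val id: String, val payload: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: KeyedRecord => this.id == that.id
    case _ => false
  }

  // Two equal records (same id) must produce the same hash code.
  override def hashCode(): Int = id.hashCode
}
```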
* [SPARK-10001][CORE] Interrupt tasks in repl with Ctrl+C (Jakob Odersky, 2016-04-21; 2 files, -28/+103)
  ## What changes were proposed in this pull request?
  Improve signal handling to allow interrupting running tasks from the REPL (with Ctrl+C). If no tasks are running or Ctrl+C is pressed twice, the signal is forwarded to the default handler, resulting in the usual termination of the application. This PR is a rewrite of, and therefore closes, #8216, as per piaozhexiu's request.
  ## How was this patch tested?
  Signal handling is not easily testable, therefore no unit tests were added. Nevertheless, the new functionality is implemented in a best-effort approach, soft-failing in case signals aren't available on a specific OS.
  Author: Jakob Odersky <jakob@odersky.com>
  Closes #12557 from jodersky/SPARK-10001-sigint.
* [HOTFIX] Fix Java 7 compilation break (Reynold Xin, 2016-04-21; 4 files, -11/+6)
* [SPARK-14724] Use radix sort for shuffles and sort operator when possible (Eric Liang, 2016-04-21; 16 files, -109/+812)
  ## What changes were proposed in this pull request?
  Spark currently uses TimSort for all in-memory sorts, including sorts done for shuffle. One low-hanging fruit is to use radix sort when possible (e.g. sorting by integer keys). This PR adds a radix sort implementation to the unsafe sort package and switches shuffles and sorts to use it when possible. The current implementation does not have special support for null values, so we cannot radix-sort `LongType`. I will address this in a follow-up PR.
  ## How was this patch tested?
  Unit tests, enabling radix sort on existing tests. Microbenchmark results:
  ```
  Running benchmark: radix sort 25000000
  Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 3.13.0-44-generic
  Intel(R) Core(TM) i7-4600U CPU 2.10GHz
  radix sort 25000000:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  -------------------------------------------------------------------------------------------
  reference TimSort key prefix array    15546 / 15859          1.6         621.9       1.0X
  reference Arrays.sort                  2416 / 2446          10.3          96.6       6.4X
  radix sort one byte                     133 / 137          188.4           5.3     117.2X
  radix sort two bytes                    255 / 258           98.2          10.2      61.1X
  radix sort eight bytes                  991 / 997           25.2          39.6      15.7X
  radix sort key prefix array            1540 / 1563          16.2          61.6      10.1X
  ```
  I also ran a mix of the supported TPCDS queries and compared TimSort vs RadixSort metrics. The overall benchmark ran ~10% faster with radix sort on. In the breakdown below, the radix-enabled sort phases averaged about 20x faster than TimSort, however sorting is only a small fraction of the overall runtime. About half of the TPCDS queries were able to take advantage of radix sort.
  ```
  TPCDS on master: 2499s real time, 8185s executor
   - 1171s in TimSort, avg 267 MB/s
  (note the /s accounting is weird here since dataSize counts the record sizes too)
  TPCDS with radix enabled: 2294s real time, 7391s executor
   - 596s in TimSort, avg 254 MB/s
   - 26s in radix sort, avg 4.2 GB/s
  ```
  cc davies rxin
  Author: Eric Liang <ekl@databricks.com>
  Closes #12490 from ericl/sort-benchmark.
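For intuition, a self-contained LSD (least-significant-digit) radix sort over non-negative Long keys, one byte per pass. This is only a sketch of the technique; Spark's implementation sorts 8-byte key prefixes in off-heap memory and handles signed values, which is omitted here:

```scala
object RadixSortSketch {
  // LSD radix sort: 8 passes, one byte per pass, each pass a stable counting sort.
  // Assumes non-negative keys; handling the sign bit (as Spark does) is omitted.
  def radixSort(keys: Array[Long]): Array[Long] = {
    var src = keys.clone()
    var dst = new Array[Long](keys.length)
    var shift = 0
    while (shift < 64) {
      // Histogram of the current byte (counts(b + 1) counts byte value b).
      val counts = new Array[Int](257)
      var i = 0
      while (i < src.length) {
        val byteVal = ((src(i) >>> shift) & 0xFF).toInt
        counts(byteVal + 1) += 1
        i += 1
      }
      // Prefix sums: counts(b) becomes the starting offset of bucket b.
      var b = 0
      while (b < 256) {
        counts(b + 1) += counts(b)
        b += 1
      }
      // Stable scatter into the destination buffer.
      i = 0
      while (i < src.length) {
        val bucket = ((src(i) >>> shift) & 0xFF).toInt
        dst(counts(bucket)) = src(i)
        counts(bucket) += 1
        i += 1
      }
      val tmp = src; src = dst; dst = tmp
      shift += 8
    }
    src // after an even number of passes the sorted data is back in `src`
  }

  def main(args: Array[String]): Unit = {
    assert(radixSort(Array(5L, 255L, 3L, 1L)).toSeq == Seq(1L, 3L, 5L, 255L))
  }
}
```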
* [SPARK-14699][CORE] Stop endpoints before closing the connections and don't stop client in Outbox (Shixiong Zhu, 2016-04-21; 3 files, -8/+31)
  ## What changes were proposed in this pull request?
  In general, `onDisconnected` is for dealing with unexpected network disconnections. When RpcEnv.shutdown is called, the disconnections are expected, so RpcEnv should not fire these events. This PR moves `dispatcher.stop()` above closing the connections so that when stopping RpcEnv, the endpoints won't receive `onDisconnected` events. In addition, Outbox should not close the client since it will be reused by others. This PR fixes that as well.
  ## How was this patch tested?
  test("SPARK-14699: RpcEnv.shutdown should not fire onDisconnected events")
  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #12481 from zsxwing/SPARK-14699.
* [SPARK-4452][CORE] Shuffle data structures can starve others on the same thread for memory (Lianhui Wang, 2016-04-21; 7 files, -46/+321)
  ## What changes were proposed in this pull request?
  #9241 implemented a mechanism to call spill() on SQL operators that support spilling when there is not enough memory for execution, but ExternalSorter and AppendOnlyMap in Spark core did not participate in it. This PR makes them benefit from #9241: now, when there is not enough memory for execution, memory can be reclaimed by spilling ExternalSorter and AppendOnlyMap in Spark core.
  ## How was this patch tested?
  Added two unit tests for it.
  Author: Lianhui Wang <lianhuiwang09@gmail.com>
  Closes #10024 from lianhuiwang/SPARK-4452-2.
* [SPARK-13988][CORE] Make replaying event logs multi-threaded in History server to ensure a single large log does not block other logs from being rendered (Parth Brahmbhatt, 2016-04-21; 1 file, -43/+49)
  ## What changes were proposed in this pull request?
  The patch makes event log processing multi-threaded.
  ## How was this patch tested?
  Existing tests pass; no new tests are needed since this is a performance improvement. I tested the patch locally by generating one big event log (big1), one small event log (small1), and another big event log (big2). Without this patch the UI does not render any app for almost 30 seconds, then big2 and small1 appear; after another 30-second delay, big1 finally shows up in the UI. With this change, small1 shows up immediately and big1 and big2 come up within 30 seconds. Locally it also displays them in the correct order in the UI.
  Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
  Closes #11800 from Parth-Brahmbhatt/SPARK-13988.
* [SPARK-14779][CORE] Corrected log message in Worker case KillExecutor (Bryan Cutler, 2016-04-21; 1 file, -1/+1)
  In o.a.s.deploy.worker.Worker.scala, when receiving a KillExecutor message from an invalid Master, fixed a typo by changing the log message to read "..attempted to kill executor..".
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #12546 from BryanCutler/worker-killexecutor-log-message.
* [SPARK-14753][CORE] Remove internal flag in Accumulable (Wenchen Fan, 2016-04-21; 13 files, -119/+96)
  ## What changes were proposed in this pull request?
  The `Accumulable.internal` flag is only used to avoid registering internal accumulators in two specific cases:
  1. `TaskMetrics.createTempShuffleReadMetrics`: the accumulators in the temp shuffle read metrics should not be registered.
  2. `TaskMetrics.fromAccumulatorUpdates`: the created task metrics are only used to post an event, so the accumulators inside them should not be registered.
  For 1, we can create a `TempShuffleReadMetrics` that doesn't create accumulators, just keeps the data, and merges it at the end. For 2, we can un-register these accumulators immediately.
  TODO: remove the `internal` flag in `AccumulableInfo` in a follow-up PR.
  ## How was this patch tested?
  Existing tests.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #12525 from cloud-fan/acc.
* [SPARK-14602][YARN] Use SparkConf to propagate the list of cached files (Marcelo Vanzin, 2016-04-20; 1 file, -0/+4)
  This change avoids using the environment to pass this information, since with many jars it's easy to hit limits on certain OSes. Instead, it encodes the information into the Spark configuration propagated to the AM.
  The first problem that needed to be solved is a chicken & egg issue: the config file is distributed using the cache, and it needs to contain information about the files that are being distributed. To solve that, the code now treats the config archive especially, and uses slightly different code to distribute it, so that only its cache path needs to be saved to the config file.
  The second problem is that the extra information would show up in the Web UI, which made the environment tab even more noisy than it already is when lots of jars are listed. This is solved by two changes: the list of cached files is now read only once in the AM, and propagated down to the ExecutorRunnable code (which actually sends the list to the NMs when starting containers). The second change is to unset those config entries after the list is read, so that the SparkContext never sees them.
  Tested with both client and cluster mode by running "run-example SparkPi". This uploads a whole lot of files when run from a build dir (instead of a distribution, where the list is cleaned up), and I verified that the configs do not show up in the UI.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #12487 from vanzin/SPARK-14602.
* [SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState and create a SparkSession class (Andrew Or, 2016-04-20; 1 file, -0/+7)
  ## What changes were proposed in this pull request?
  This PR has two main changes.
  1. Move Hive-specific methods from HiveContext to HiveSessionState, which helps the work of removing HiveContext.
  2. Create a SparkSession class, which will later be the entry point for Spark SQL users.
  ## How was this patch tested?
  Existing tests. This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485.
  Author: Andrew Or <andrew@databricks.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #12522 from yhuai/spark-session.
* [SPARK-14725][CORE] Remove HttpServer class (jerryshao, 2016-04-20; 1 file, -181/+0)
  ## What changes were proposed in this pull request?
  This proposal removes the `HttpServer` class. With internal file/jar/class transmission moved to the RPC layer, there is currently no code using `HttpServer`, so this PR proposes to remove it.
  ## How was this patch tested?
  Unit tests were verified locally.
  Author: jerryshao <sshao@hortonworks.com>
  Closes #12526 from jerryshao/SPARK-14725.
* [SPARK-8171][WEB UI] Javascript based infinite scrolling for the log page (Alex Bozarth, 2016-04-20; 5 files, -44/+175)
  Updated the log page by replacing the current pagination with a javascript-based infinite scroll solution.
  Author: Alex Bozarth <ajbozart@us.ibm.com>
  Closes #10910 from ajbozarth/spark8171.
* [SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of FileSystem.get(conf) (Liwei Lin, 2016-04-20; 1 file, -2/+2)
  ## What changes were proposed in this pull request?
  Replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`.
  ## How was this patch tested?
  N/A
  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #12450 from lw-lin/fix-fs-get.
* [SPARK-14679][UI] Fix UI DAG visualization OOM (Ryan Blue, 2016-04-20; 2 files, -0/+55)
  ## What changes were proposed in this pull request?
  The DAG visualization can cause an OOM when generating the DOT file. This happens because clusters are not correctly deduplicated by a contains check, since they use the default equals implementation. This adds a working equals implementation.
  ## How was this patch tested?
  This adds a test suite that checks the new equals implementation.
  Author: Ryan Blue <blue@apache.org>
  Closes #12437 from rdblue/SPARK-14679-fix-ui-oom.
* [SPARK-14704][CORE] Create accumulators in TaskMetrics (Wenchen Fan, 2016-04-19; 29 files, -773/+265)
  ## What changes were proposed in this pull request?
  Before this PR, we created accumulators at the driver side (and registered them), sent them to the executor side, and then created `TaskMetrics` with these accumulators at the executor side. After this PR, we create `TaskMetrics` at the driver side and send it to the executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner.
  ## How was this patch tested?
  Existing tests.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #12472 from cloud-fan/acc.
* [SPARK-12224][SPARKR] R support for JDBC source (felixcheung, 2016-04-19; 1 file, -0/+7)
  Add R API for `read.jdbc`, `write.jdbc`. Tested this quite a bit manually with different combinations of parameters. It's not clear if we could have automated tests in R for this - Scala `JDBCSuite` depends on Java H2 in-memory database. Refactored some code into util so they could be tested.
  Core's R SerDe code needs to be updated to allow access to java.util.Properties as `jobj` handle, which is required by DataFrameReader/Writer's `jdbc` method. It would be possible, though more code, to add a `sql/r/SQLUtils` helper function.
  Tested:
  ```
  # with postgresql
  ../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar

  # read.jdbc
  df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
  df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)

  # partitionColumn and numPartitions test
  df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
  a <- SparkR:::toRDD(df)
  SparkR:::getNumPartitions(a)
  [1] 4
  SparkR:::collectPartition(a, 2L)

  # defaultParallelism test
  df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
  SparkR:::getNumPartitions(a)
  [1] 2

  # predicates test
  df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
  count(df) == 1

  # write.jdbc, default save mode "error"
  irisDf <- as.DataFrame(sqlContext, iris)
  write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
  "error, already exists"
  write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
  ```
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #10480 from felixcheung/rreadjdbc.
* [SPARK-14733] Allow custom timing control in microbenchmarks (Eric Liang, 2016-04-19; 1 file, -8/+49)
  ## What changes were proposed in this pull request?
  The current benchmark framework runs a code block for several iterations and reports statistics. However there is no way to exclude per-iteration setup time from the overall results. This PR adds a timer control object passed into the closure that can be used for this purpose.
  ## How was this patch tested?
  Existing benchmark code. Also see https://github.com/apache/spark/pull/12490
  Author: Eric Liang <ekl@databricks.com>
  Closes #12502 from ericl/spark-14733.
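A sketch of the idea: pass a timer handle into the benchmarked closure so per-iteration setup can be excluded from the measured time. The names (`IterationTimer`, `runWithTimer`) are made up for illustration and are not the Benchmark API added by this PR:

```scala
object BenchmarkTimerSketch {
  // A handle the benchmarked closure uses to time only the interesting part.
  class IterationTimer {
    private var start = 0L
    private var accumulatedNanos = 0L
    def startTiming(): Unit = { start = System.nanoTime() }
    def stopTiming(): Unit = { accumulatedNanos += System.nanoTime() - start }
    def totalNanos: Long = accumulatedNanos
  }

  def runWithTimer(iterations: Int)(body: IterationTimer => Unit): Long = {
    val timer = new IterationTimer
    var i = 0
    while (i < iterations) {
      body(timer) // the closure decides which part of each iteration is timed
      i += 1
    }
    timer.totalNanos
  }

  def main(args: Array[String]): Unit = {
    // Per-iteration data generation is excluded from the measured time.
    val nanos = runWithTimer(10) { timer =>
      val data = Array.fill(100000)(scala.util.Random.nextLong())
      timer.startTiming()
      java.util.Arrays.sort(data)
      timer.stopTiming()
    }
    println(s"sort took ${nanos / 1e6} ms total")
  }
}
```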
* [SPARK-14042][CORE] Add custom coalescer support (Nezih Yigitbasi, 2016-04-19; 4 files, -54/+205)
  ## What changes were proposed in this pull request?
  This PR adds support for specifying an optional custom coalescer to the `coalesce()` method. Currently I have only added this feature to the `RDD` interface, and once we sort out the details we can proceed with adding this feature to the other APIs (`Dataset` etc.)
  ## How was this patch tested?
  Added a unit test for this functionality. /cc rxin (per our discussion on the mailing list)
  Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
  Closes #11865 from nezihyigitbasi/custom_coalesce_policy.
* [SPARK-14656][CORE] Fix Benchmark.getPorcessorName() always returning "Unknown processor" on Linux (Kazuaki Ishizaki, 2016-04-19; 1 file, -2/+2)
  ## What changes were proposed in this pull request?
  This PR makes `Benchmark.getPorcessorName()` return the correct processor name from `/proc/cpuinfo` on Linux; currently it returns `Unknown processor`. Since `Utils.executeAndGetOutput(Seq("which", "grep"))` returns `/bin/grep\n`, executing `/bin/grep\n` fails. This PR strips the `\n` at the end of the line from the result of `Utils.executeAndGetOutput()`.
  Before applying this PR:
  ```
  Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 2.6.32-504.el6.x86_64
  Unknown processor
  back-to-back filter:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  -------------------------------------------------------------------------------------------
  Dataset                                  472 / 503          21.2          47.2       1.0X
  DataFrame                                 51 /  58         198.0           5.1       9.3X
  RDD                                      189 / 211          52.8          18.9       2.5X
  ```
  After applying this PR:
  ```
  Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 2.6.32-504.el6.x86_64
  Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz
  back-to-back filter:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  -------------------------------------------------------------------------------------------
  Dataset                                  490 / 502          20.4          49.0       1.0X
  DataFrame                                 55 /  61         183.4           5.5       9.0X
  RDD                                      210 / 237          47.7          21.0       2.3X
  ```
  ## How was this patch tested?
  Ran Benchmark programs on Linux by hand.
  Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
  Closes #12411 from kiszk/SPARK-14656.
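The fix amounts to stripping the trailing newline from the command output before using it as an executable path. A rough sketch using scala.sys.process instead of Spark's Utils.executeAndGetOutput (an assumption made so the example is self-contained):

```scala
import scala.sys.process._

object WhichSketch {
  // `which grep` prints "/bin/grep\n"; using that string verbatim as a command
  // fails, so strip the trailing line ending before invoking it.
  def executableFor(tool: String): String = Seq("which", tool).!!.stripLineEnd

  def main(args: Array[String]): Unit = {
    val grep = executableFor("grep")
    println(Seq(grep, "-m", "1", "model name", "/proc/cpuinfo").!!)
  }
}
```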
* [SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture the full stacktrace (Josh Rosen, 2016-04-19; 19 files, -94/+154)
  When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread.
  This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.
  I tested this manually using https://github.com/JoshRosen/spark/commit/16b31c825197ee31a50214c6ba3c1df08148f403, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.
  /cc rxin nongli yhuai anabranch
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.
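A condensed sketch of the wrap-and-rethrow pattern (similar in spirit to what became ThreadUtils.awaitResult; the wrapper exception name here is illustrative):

```scala
import scala.concurrent.{Await, Awaitable}
import scala.concurrent.duration.Duration
import scala.util.control.NonFatal

object AwaitSketch {
  // Wrapper whose construction point is in the calling thread, so its stack
  // trace shows who was awaiting the result.
  class AwaitResultException(cause: Throwable)
    extends Exception("Exception thrown in awaited result", cause)

  def awaitResult[T](awaitable: Awaitable[T], atMost: Duration): T = {
    try {
      Await.result(awaitable, atMost)
    } catch {
      // The original exception was raised on another thread and carries that
      // thread's stack; re-throwing a new exception here adds the caller's path.
      case NonFatal(t) => throw new AwaitResultException(t)
    }
  }
}
```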
* [SPARK-13904] Add exit code parameter to exitExecutor() (tedyu, 2016-04-19; 1 file, -6/+6)
  ## What changes were proposed in this pull request?
  This PR adds an exit code parameter to exitExecutor() so that the caller can specify a different exit code.
  ## How was this patch tested?
  Existing tests. rxin hbhanawat
  Author: tedyu <yuzhihong@gmail.com>
  Closes #12457 from tedyu/master.
* [SPARK-14667] Remove HashShuffleManager (Reynold Xin, 2016-04-18; 9 files, -477/+7)
  ## What changes were proposed in this pull request?
  The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.
  ## How was this patch tested?
  Removed some tests related to the old manager.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #12423 from rxin/SPARK-14667.