path: root/core
Commit message | Author | Age | Files | Lines
* [SPARK-14868][BUILD] Enable NewLineAtEofChecker in checkstyle and fix lint-java errors (Dongjoon Hyun, 2016-04-24, 14 files, -30/+29)

## What changes were proposed in this pull request?
Spark uses the `NewLineAtEofChecker` rule for Scala via ScalaStyle, and most Java code already complies with it. This PR enforces the same rule, `NewlineAtEndOfFile`, explicitly via Checkstyle. It also fixes the lint-java errors introduced since SPARK-14465:
- Adds a newline at the end of 19 files
- Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)

## How was this patch tested?
After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins does not run lint-java.)
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12632 from dongjoon-hyun/SPARK-14868.

* [SPARK-14873][CORE] Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object (Sean Owen, 2016-04-23, 2 files, -9/+14)

## What changes were proposed in this pull request?
Java `sampleByKey` methods should accept `Map` with `java.lang.Double` values.

## How was this patch tested?
Existing (updated) Jenkins tests

Author: Sean Owen <sowen@cloudera.com>
Closes #12637 from srowen/SPARK-14873.

* [SPARK-14669][SQL] Fix some SQL metrics in codegen and add more (Davies Liu, 2016-04-22, 1 file, -1/+11)

## What changes were proposed in this pull request?
1. Fix the "spill size" of TungstenAggregate and Sort.
2. Rename "data size" to "peak memory" to match the actual meaning (also consistent with task metrics).
3. Add "data size" for ShuffleExchange and BroadcastExchange.
4. Add some timing for Sort, Aggregate and BroadcastExchange (this requires another patch to work).

## How was this patch tested?
Existing tests.

![metrics](https://cloud.githubusercontent.com/assets/40902/14573908/21ad2f00-030d-11e6-9e2c-c544f30039ea.png)

Author: Davies Liu <davies@databricks.com>
Closes #12425 from davies/fix_metrics.

* [SPARK-10001] Consolidate Signaling and SignalLogger (Reynold Xin, 2016-04-22, 3 files, -74/+55)

## What changes were proposed in this pull request?
This is a follow-up to #12557, with the following changes:
1. Fixes some of the style issues.
2. Merges Signaling and SignalLogger into a new class called SignalUtils. It was pretty confusing to have Signaling and Signal in one file, and it was also confusing to have two classes named Signaling and one called the other.
3. Made logging registration idempotent.

## How was this patch tested?
N/A.

Author: Reynold Xin <rxin@databricks.com>
Closes #12605 from rxin/SPARK-10001.

* [SPARK-6429] Implement hashCode and equals together (Joan, 2016-04-22, 10 files, -12/+38)

## What changes were proposed in this pull request?
Implement some `hashCode` and `equals` together in order to enable the scalastyle rule. This is a first batch; I will continue to implement them, but I wanted to know your thoughts.

Author: Joan <joan@goyeau.com>
Closes #12157 from joan38/SPARK-6429-HashCode-Equals.

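As background for the rule this PR enables: overriding `equals` without also overriding `hashCode` (or vice versa) breaks the contract relied on by hash-based collections. A minimal sketch of the pattern, with a made-up class name (not code from this PR):

```scala
// Hypothetical example: equals and hashCode defined together so that
// equal instances always produce the same hash code.
class StageKey(val name: String, val attempt: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: StageKey => this.name == that.name && this.attempt == that.attempt
    case _ => false
  }

  // Derive the hash from exactly the same fields that equals compares.
  override def hashCode(): Int = 31 * name.hashCode + attempt
}
```
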
* [SPARK-10001][CORE] Interrupt tasks in repl with Ctrl+C (Jakob Odersky, 2016-04-21, 2 files, -28/+103)

## What changes were proposed in this pull request?
Improve signal handling to allow interrupting running tasks from the REPL (with Ctrl+C). If no tasks are running or Ctrl+C is pressed twice, the signal is forwarded to the default handler, resulting in the usual termination of the application. This PR is a rewrite of -- and therefore closes #8216 -- as per piaozhexiu's request.

## How was this patch tested?
Signal handling is not easily testable, therefore no unit tests were added. Nevertheless, the new functionality is implemented in a best-effort approach, soft-failing in case signals aren't available on a specific OS.

Author: Jakob Odersky <jakob@odersky.com>
Closes #12557 from jodersky/SPARK-10001-sigint.

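As background, JVM signal handling of this kind typically goes through the non-portable `sun.misc.Signal` API, which is why the description mentions soft-failing where signals are unavailable. A rough sketch of the idea (hypothetical helper, not this PR's code; the real implementation also forwards a second Ctrl+C to the default handler, which is omitted here):

```scala
import sun.misc.{Signal, SignalHandler}

// Hypothetical sketch: cancel running jobs when SIGINT (Ctrl+C) arrives.
// Wrapped in try/catch so it soft-fails where sun.misc.Signal is unavailable.
def installSigintHandler(cancelAllJobs: () => Unit): Unit = {
  try {
    Signal.handle(new Signal("INT"), new SignalHandler {
      override def handle(sig: Signal): Unit = cancelAllJobs()
    })
  } catch {
    case _: Throwable =>
      // Signals not supported on this platform; keep the default behavior.
  }
}
```
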
* [HOTFIX] Fix Java 7 compilation break (Reynold Xin, 2016-04-21, 4 files, -11/+6)

* [SPARK-14724] Use radix sort for shuffles and sort operator when possible (Eric Liang, 2016-04-21, 16 files, -109/+812)

## What changes were proposed in this pull request?
Spark currently uses TimSort for all in-memory sorts, including sorts done for shuffle. One low-hanging fruit is to use radix sort when possible (e.g. sorting by integer keys). This PR adds a radix sort implementation to the unsafe sort package and switches shuffles and sorts to use it when possible. The current implementation does not have special support for null values, so we cannot radix-sort `LongType`. I will address this in a follow-up PR.

## How was this patch tested?
Unit tests, enabling radix sort on existing tests. Microbenchmark results:
```
Running benchmark: radix sort 25000000
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 3.13.0-44-generic
Intel(R) Core(TM) i7-4600U CPU 2.10GHz

radix sort 25000000:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
reference TimSort key prefix array       15546 / 15859          1.6         621.9       1.0X
reference Arrays.sort                     2416 / 2446          10.3          96.6       6.4X
radix sort one byte                        133 /  137         188.4           5.3     117.2X
radix sort two bytes                       255 /  258          98.2          10.2      61.1X
radix sort eight bytes                     991 /  997          25.2          39.6      15.7X
radix sort key prefix array               1540 / 1563          16.2          61.6      10.1X
```

I also ran a mix of the supported TPCDS queries and compared TimSort vs RadixSort metrics. The overall benchmark ran ~10% faster with radix sort on. In the breakdown below, the radix-enabled sort phases averaged about 20x faster than TimSort; however, sorting is only a small fraction of the overall runtime. About half of the TPCDS queries were able to take advantage of radix sort.
```
TPCDS on master: 2499s real time, 8185s executor
 - 1171s in TimSort, avg 267 MB/s
(note the /s accounting is weird here since dataSize counts the record sizes too)

TPCDS with radix enabled: 2294s real time, 7391s executor
 - 596s in TimSort, avg 254 MB/s
 - 26s in radix sort, avg 4.2 GB/s
```

cc davies rxin

Author: Eric Liang <ekl@databricks.com>
Closes #12490 from ericl/sort-benchmark.

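To make the technique concrete, here is a minimal sketch of a least-significant-digit (LSD) radix sort on non-negative `Long` keys. This is illustrative only; Spark's actual implementation in the unsafe sort package operates on key-prefix arrays and handles signedness differently.

```scala
// LSD radix sort on non-negative Long keys, one byte (256 buckets) per pass.
def radixSort(input: Array[Long]): Array[Long] = {
  var src = input.clone()
  var dst = new Array[Long](input.length)
  for (pass <- 0 until 8) {
    val shift = pass * 8
    val counts = new Array[Int](257)
    // Count how many keys fall into each of the 256 buckets for this byte.
    src.foreach { v => counts(((v >>> shift) & 0xff).toInt + 1) += 1 }
    // Prefix sums turn the counts into starting offsets per bucket.
    for (i <- 1 until counts.length) counts(i) += counts(i - 1)
    // Stable scatter into the destination array.
    src.foreach { v =>
      val bucket = ((v >>> shift) & 0xff).toInt
      dst(counts(bucket)) = v
      counts(bucket) += 1
    }
    val tmp = src; src = dst; dst = tmp
  }
  src
}
```
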
* [SPARK-14699][CORE] Stop endpoints before closing the connections and don't stop client in Outbox (Shixiong Zhu, 2016-04-21, 3 files, -8/+31)

## What changes were proposed in this pull request?
In general, `onDisconnected` is for dealing with unexpected network disconnections. When RpcEnv.shutdown is called, the disconnections are expected, so RpcEnv should not fire these events. This PR moves `dispatcher.stop()` above closing the connections so that when stopping RpcEnv, the endpoints won't receive `onDisconnected` events. In addition, Outbox should not close the client since it will be reused by others. This PR fixes it as well.

## How was this patch tested?
test("SPARK-14699: RpcEnv.shutdown should not fire onDisconnected events")

Author: Shixiong Zhu <shixiong@databricks.com>
Closes #12481 from zsxwing/SPARK-14699.

* [SPARK-4452][CORE] Shuffle data structures can starve others on the same thread for memory (Lianhui Wang, 2016-04-21, 7 files, -46/+321)

## What changes were proposed in this pull request?
#9241 implemented a mechanism to call spill() on those SQL operators that support spilling if there is not enough memory for execution, but ExternalSorter and AppendOnlyMap in Spark core were not covered. This PR makes them benefit from #9241 as well: when there is not enough memory for execution, memory can now be reclaimed by spilling ExternalSorter and AppendOnlyMap in Spark core.

## How was this patch tested?
Added two unit tests for it.

Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes #10024 from lianhuiwang/SPARK-4452-2.

* [SPARK-13988][CORE] Make replaying event logs multi-threaded in History Server to ensure a single large log does not block other logs from being rendered (Parth Brahmbhatt, 2016-04-21, 1 file, -43/+49)

## What changes were proposed in this pull request?
The patch makes event log processing multi-threaded.

## How was this patch tested?
Existing tests pass; no new tests are needed, as this is a performance improvement. I tested the patch locally by generating one big event log (big1), one small event log (small1), and another big event log (big2). Without this patch the UI does not render any app for almost 30 seconds, then big2 and small1 appear; after another 30-second delay, big1 finally shows up. With this change, small1 shows up immediately and big1 and big2 come up within 30 seconds. Locally it also displays them in the correct order in the UI.

Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
Closes #11800 from Parth-Brahmbhatt/SPARK-13988.

* [SPARK-14779][CORE] Corrected log message in Worker case KillExecutor (Bryan Cutler, 2016-04-21, 1 file, -1/+1)

In o.a.s.deploy.worker.Worker.scala, when receiving a KillExecutor message from an invalid Master, fixed the typo by changing the log message to read "..attemped to kill executor..".

Author: Bryan Cutler <cutlerb@gmail.com>
Closes #12546 from BryanCutler/worker-killexecutor-log-message.

* [SPARK-14753][CORE] Remove internal flag in Accumulable (Wenchen Fan, 2016-04-21, 13 files, -119/+96)

## What changes were proposed in this pull request?
The `Accumulable.internal` flag is only used to avoid registering internal accumulators in 2 certain cases:
1. `TaskMetrics.createTempShuffleReadMetrics`: the accumulators in the temp shuffle read metrics should not be registered.
2. `TaskMetrics.fromAccumulatorUpdates`: the created task metrics is only used to post an event; accumulators inside it should not be registered.

For 1, we can create a `TempShuffleReadMetrics` that doesn't create accumulators, just keeps the data and merges it at the end. For 2, we can un-register these accumulators immediately.

TODO: remove the `internal` flag in `AccumulableInfo` with a follow-up PR.

## How was this patch tested?
existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #12525 from cloud-fan/acc.

* [SPARK-14602][YARN] Use SparkConf to propagate the list of cached files (Marcelo Vanzin, 2016-04-20, 1 file, -0/+4)

This change avoids using the environment to pass this information, since with many jars it's easy to hit limits on certain OSes. Instead, it encodes the information into the Spark configuration propagated to the AM.

The first problem that needed to be solved is a chicken & egg issue: the config file is distributed using the cache, and it needs to contain information about the files that are being distributed. To solve that, the code now treats the config archive specially, and uses slightly different code to distribute it, so that only its cache path needs to be saved to the config file.

The second problem is that the extra information would show up in the Web UI, which made the environment tab even more noisy than it already is when lots of jars are listed. This is solved by two changes: the list of cached files is now read only once in the AM, and propagated down to the ExecutorRunnable code (which actually sends the list to the NMs when starting containers). The second change is to unset those config entries after the list is read, so that the SparkContext never sees them.

Tested with both client and cluster mode by running "run-example SparkPi". This uploads a whole lot of files when run from a build dir (instead of a distribution, where the list is cleaned up), and I verified that the configs do not show up in the UI.

Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #12487 from vanzin/SPARK-14602.

* [SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState and create a SparkSession class (Andrew Or, 2016-04-20, 1 file, -0/+7)

## What changes were proposed in this pull request?
This PR has two main changes:
1. Move Hive-specific methods from HiveContext to HiveSessionState, which helps the work of removing HiveContext.
2. Create a SparkSession class, which will later be the entry point for Spark SQL users.

## How was this patch tested?
Existing tests. This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes #12522 from yhuai/spark-session.

* [SPARK-14725][CORE] Remove HttpServer class (jerryshao, 2016-04-20, 1 file, -181/+0)

## What changes were proposed in this pull request?
This proposal removes the class `HttpServer`. With internal file/jar/class transmission moved to the RPC layer, there is currently no code using `HttpServer`, so this PR proposes to remove it.

## How was this patch tested?
Unit test is verified locally.

Author: jerryshao <sshao@hortonworks.com>
Closes #12526 from jerryshao/SPARK-14725.

* [SPARK-8171][WEB UI] Javascript based infinite scrolling for the log page (Alex Bozarth, 2016-04-20, 5 files, -44/+175)

Updated the log page by replacing the current pagination with a javascript-based infinite scroll solution.

Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes #10910 from ajbozarth/spark8171.

* [SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of FileSystem.get(conf) (Liwei Lin, 2016-04-20, 1 file, -2/+2)

## What changes were proposed in this pull request?
Replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)`.

## How was this patch tested?
N/A

Author: Liwei Lin <lwlin7@gmail.com>
Closes #12450 from lw-lin/fix-fs-get.

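The distinction matters because `FileSystem.get(conf)` resolves against the default filesystem (`fs.defaultFS`), which is wrong whenever the path lives on a different filesystem, while `path.getFileSystem(conf)` resolves from the path's own URI. A small sketch using the standard Hadoop API (the S3 path is a made-up example):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val path = new Path("s3a://my-bucket/data/part-00000")  // hypothetical path

// Resolves against fs.defaultFS (often HDFS), regardless of where `path` points.
val defaultFs = FileSystem.get(conf)

// Resolves the filesystem from the path's own scheme and authority.
val pathFs = path.getFileSystem(conf)
val status = pathFs.getFileStatus(path)
```
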
* [SPARK-14679][UI] Fix UI DAG visualization OOM (Ryan Blue, 2016-04-20, 2 files, -0/+55)

## What changes were proposed in this pull request?
The DAG visualization can cause an OOM when generating the DOT file. This happens because clusters are not correctly deduped by a contains check, because they use the default equals implementation. This adds a working equals implementation.

## How was this patch tested?
This adds a test suite that checks the new equals implementation.

Author: Ryan Blue <blue@apache.org>
Closes #12437 from rdblue/SPARK-14679-fix-ui-oom.

* [SPARK-14704][CORE] Create accumulators in TaskMetrics (Wenchen Fan, 2016-04-19, 29 files, -773/+265)

## What changes were proposed in this pull request?
Before this PR, we created accumulators at the driver side (and registered them) and sent them to the executor side, then created `TaskMetrics` with these accumulators at the executor side. After this PR, we create `TaskMetrics` at the driver side and send it to the executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner.

## How was this patch tested?
existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #12472 from cloud-fan/acc.

* [SPARK-12224][SPARKR] R support for JDBC source (felixcheung, 2016-04-19, 1 file, -0/+7)

Add an R API for `read.jdbc` and `write.jdbc`.

Tested this quite a bit manually with different combinations of parameters. It's not clear if we could have automated tests in R for this - the Scala `JDBCSuite` depends on the Java H2 in-memory database. Refactored some code into util so it could be tested.

Core's R SerDe code needs to be updated to allow access to java.util.Properties as a `jobj` handle, which is required by DataFrameReader/Writer's `jdbc` method. It would be possible, though more code, to add a `sql/r/SQLUtils` helper function.

Tested:
```
# with postgresql
../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar

# read.jdbc
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)

# partitionColumn and numPartitions test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 4
SparkR:::collectPartition(a, 2L)

# defaultParallelism test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
SparkR:::getNumPartitions(a)
[1] 2

# predicates test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
count(df) == 1

# write.jdbc, default save mode "error"
irisDf <- as.DataFrame(sqlContext, iris)
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
"error, already exists"

write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
```

Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10480 from felixcheung/rreadjdbc.

* [SPARK-14733] Allow custom timing control in microbenchmarks (Eric Liang, 2016-04-19, 1 file, -8/+49)

## What changes were proposed in this pull request?
The current benchmark framework runs a code block for several iterations and reports statistics. However, there is no way to exclude per-iteration setup time from the overall results. This PR adds a timer control object, passed into the closure, that can be used for this purpose.

## How was this patch tested?
Existing benchmark code. Also see https://github.com/apache/spark/pull/12490

Author: Eric Liang <ekl@databricks.com>
Closes #12502 from ericl/spark-14733.

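To illustrate the idea of a per-iteration timer control object, here is a simplified sketch; it is not Spark's actual `Benchmark` API, and all names are assumptions:

```scala
// Hypothetical timer handed to each benchmark iteration so that
// per-iteration setup can be excluded from the measured time.
class IterationTimer {
  private var accumulated = 0L
  private var start = 0L
  def startTiming(): Unit = { start = System.nanoTime() }
  def stopTiming(): Unit = { accumulated += System.nanoTime() - start }
  def totalNanos: Long = accumulated
}

def runCase(iterations: Int)(body: IterationTimer => Unit): Long = {
  val timer = new IterationTimer
  (1 to iterations).foreach(_ => body(timer))
  timer.totalNanos
}

// Usage: work done outside startTiming/stopTiming is not counted.
val nanos = runCase(iterations = 5) { timer =>
  val input = Array.tabulate(1000000)(i => (i * 31) % 997)  // setup, untimed
  timer.startTiming()
  java.util.Arrays.sort(input)                              // measured work
  timer.stopTiming()
}
```
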
* [SPARK-14042][CORE] Add custom coalescer support (Nezih Yigitbasi, 2016-04-19, 4 files, -54/+205)

## What changes were proposed in this pull request?
This PR adds support for specifying an optional custom coalescer to the `coalesce()` method. Currently I have only added this feature to the `RDD` interface, and once we sort out the details we can proceed with adding this feature to the other APIs (`Dataset` etc.)

## How was this patch tested?
Added a unit test for this functionality.

/cc rxin (per our discussion on the mailing list)

Author: Nezih Yigitbasi <nyigitbasi@netflix.com>
Closes #11865 from nezihyigitbasi/custom_coalesce_policy.

* [SPARK-14656][CORE] Fix Benchmark.getProcessorName() always returning "Unknown processor" on Linux (Kazuaki Ishizaki, 2016-04-19, 1 file, -2/+2)

## What changes were proposed in this pull request?
This PR makes `Benchmark.getProcessorName()` return the correct processor name from `/proc/cpuinfo` on Linux; previously it returned `Unknown processor`. Since `Utils.executeAndGetOutput(Seq("which", "grep"))` returns `/bin/grep\n`, executing `/bin/grep\n` fails. This PR strips the `\n` at the end of the line of the result of `Utils.executeAndGetOutput()`.

Before applying this PR:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 2.6.32-504.el6.x86_64
Unknown processor
back-to-back filter:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
Dataset                                  472 /  503         21.2          47.2       1.0X
DataFrame                                 51 /   58        198.0           5.1       9.3X
RDD                                      189 /  211         52.8          18.9       2.5X
```

After applying this PR:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_66-b17 on Linux 2.6.32-504.el6.x86_64
Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz
back-to-back filter:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
Dataset                                  490 /  502         20.4          49.0       1.0X
DataFrame                                 55 /   61        183.4           5.5       9.0X
RDD                                      210 /  237         47.7          21.0       2.3X
```

## How was this patch tested?
Ran Benchmark programs on Linux by hand.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #12411 from kiszk/SPARK-14656.

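The underlying fix is simply trimming the trailing newline before using the command output as an executable path. A tiny illustration with a hypothetical helper (not the actual `Utils` code):

```scala
import scala.sys.process._

// Hypothetical helper: run a command and return its stdout without the
// trailing newline, so the result can be used directly as an executable path.
def executeAndGetOutput(cmd: Seq[String]): String =
  cmd.!!.stripLineEnd

val grepPath = executeAndGetOutput(Seq("which", "grep"))  // "/bin/grep", not "/bin/grep\n"
```
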
* [SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture full stacktrace (Josh Rosen, 2016-04-19, 19 files, -94/+154)

When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it, but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread.

This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.

I tested this manually using https://github.com/JoshRosen/spark/commit/16b31c825197ee31a50214c6ba3c1df08148f403, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.

/cc rxin nongli yhuai anabranch

Author: Josh Rosen <joshrosen@databricks.com>
Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.

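The wrap-and-rethrow pattern looks roughly like the sketch below (illustrative, not Spark's actual helper): the wrapping exception is constructed on the calling thread, so its stack trace records the code path that performed the await, while the original exception is preserved as the cause.

```scala
import scala.concurrent.{Await, Awaitable}
import scala.concurrent.duration.Duration
import scala.util.control.NonFatal

// Await a result, but re-throw any failure wrapped in a new exception so the
// caller's own stack trace is captured alongside the original cause.
def awaitResult[T](awaitable: Awaitable[T], atMost: Duration): T = {
  try {
    Await.result(awaitable, atMost)
  } catch {
    case NonFatal(t) =>
      throw new RuntimeException("Exception thrown in awaitResult", t)
  }
}
```
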
* [SPARK-13904] Add exit code parameter to exitExecutor() (tedyu, 2016-04-19, 1 file, -6/+6)

## What changes were proposed in this pull request?
This PR adds an exit code parameter to exitExecutor() so that the caller can specify a different exit code.

## How was this patch tested?
Existing tests

rxin hbhanawat

Author: tedyu <yuzhihong@gmail.com>
Closes #12457 from tedyu/master.

* [SPARK-14667] Remove HashShuffleManager (Reynold Xin, 2016-04-18, 9 files, -477/+7)

## What changes were proposed in this pull request?
The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.

## How was this patch tested?
Removed some tests related to the old manager.

Author: Reynold Xin <rxin@databricks.com>
Closes #12423 from rxin/SPARK-14667.

* [SPARK-13227] Risky apply() in OpenHashMap (CodingCat, 2016-04-18, 1 file, -0/+3)

https://issues.apache.org/jira/browse/SPARK-13227

It might confuse future developers when they use OpenHashMap.apply() with a numeric value type: null.asInstanceOf[Int], null.asInstanceOf[Long], null.asInstanceOf[Float] and null.asInstanceOf[Double] return 0/0.0/0L, which might confuse the developer if the value set contains 0/0.0/0L for an existing key.

The current patch only adds comments describing the issue, in order to apply the minimum changes to the code base. The more direct, yet more aggressive, approach is to use Option as the return type.

andrewor14 JoshRosen any thoughts about how to avoid the potential issue?

Author: CodingCat <zhunansjtu@gmail.com>
Closes #11107 from CodingCat/SPARK-13227.

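A small illustration of the ambiguity described above, using a plain Scala `Map` instead of `OpenHashMap`:

```scala
// Casting null to a primitive type yields that type's zero value,
// so a "missing key" sentinel of 0 is indistinguishable from a real 0.
val missingInt = null.asInstanceOf[Int]    // 0
val missingLong = null.asInstanceOf[Long]  // 0L

// Hypothetical map with Int values: both lookups below return 0.
val counts = Map("a" -> 0)
val fromPresentKey = counts.getOrElse("a", null.asInstanceOf[Int])  // 0 (the stored value)
val fromMissingKey = counts.getOrElse("b", null.asInstanceOf[Int])  // 0 (the sentinel)
// The two results cannot be told apart, which is the risk the comments document.
```
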
* [SPARK-14628][CORE][FOLLOW-UP] Always tracking read/write metrics (Wenchen Fan, 2016-04-18, 39 files, -391/+2576)

## What changes were proposed in this pull request?
This PR is a follow-up for https://github.com/apache/spark/pull/12417; now we always track input/output/shuffle metrics in the Spark JSON protocol and status API. Most of the line changes come from re-generating the golden answer for `HistoryServerSuite`, where we add a lot of 0 values for read/write metrics.

## How was this patch tested?
existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #12462 from cloud-fan/follow.

* [SPARK-14713][TESTS] Fix the flaky test NettyBlockTransferServiceSuite (Shixiong Zhu, 2016-04-18, 1 file, -8/+17)

## What changes were proposed in this pull request?
When there are multiple tests running, "NettyBlockTransferServiceSuite.can bind to a specific port twice and the second increments" may fail. E.g., assume there are 2 tests running. Here is the execution order that reproduces the test failure:

| Execution Order | Test 1 | Test 2 |
| ------------- | ------------- | ------------- |
| 1 | service0 binds to 17634 | |
| 2 | | service0 binds to 17635 (17634 is occupied) |
| 3 | service1 binds to 17636 | |
| 4 | pass test | |
| 5 | service0.close (release 17634) | |
| 6 | | service1 binds to 17634 |
| 7 | | `service1.port should be (service0.port + 1)` fails (17634 != 17635 + 1) |

Here is an example in Jenkins: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/786/testReport/junit/org.apache.spark.network.netty/NettyBlockTransferServiceSuite/can_bind_to_a_specific_port_twice_and_the_second_increments/

This PR makes two changes:
- Use a random port between 17634 and 27634 to reduce the possibility of port conflicts.
- Make `service1` use `service0.port` to bind, to avoid the above race condition.

## How was this patch tested?
Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Closes #12477 from zsxwing/SPARK-14713.

* Mark ExternalClusterManager as private[spark]. (Reynold Xin, 2016-04-16, 1 file, -4/+1)

* [SPARK-13904][SCHEDULER] Add support for pluggable cluster manager (Hemant Bhanawat, 2016-04-16, 6 files, -8/+192)

## What changes were proposed in this pull request?
This commit adds support for a pluggable cluster manager, and also allows a cluster manager to clean up tasks without taking the parent process down.

To plug in a new external cluster manager, the ExternalClusterManager trait should be implemented. It returns the task scheduler and backend scheduler that will be used by SparkContext to schedule tasks. An external cluster manager is registered using the java.util.ServiceLoader mechanism (this mechanism is also used to register data sources like parquet, json, jdbc etc.). This allows auto-loading of implementations of the ExternalClusterManager interface.

Currently, when a driver fails, executors exit using System.exit. This does not bode well for cluster managers that would like to reuse the parent process of an executor. Hence:
1. Moved System.exit to a function that can be overridden in subclasses of CoarseGrainedExecutorBackend.
2. Added functionality for killing all the running tasks in an executor.

## How was this patch tested?
ExternalClusterManagerSuite.scala was added to test this patch.

Author: Hemant Bhanawat <hemant@snappydata.io>
Closes #11723 from hbhanawat/pluggableScheduler.

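For readers unfamiliar with the `java.util.ServiceLoader` mechanism mentioned above, registration works by shipping a provider-configuration file that names the implementation class. A rough sketch with simplified, made-up names (not the exact `ExternalClusterManager` API):

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Simplified stand-in for the real trait; the member is illustrative only.
trait ClusterManagerPlugin {
  def canCreate(masterURL: String): Boolean
}

class MyClusterManager extends ClusterManagerPlugin {
  override def canCreate(masterURL: String): Boolean = masterURL.startsWith("mycluster://")
}

// The jar must also ship a provider-configuration file, e.g.
//   META-INF/services/com.example.ClusterManagerPlugin
// containing the line:
//   com.example.MyClusterManager

// ServiceLoader then discovers every registered implementation at runtime.
val managers = ServiceLoader.load(classOf[ClusterManagerPlugin]).asScala
val matching = managers.filter(_.canCreate("mycluster://host:7077"))
```
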
* [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations (hyukjinkwon, 2016-04-16, 7 files, -24/+23)

## What changes were proposed in this pull request?
This PR removes:

- Inappropriate type notations. For example, from
```scala
words.foreachRDD { (rdd: RDD[String], time: Time) =>
  ...
```
to
```scala
words.foreachRDD { (rdd, time) =>
  ...
```

- Extra anonymous closures within functional transformations. For example,
```scala
.map(item => {
  ...
})
```
which can be written simply as below:
```scala
.map { item =>
  ...
}
```

It also corrects some obvious style nits.

## How was this patch tested?
This was tested after adding rules in `scalastyle-config.xml`, which ended up not catching everything perfectly. The rules applied were the following:

- For the first correction,
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
  <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
</check>
```
```xml
<check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
  <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
</check>
```

- For the second correction,
```xml
<check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
  <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
</check>
```

**Those rules were not added.**

Author: hyukjinkwon <gurwls223@gmail.com>
Closes #12413 from HyukjinKwon/SPARK-style.

* [SPARK-14628][CORE] Simplify task metrics by always tracking read/write metrics (Reynold Xin, 2016-04-15, 33 files, -609/+326)

## What changes were proposed in this pull request?
Part of the reason why TaskMetrics and its callers are complicated is the optional metrics we collect, including input, output, shuffle read, and shuffle write. I think we can always track them and just assign 0 as the initial values. It is usually very obvious whether a task is supposed to read any data or not. By always tracking them, we can remove a lot of map, foreach, flatMap, getOrElse(0L) calls throughout Spark.

This patch also changes a few behaviors:
1. Removed the distinction of data read/write methods (e.g. Hadoop, Memory, Network, etc).
2. Accumulate all data reads and writes, rather than only the first method. (Fixes SPARK-5225)

## How was this patch tested?
Existing tests. This is based on https://github.com/apache/spark/pull/12388, with more test fixes.

Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes #12417 from cloud-fan/metrics-refactor.

* [SPARK-14633] Use more readable format to show memory bytes in Error Message (Peter Ableda, 2016-04-15, 1 file, -1/+1)

## What changes were proposed in this pull request?
Round the memory bytes value and convert it to Long, its original type. This change fixes the formatting issue in the exception message.

## How was this patch tested?
Manual tests were done in a CDH cluster.

Author: Peter Ableda <peter.ableda@cloudera.com>
Closes #12392 from peterableda/SPARK-14633.

* [SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly (Mark Grover, 2016-04-14, 2 files, -2/+3)

## What changes were proposed in this pull request?
Removing references to the assembly jar in documentation. Adding an additional (previously undocumented) usage of spark-submit to run examples.

## How was this patch tested?
Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.

Author: Mark Grover <mark@apache.org>
Closes #12365 from markgrover/spark-14601.

* [SPARK-14558][CORE] In ClosureCleaner, clean the outer pointer if it's a REPL line object (Wenchen Fan, 2016-04-14, 1 file, -30/+23)

## What changes were proposed in this pull request?
When we clean a closure, if its outermost parent is not a closure, we won't clone and clean it, as cloning a user's objects is dangerous. However, if it's a REPL line object, which may carry a lot of unnecessary references (like hadoop conf, spark conf, etc.), we should clean it, as it's not a user object. This PR improves the check for user objects to exclude REPL line objects.

## How was this patch tested?
existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #12327 from cloud-fan/closure.

* [SPARK-14617] Remove deprecated APIs in TaskMetrics (Reynold Xin, 2016-04-14, 10 files, -96/+36)

## What changes were proposed in this pull request?
This patch removes some of the deprecated APIs in TaskMetrics. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?
N/A - only removals

Author: Reynold Xin <rxin@databricks.com>
Closes #12375 from rxin/SPARK-14617.

* [SPARK-14619] Track internal accumulators (metrics) by stage attempt (Reynold Xin, 2016-04-14, 9 files, -37/+25)

## What changes were proposed in this pull request?
When there are multiple attempts for a stage, we currently only reset internal accumulator values if all the tasks are resubmitted. It would make more sense to reset the accumulator values for each stage attempt. This will allow us to eventually get rid of the internal flag in the Accumulator class. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?
Covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>
Closes #12378 from rxin/SPARK-14619.

* [SPARK-14612][ML] Consolidate the version of dependencies in mllib and mllib-local into one place (Sean Owen, 2016-04-14, 1 file, -1/+0)

## What changes were proposed in this pull request?
Move the json4s and breeze dependency declarations into the parent POM.

## How was this patch tested?
Should be no functional change, but Jenkins tests will verify that.

Author: Sean Owen <sowen@cloudera.com>
Closes #12390 from srowen/SPARK-14612.

* [SPARK-14630][BUILD][CORE][SQL][STREAMING] Code style: public abstract methods should have explicit return types (Liwei Lin, 2016-04-14, 16 files, -26/+27)

## What changes were proposed in this pull request?
Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):
```scala
def start() // should be: def start(): Unit
def stop()  // should be: def stop(): Unit
```
These methods exist in core, sql, and streaming; this PR fixes them.

## How was this patch tested?
N/A

## Which piece of scala style rule led to the changes?
The rule was added separately in https://github.com/apache/spark/pull/12396

Author: Liwei Lin <lwlin7@gmail.com>
Closes #12389 from lw-lin/public-abstract-methods.

* [SPARK-14625] TaskUIData and ExecutorUIData shouldn't be case classes (Reynold Xin, 2016-04-14, 6 files, -57/+58)

## What changes were proposed in this pull request?
I was trying to understand the accumulator and metrics update source code, and these two classes don't really need to be case classes. It would also be more consistent with other UI classes if they are not case classes. This is part of my bigger effort to simplify accumulators and task metrics.

## How was this patch tested?
This is a straightforward refactoring without behavior change.

Author: Reynold Xin <rxin@databricks.com>
Closes #12386 from rxin/SPARK-14625.

* [MINOR][SQL] Remove extra anonymous closure within functional transformations (hyukjinkwon, 2016-04-14, 9 files, -18/+18)

## What changes were proposed in this pull request?
This PR removes extra anonymous closures within functional transformations. For example,
```scala
.map(item => {
  ...
})
```
which can be written simply as below:
```scala
.map { item =>
  ...
}
```

## How was this patch tested?
Related unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>
Closes #12382 from HyukjinKwon/minor-extra-closers.

* [SPARK-14596][SQL] Remove unused SqlNewHadoopRDD and some more unused imports (hyukjinkwon, 2016-04-14, 2 files, -8/+6)

## What changes were proposed in this pull request?
The old `HadoopFsRelation` API includes `buildInternalScan()`, which uses `SqlNewHadoopRDD` in `ParquetRelation`. Because the old API is now removed, `SqlNewHadoopRDD` is not used anymore. So, this PR removes `SqlNewHadoopRDD` and several unused imports. This was discussed in https://github.com/apache/spark/pull/12326.

## How was this patch tested?
Several related existing unit tests and `sbt scalastyle`.

Author: hyukjinkwon <gurwls223@gmail.com>
Closes #12354 from HyukjinKwon/SPARK-14596.

* [SPARK-14537][CORE] Make TaskSchedulerImpl waiting fail if context is shut down (Charles Allen, 2016-04-13, 1 file, -0/+5)

This patch makes the postStartHook throw an IllegalStateException if the SparkContext is shut down while it is waiting for the backend to be ready.

Author: Charles Allen <charles@allen-net.com>
Closes #12301 from drcrallen/SPARK-14537.

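The pattern is roughly the following sketch (hypothetical names, not the actual `TaskSchedulerImpl` code): the wait loop checks a stopped flag and fails fast instead of waiting forever on a context that will never become ready.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical startup hook: wait for the backend to become ready, but fail
// fast with IllegalStateException if the context is stopped in the meantime.
def waitUntilReady(backendReady: () => Boolean, stopped: AtomicBoolean): Unit = {
  while (!backendReady()) {
    if (stopped.get()) {
      throw new IllegalStateException("Spark context stopped while waiting for backend")
    }
    Thread.sleep(100)  // poll interval; the real code may use a different mechanism
  }
}
```
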
* [SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java API and Python API (Liwei Lin, 2016-04-12, 1 file, -1/+1)

## What changes were proposed in this pull request?
- Updated `OFF_HEAP` semantics for `StorageLevels.java`
- Updated `OFF_HEAP` semantics for `storagelevel.py`

## How was this patch tested?
No need to test.

Author: Liwei Lin <lwlin7@gmail.com>
Closes #12126 from lw-lin/storagelevel.py.

* [SPARK-14363] Fix executor OOM due to memory leak in the Sorter (Sital Kedia, 2016-04-12, 4 files, -4/+23)

## What changes were proposed in this pull request?
Fix a memory leak in the Sorter. When the UnsafeExternalSorter spills the data to disk, it does not free up the underlying pointer array. As a result, we see a lot of executor OOMs and also memory under-utilization. This is a regression partially introduced in PR https://github.com/apache/spark/pull/9241.

## How was this patch tested?
Tested by running a job and observing around 30% speedup after this change.

Author: Sital Kedia <skedia@fb.com>
Closes #12285 from sitalkedia/executor_oom.

* [SPARK-14544][SQL] Improve performance of SQL UI tab (Davies Liu, 2016-04-12, 1 file, -3/+5)

## What changes were proposed in this pull request?
This PR improves the performance of the SQL UI by:
1) Removing the details column in the all-executions page (the first page in the SQL tab). We can check the details by entering the execution page.
2) Switching from break-all to break-word, since break-all is super slow in Chrome recently.
3) Using "display: none" to hide a block.
4) Using one js closure for all the executions, not one for each.
5) Removing the height limitation of details; there is no need to scroll it in the tiny window.

## How was this patch tested?
Existing tests.

![ui](https://cloud.githubusercontent.com/assets/40902/14445712/68d7b258-0004-11e6-9b48-5d329b05d165.png)

Author: Davies Liu <davies@databricks.com>
Closes #12311 from davies/ui_perf.

* [SPARK-14513][CORE] Fix threads left behind after stopping SparkContext (Terence Yim, 2016-04-12, 3 files, -2/+21)

## What changes were proposed in this pull request?
Shut down the `QueuedThreadPool` used by the Jetty `Server` to avoid thread leakage after the SparkContext is stopped.

Note: if this fix is going to be applied to `branch-1.6`, one more patch on the `NettyRpcEnv` class is needed so that `NettyRpcEnv._fileServer.shutdown` is called in the `NettyRpcEnv.cleanup` method. This is due to the removal of the `_fileServer` field in the `NettyRpcEnv` class in the master branch. Please advise if a second PR is necessary for bringing this fix back to `branch-1.6`.

## How was this patch tested?
Ran ./dev/run-tests locally.

Author: Terence Yim <terence@cask.co>
Closes #12318 from chtyim/fixes/SPARK-14513-thread-leak.

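A sketch of the fix's shape, assuming Jetty 9-style APIs (not the actual Spark UI code): stopping the `Server` alone can leave the `QueuedThreadPool` worker threads alive, so the pool is stopped explicitly as well.

```scala
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.util.thread.QueuedThreadPool

// Hypothetical illustration of the leak and its fix.
val pool = new QueuedThreadPool(8)
val server = new Server(pool)
server.start()

// ... serve the web UI ...

server.stop()
// Without this, the pool's worker threads may linger after SparkContext.stop().
pool.stop()
```
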
* [SPARK-14508][BUILD] Add a new ScalaStyle Rule `OmitBracesInCase` (Dongjoon Hyun, 2016-04-12, 35 files, -141/+74)

## What changes were proposed in this pull request?
According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and the [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule:
```
case: Always omit braces in case clauses.
```
This PR adds a new ScalaStyle rule, 'OmitBracesInCase', and enforces it on the code.

## How was this patch tested?
Pass the Jenkins tests (including Scala style checking).

Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12280 from dongjoon-hyun/SPARK-14508.

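For illustration, a before/after example of the case-clause style the rule enforces (not code from the PR itself):

```scala
// Flagged by the rule: braces around the case body are redundant.
def describe(opt: Option[Int]): String = opt match {
  case Some(n) => {
    s"got $n"
  }
  case None => "empty"
}

// Preferred: omit the braces; the arrow and indentation already delimit the body.
def describeClean(opt: Option[Int]): String = opt match {
  case Some(n) =>
    s"got $n"
  case None => "empty"
}
```
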