path: root/streaming
...
* [SPARK-16395][STREAMING] Fail if too many CheckpointWriteHandlers are queued up in the fixed thread pool (Sean Owen, 2016-07-19, 1 file, -3/+10)

  ## What changes were proposed in this pull request?
  Begin failing if checkpoint writes will likely not keep up with storage's ability to write them, to fail fast instead of slowly filling memory.

  ## How was this patch tested?
  Jenkins tests

  Author: Sean Owen <sowen@cloudera.com>
  Closes #14152 from srowen/SPARK-16395.

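  A minimal sketch of the fail-fast idea described above; the queue capacity is an assumption, not the patch's actual limit:

  ```scala
  import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

  // One writer thread with a bounded work queue. The default AbortPolicy
  // throws RejectedExecutionException once the queue fills, so a checkpoint
  // write that cannot be queued fails immediately instead of accumulating
  // in memory faster than storage can drain it.
  val checkpointExecutor = new ThreadPoolExecutor(
    1, 1,                                   // corePoolSize = maximumPoolSize = 1
    0L, TimeUnit.MILLISECONDS,              // no keep-alive needed for a fixed pool
    new ArrayBlockingQueue[Runnable](1000)) // bounded backlog of pending writes
  ```
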
* [SPARK-16535][BUILD] In pom.xml, remove groupId definitions that are redundant and inherited from the parent (Xin Ren, 2016-07-19, 1 file, -1/+0)

  https://issues.apache.org/jira/browse/SPARK-16535

  ## What changes were proposed in this pull request?
  When scanning through the pom.xml of the sub-projects, I found this warning (screenshot attached):

  ```
  Definition of groupId is redundant, because it's inherited from the parent
  ```

  ![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)

  I removed some of the lines with a groupId definition, and the build on my local machine is still ok:

  ```
  <groupId>org.apache.spark</groupId>
  ```

  Spark 2.x now uses `<maven.version>3.3.9</maven.version>`, and Maven 3 supports versionless parent elements: Maven 3 removes the need to specify the parent version in sub-modules (since Maven 3.1). Ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762

  ## How was this patch tested?
  Tested by re-building the project; the build succeeded.

  Author: Xin Ren <iamshrek@126.com>
  Closes #14189 from keypointt/SPARK-16535.

* [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT (Reynold Xin, 2016-07-11, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.

  ## How was this patch tested?
  N/A

  Author: Reynold Xin <rxin@databricks.com>
  Closes #14130 from rxin/SPARK-16477.

* [SPARK-16129][CORE][SQL] Eliminate direct use of commons-lang classes in favor of commons-lang3 (Sean Owen, 2016-06-24, 2 files, -7/+4)

  ## What changes were proposed in this pull request?
  Replace use of `commons-lang` in favor of `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `commons-lang` in favor of JDK `UnsupportedOperationException`.

  ## How was this patch tested?
  Jenkins tests

  Author: Sean Owen <sowen@cloudera.com>
  Closes #13843 from srowen/SPARK-16129.

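  An illustrative sketch of the migration (the utility call is an arbitrary example):

  ```scala
  // Same utility class under the new package
  import org.apache.commons.lang3.StringUtils // was: org.apache.commons.lang.StringUtils

  def shortName(s: String): String = StringUtils.abbreviate(s, 16)

  // JDK exception replacing commons-lang's NotImplementedException
  def notImplementedYet(): Nothing =
    throw new UnsupportedOperationException("not implemented")
  ```
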
* [SPARK-16120][STREAMING] getCurrentLogFiles in ReceiverSuite WAL generating and cleaning case uses external variable instead of the passed parameter (Ahmed Mahran, 2016-06-22, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  In `ReceiverSuite.scala`, in the test case "write ahead log - generating and cleaning", the inner method `getCurrentLogFiles` uses the external variable `logDirectory1` instead of the passed parameter `logDirectory`. This PR fixes this by using the passed method argument instead of the variable from the outer scope.

  ## How was this patch tested?
  The unit test was re-run and the output logs were checked for the correct paths used.

  tdas

  Author: Ahmed Mahran <ahmed.mahran@mashin.io>
  Closes #13825 from ahmed-mahran/b-receiver-suite-wal-gen-cln.

* [SPARK-15086][CORE][STREAMING] Deprecate old Java accumulator API (Sean Owen, 2016-06-12, 1 file, -3/+3)

  ## What changes were proposed in this pull request?
  - Deprecate old Java accumulator API; should use Scala now
  - Update Java tests and examples
  - Don't bother testing old accumulator API in Java 8 (too)
  - (fix a misspelling too)

  ## How was this patch tested?
  Jenkins tests

  Author: Sean Owen <sowen@cloudera.com>
  Closes #13606 from srowen/SPARK-15086.

* [SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0 (wangyang, 2016-06-10, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  In Scala, immutable.List.length is an expensive operation, so we should avoid using `Seq.length == 0` or `Seq.length > 0` and use `Seq.isEmpty` and `Seq.nonEmpty` instead.

  ## How was this patch tested?
  Existing tests

  Author: wangyang <wangyang@haizhi.com>
  Closes #13601 from yangw1234/isEmpty.

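  The difference in a nutshell:

  ```scala
  val xs: Seq[Int] = List(1, 2, 3)

  // Before: length traverses the entire list, so the check is O(n)
  if (xs.length > 0) println("non-empty")

  // After: isEmpty/nonEmpty inspect only the head, so the check is O(1)
  if (xs.nonEmpty) println("non-empty")
  ```
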
* [MINOR][X][X] Replace all occurrences of None: Option with Option.empty (Sandeep Singh, 2016-06-10, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  Replace all occurrences of `None: Option[X]` with `Option.empty[X]`.

  ## How was this patch tested?
  Existing tests

  Author: Sandeep Singh <sandeep@techaddict.me>
  Closes #13591 from techaddict/minor-7.

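  The replaced pattern and its equivalent:

  ```scala
  // Before: a bare None with a type ascription
  val a = None: Option[String]

  // After: the equivalent, more readable factory method
  val b = Option.empty[String]
  ```
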
* [MINOR] Fix Typos 'an -> a' (Zheng RuiFeng, 2016-06-06, 7 files, -10/+10)

  ## What changes were proposed in this pull request?
  `an -> a`. Used commands like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, then reviewed them one by one.

  ## How was this patch tested?
  Manual tests

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13515 from zhengruifeng/an_a.

* [SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics (Josh Rosen, 2016-06-05, 1 file, -2/+2)

  `PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns. This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #13491 from JoshRosen/foldleft-to-flatmap.

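  A minimal illustration of the pattern being replaced:

  ```scala
  val groups = Seq(List(1, 2), List(3), List(4, 5))

  // Before: each ++ copies the accumulated list, so flattening is quadratic
  val slow = groups.foldLeft(List.empty[Int])(_ ++ _)

  // After: flatten produces the same result in one pass, without the copies
  val fast = groups.flatten
  ```
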
* [SPARK-10530][CORE] Kill other task attempts when one attempt belonging to the same task succeeds in speculation (Devaraj K, 2016-05-30, 2 files, -0/+2)

  ## What changes were proposed in this pull request?
  With this patch, TaskSetManager kills the other running attempts when any one attempt succeeds for the same task. Killed tasks are no longer counted as failed; they get listed separately in the UI and show the task state as KILLED instead of FAILED.

  ## How was this patch tested?
  core\src\test\scala\org\apache\spark\ui\jobs\JobProgressListenerSuite.scala
  core\src\test\scala\org\apache\spark\util\JsonProtocolSuite.scala

  I verified this patch manually with spark.speculation set to true: when any attempt succeeds, the other running attempts for the same task are killed and other pending tasks are assigned in their place. Killed attempts are reported as KILLED tasks, not FAILED tasks. Please find the attached screenshots for reference.

  ![stage-tasks-table](https://cloud.githubusercontent.com/assets/3174804/14075132/394c6a12-f4f4-11e5-8638-20ff7b8cc9bc.png)
  ![stages-table](https://cloud.githubusercontent.com/assets/3174804/14075134/3b60f412-f4f4-11e5-9ea6-dd0dcc86eb03.png)

  Ref: https://github.com/apache/spark/pull/11916

  Author: Devaraj K <devaraj@apache.org>
  Closes #11996 from devaraj-kavali/SPARK-10530.

* [SPARK-15645][STREAMING] Fix some typos in the Streaming module (Xin Ren, 2016-05-30, 5 files, -10/+9)

  ## What changes were proposed in this pull request?
  No code change, just some typo fixing.

  ## How was this patch tested?
  Manually ran the project build with testing; the build is successful.

  Author: Xin Ren <iamshrek@126.com>
  Closes #13385 from keypointt/codeWalkThroughStreaming.

* [MINOR] Fix Typos 'a -> an' (Zheng RuiFeng, 2016-05-26, 7 files, -11/+11)

  ## What changes were proposed in this pull request?
  `a` -> `an`. Used a regex to generate potential error lines, `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`, then reviewed them line by line.

  ## How was this patch tested?
  Local build, `lint-java` checking

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13317 from zhengruifeng/a_an.

* [MINOR][MLLIB][STREAMING][SQL] Fix typos (lfzCarlosC, 2016-05-25, 1 file, -1/+1)

  Fixed typos in source code for the [mllib], [streaming], and [SQL] components; the fixes are minor and obvious.

  Author: lfzCarlosC <lfz.carlos@gmail.com>
  Closes #13298 from lfzCarlosC/master.

* [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags (Sean Owen, 2016-05-17, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  (See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was apparently the problem last time.)

  Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`.

  ## How was this patch tested?
  Jenkins tests.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #13074 from srowen/SPARK-15290.

* [SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient (Sean Owen, 2016-05-15, 1 file, -0/+5)

  ## What changes were proposed in this pull request?
  (Retry of https://github.com/apache/spark/pull/13049)

  - update to httpclient 4.5 / httpcore 4.4
  - remove some defunct exclusions
  - manage httpmime version to match
  - update selenium / httpunit to support 4.5 (possible now that Jetty 9 is used)

  ## How was this patch tested?
  Jenkins tests. Also, locally running the same test command of one Jenkins profile that failed: `mvn -Phadoop-2.6 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl ...`

  Author: Sean Owen <sowen@cloudera.com>
  Closes #13117 from srowen/SPARK-12972.2.

* [SPARK-14897][SQL] Upgrade to Jetty 9.2.16 (bomeng, 2016-05-12, 1 file, -0/+4)

  ## What changes were proposed in this pull request?
  Since Jetty 8 is EOL (end of life) and has a critical security issue [http://www.securityweek.com/critical-vulnerability-found-jetty-web-server], upgrading to 9 is necessary. This uses the latest 9.2, since 9.3 requires Java 8+. `javax.servlet` and `derby` were also upgraded, since Jetty 9.2 needs corresponding versions.

  ## How was this patch tested?
  Manual test; current test cases should cover it.

  Author: bomeng <bmeng@us.ibm.com>
  Closes #12916 from bomeng/SPARK-14897.

* [SPARK-14976][STREAMING] Make StreamingContext.textFileStream support wildcards (mwws, 2016-05-11, 2 files, -2/+70)

  ## What changes were proposed in this pull request?
  Make StreamingContext.textFileStream support wildcards, like /home/user/*/file.

  ## How was this patch tested?
  Manual testing, plus a new unit test case.

  Author: mwws <wei.mao@intel.com>
  Author: unknown <maowei@maowei-MOBL.ccr.corp.intel.com>
  Closes #12752 from mwws/SPARK_FileStream.

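  A usage sketch, with application settings assumed:

  ```scala
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("wildcard-example").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(10))

  // With this change the glob is expanded, so new files appearing under
  // any matching directory are picked up by the stream.
  val lines = ssc.textFileStream("/home/user/*/file")
  lines.print()
  ssc.start()
  ```
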
* [MINOR][TEST][STREAMING] Make "testDir" able to be cleaned after test (mwws, 2016-05-09, 1 file, -4/+4)

  This fixes a minor bug in a test case: the outer `val testDir = null` stays `null` since it's immutable, so nothing is cleaned in the finally block. The second `testDir` variable created in the try block is only visible inside the try block.

  ## How was this patch tested?
  Ran the existing test case; it passed.

  Author: mwws <wei.mao@intel.com>
  Closes #12999 from mwws/SPARK_MINOR.

* [SPARK-1239] Improve fetching of map output statuses (Thomas Graves, 2016-05-06, 1 file, -1/+3)

  The main issue we are trying to solve is the memory bloat of the Driver when tasks request the map output statuses. With a large number of tasks you either need a huge amount of memory on the Driver or you have to repartition to a smaller number. This makes it really difficult to run over, say, 50000 tasks.

  The main issues that cause the memory bloat are:
  1) No flow control on sending the map output status responses. We serialize the map status output and then hand it off to netty to send. Netty sends asynchronously and can't send the responses fast enough to keep up with incoming requests, so we end up with many copies of the serialized map output statuses sitting there, which causes huge bloat when you have tens of thousands of tasks and a map output status in the tens of MB.
  2) When the initial reduce tasks start up, they all request the map output statuses from the Driver. These requests are handled by multiple threads in parallel, so even though we check for a cached version, initially (when there is no cached version yet) many of the initial requests can all end up serializing the exact same map output statuses.

  This patch does a couple of things (see the sketch after this entry):
  - When the map output status size is over a threshold (default 512K), use broadcast to send the map statuses. This means we no longer serialize a large map output status per request, and thus we don't have issues with memory bloat; message sizes are now in the 300-400 byte range and the map output statuses are broadcast. If it's under the threshold, it is sent as before; the message now contains the DIRECT indicator.
  - Synchronize the incoming requests so that one thread caches the serialized output and broadcasts the map output statuses, which can then be used by everyone else. This ensures we don't create multiple broadcast variables when we don't need to. To ensure this happens, I added a second thread pool which the Dispatcher hands the requests to, so that those threads can block without blocking the main dispatcher threads (which would cause things like heartbeats not to come through).

  Note that some of the design and code was contributed by mridulm.

  ## How was this patch tested?
  Unit tests and a lot of manual testing. Ran with akka and netty RPC, with dynamic allocation both on and off. One of the large jobs used to test this was a join of 15TB of data with 200,000 map tasks and 20,000 reduce tasks; executors ranged from 200 to 2000. This job ran successfully with 5GB of memory on the driver with these changes; without them, I was using 20GB and only had 500 reduce tasks. The job has 50MB of serialized map output statuses, and the executors took roughly the same amount of time to get the map output statuses as before. Also ran a variety of other jobs, from large wordcounts to small ones not using broadcasts.

  Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
  Closes #12113 from tgravescs/SPARK-1239.

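  A minimal sketch of the size-based dispatch described above; the type and method names are assumptions, only the 512K default comes from the description:

  ```scala
  sealed trait MapStatusReply
  case class Direct(bytes: Array[Byte]) extends MapStatusReply      // DIRECT indicator
  case class ViaBroadcast(broadcastId: Long) extends MapStatusReply // broadcast id only

  // Stand-in for registering the serialized statuses as a broadcast variable
  def registerBroadcast(bytes: Array[Byte]): Long = 0L

  def reply(serialized: Array[Byte], minBroadcastSize: Int = 512 * 1024): MapStatusReply =
    if (serialized.length >= minBroadcastSize) ViaBroadcast(registerBroadcast(serialized))
    else Direct(serialized) // small enough: inline the bytes in the response
  ```
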
* [SPARK-15152][DOC][MINOR] Scaladoc and Code style Improvements (Jacek Laskowski, 2016-05-05, 1 file, -24/+25)

  ## What changes were proposed in this pull request?
  Minor doc and code style fixes

  ## How was this patch tested?
  Local build

  Author: Jacek Laskowski <jacek@japila.pl>
  Closes #12928 from jaceklaskowski/SPARK-15152.

* [SPARK-9819][STREAMING][DOCUMENTATION] Clarify doc for invReduceFunc in incremental versions of reduceByWindow (François Garillot, 2016-05-03, 4 files, -4/+8)

  Clarifies:
  - that reduceFunc and invReduceFunc should be associative
  - that the intermediate result in iterated applications of invReduceFunc is its first argument

  Author: François Garillot <francois@garillot.net>
  Closes #8103 from huitseeker/issue/invReduceFuncDoc.

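  A usage sketch of the incremental form being documented:

  ```scala
  import org.apache.spark.streaming.Seconds
  import org.apache.spark.streaming.dstream.DStream

  // Both functions must be associative; in iterated applications the
  // running result is always passed as invReduceFunc's first argument.
  def windowedSum(counts: DStream[Int]): DStream[Int] =
    counts.reduceByWindow(
      (acc, n) => acc + n,     // reduceFunc: add records entering the window
      (acc, old) => acc - old, // invReduceFunc: subtract records leaving it
      Seconds(30), Seconds(10))
  ```
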
* Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵Yin Huai2016-04-281-1/+1
| | | | | | spark-mllib-local" This reverts commit dae538a4d7c36191c1feb02ba87ffc624ab960dc.
* [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local (Pravin Gadakh, 2016-04-28, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  This PR adds the `since` tag to the matrix and vector classes in spark-mllib-local.

  ## How was this patch tested?
  Scala-style checks passed.

  Author: Pravin Gadakh <prgadakh@in.ibm.com>
  Closes #12416 from pravingadakh/SPARK-14613.

* [SPARK-14930][SPARK-13693] Fix race condition in CheckpointWriter.stop() (Josh Rosen, 2016-04-27, 1 file, -12/+5)

  CheckpointWriter.stop() is prone to a race condition: if one thread calls `stop()` right as a checkpoint write task begins to execute, that write task may become blocked when trying to access `fs`, the shared Hadoop FileSystem, since both the `fs` getter and `stop` method synchronize on the same lock. Here's a thread-dump excerpt which illustrates the problem:

  ```java
  "pool-31-thread-1" #156 prio=5 os_prio=31 tid=0x00007fea02cd2000 nid=0x5c0b waiting for monitor entry [0x000000013bc4c000]
     java.lang.Thread.State: BLOCKED (on object monitor)
      at org.apache.spark.streaming.CheckpointWriter.org$apache$spark$streaming$CheckpointWriter$$fs(Checkpoint.scala:302)
      - waiting to lock <0x00000007bf53ee78> (a org.apache.spark.streaming.CheckpointWriter)
      at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:224)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

  "pool-1-thread-1-ScalaTest-running-MapWithStateSuite" #11 prio=5 os_prio=31 tid=0x00007fe9ff879800 nid=0x5703 waiting on condition [0x000000012e54c000]
     java.lang.Thread.State: TIMED_WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)
      - parking to wait for <0x00000007bf564568> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
      at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1465)
      at org.apache.spark.streaming.CheckpointWriter.stop(Checkpoint.scala:291)
      - locked <0x00000007bf53ee78> (a org.apache.spark.streaming.CheckpointWriter)
      at org.apache.spark.streaming.scheduler.JobGenerator.stop(JobGenerator.scala:159)
      - locked <0x00000007bf53ea90> (a org.apache.spark.streaming.scheduler.JobGenerator)
      at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:115)
      - locked <0x00000007bf53d3f0> (a org.apache.spark.streaming.scheduler.JobScheduler)
      at org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:680)
      at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219)
      at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:679)
      - locked <0x00000007bf516a70> (a org.apache.spark.streaming.StreamingContext)
      at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:644)
      - locked <0x00000007bf516a70> (a org.apache.spark.streaming.StreamingContext)
  [...]
  ```

  We can fix this problem by having `stop` and `fs` be synchronized on different locks: the synchronization on `stop` only needs to guard against multiple threads calling `stop` at the same time, whereas the synchronization on `fs` is only necessary for cross-thread visibility. There's only ever a single active checkpoint writer thread at a time, so we don't need to guard against concurrent access to `fs`. Thus, `fs` can simply become a `volatile` var, similar to `lastCheckpointTime` (see the sketch after this entry).

  This change should fix [SPARK-13693](https://issues.apache.org/jira/browse/SPARK-13693), a flaky `MapWithStateSuite` test suite which has recently been failing several times per day.

  It also results in a huge test speedup: prior to this patch, `MapWithStateSuite` took about 80 seconds to run, whereas it now runs in less than 10 seconds. For the `streaming` project's tests as a whole, they now run in ~220 seconds vs. ~354 before.

  /cc zsxwing and tdas for review.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12712 from JoshRosen/fix-checkpoint-writer-race.

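  A minimal sketch of the locking change, with the class body assumed:

  ```scala
  import java.util.concurrent.{Executors, TimeUnit}
  import org.apache.hadoop.fs.FileSystem

  class CheckpointWriterSketch {
    // Only one writer thread ever touches fs, so it needs cross-thread
    // visibility, not mutual exclusion; a volatile var cannot block.
    @volatile private var fs: FileSystem = _

    private val executor = Executors.newFixedThreadPool(1)

    // Now guards only against concurrent stop() calls; a write task
    // reading fs no longer contends for this monitor.
    def stop(): Unit = synchronized {
      executor.shutdown()
      executor.awaitTermination(10, TimeUnit.SECONDS)
    }
  }
  ```
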
* [MINOR][DOCS] Minor typo fixes (Jacek Laskowski, 2016-04-26, 1 file, -2/+2)

  ## What changes were proposed in this pull request?
  Minor typo fixes (too minor to deserve a separate JIRA)

  ## How was this patch tested?
  Local build

  Author: Jacek Laskowski <jacek@japila.pl>
  Closes #12469 from jaceklaskowski/minor-typo-fixes.

* [SPARK-14868][BUILD] Enable NewLineAtEofChecker in checkstyle and fix lint-java errors (Dongjoon Hyun, 2016-04-24, 2 files, -2/+2)

  ## What changes were proposed in this pull request?
  Spark uses the `NewLineAtEofChecker` rule in Scala via ScalaStyle, and most Java code also complies with the rule. This PR enforces the same rule, `NewlineAtEndOfFile`, explicitly via CheckStyle. It also fixes lint-java errors introduced since SPARK-14465. The items are:
  - Adds a new line at the end of the files (19 files)
  - Fixes 25 lint-java errors (12 RedundantModifier, 6 **ArrayTypeStyle**, 2 LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)

  ## How was this patch tested?
  After the Jenkins test succeeds, `dev/lint-java` should pass. (Currently, Jenkins does not run lint-java.)

  ```bash
  $ dev/lint-java
  Using `mvn` from path: /usr/local/bin/mvn
  Checkstyle checks passed.
  ```

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12632 from dongjoon-hyun/SPARK-14868.

* [SPARK-14701][STREAMING] First stop the event loop, then stop the checkpoint writer in JobGenerator (Liwei Lin, 2016-04-22, 1 file, -2/+2)

  Currently, if we call `streamingContext.stop` (e.g. in a `StreamingListener.onBatchCompleted` callback) when a batch is about to complete, a `RejectedExecutionException` may get thrown from `checkPointWriter.executor`, since the `eventLoop` will try to process `DoCheckpoint` events even after the `checkPointWriter.executor` is stopped. Please see [SPARK-14701](https://issues.apache.org/jira/browse/SPARK-14701) for details and stack traces.

  ## What changes were proposed in this pull request?
  Reversed the stopping order of the event loop and the checkpoint writer.

  ## How was this patch tested?
  Existing test suites. (No dedicated test suites were added because the change is simple to reason about.)

  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #12489 from lw-lin/spark-14701.

* [SPARK-6429] Implement hashCode and equals together (Joan, 2016-04-22, 1 file, -3/+7)

  ## What changes were proposed in this pull request?
  Implement some `hashCode` and `equals` methods together in order to enable the scalastyle rule. This is a first batch; I will continue to implement them, but I wanted to know your thoughts.

  Author: Joan <joan@goyeau.com>
  Closes #12157 from joan38/SPARK-6429-HashCode-Equals.

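  An illustrative sketch of the rule being enforced (the class is hypothetical):

  ```scala
  class RecordId(val name: String) {
    // equals and hashCode are overridden together: equal objects must
    // produce equal hash codes, or hash-based collections misbehave.
    override def equals(other: Any): Boolean = other match {
      case that: RecordId => name == that.name
      case _ => false
    }
    override def hashCode: Int = name.hashCode
  }
  ```
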
* [SPARK-8393][STREAMING] JavaStreamingContext#awaitTermination() throws non-declared InterruptedException (Sean Owen, 2016-04-21, 1 file, -0/+2)

  ## What changes were proposed in this pull request?
  `JavaStreamingContext.awaitTermination` methods should be declared as `throws[InterruptedException]` so that this exception can be handled in Java code. Note this is not just a doc change, but an API change, since now (in Java) the method has a checked exception to handle. All await-like methods in Java APIs behave this way, so it seems worthwhile for 2.0.

  ## How was this patch tested?
  Jenkins tests

  Author: Sean Owen <sowen@cloudera.com>
  Closes #12418 from srowen/SPARK-8393.

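  A sketch of the declaration change; the class and body are placeholders:

  ```scala
  class AwaitExample {
    // @throws puts the exception into the generated Java signature,
    // so Java callers must now catch or declare InterruptedException.
    @throws[InterruptedException]
    def awaitTermination(): Unit = {
      Thread.sleep(Long.MaxValue) // stand-in for blocking until stopped
    }
  }
  ```
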
* Revert "[SPARK-14719] WriteAheadLogBasedBlockHandler should ignore ↵Josh Rosen2016-04-192-40/+31
| | | | | | BlockManager put errors" This reverts commit ed2de0299a5a54b566b91ae9f47b6626c484c1d3.
* [SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture full stacktrace (Josh Rosen, 2016-04-19, 3 files, -7/+9)

  When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it, but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread. This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`.

  I tested this manually using https://github.com/JoshRosen/spark/commit/16b31c825197ee31a50214c6ba3c1df08148f403, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR.

  /cc rxin nongli yhuai anabranch

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.

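  A minimal sketch of the wrap-and-rethrow idea; the helper name and wrapper exception type are assumptions:

  ```scala
  import scala.concurrent.{Await, Awaitable}
  import scala.concurrent.duration.Duration

  def awaitResult[T](awaitable: Awaitable[T], atMost: Duration): T =
    try {
      Await.result(awaitable, atMost)
    } catch {
      // Constructing a new exception here records the calling thread's
      // stack trace, which the original cause alone would not show.
      case e: Exception =>
        throw new RuntimeException("Exception thrown in awaitResult", e)
    }
  ```
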
* [SPARK-14719] WriteAheadLogBasedBlockHandler should ignore BlockManager put errors (Josh Rosen, 2016-04-18, 2 files, -31/+40)

  WriteAheadLogBasedBlockHandler will currently throw exceptions if its BlockManager `put()` calls fail, even though those calls are only performed as a performance optimization. Instead, it should log and ignore exceptions during that `put()`. This is a longstanding issue that was masked by an incorrect test case. I think we haven't noticed this in production because 1. most people probably use a `MEMORY_AND_DISK` storage level, and 2. typically, individual blocks may be small enough relative to the total storage memory that they're able to evict blocks from previous batches, so `put()` failures here may be rare in practice. This patch fixes the faulty test and fixes the bug.

  /cc tdas

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #12484 from JoshRosen/received-block-hadndler-fix.

* [SPARK-14667] Remove HashShuffleManager (Reynold Xin, 2016-04-18, 1 file, -2/+2)

  ## What changes were proposed in this pull request?
  The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.

  ## How was this patch tested?
  Removed some tests related to the old manager.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12423 from rxin/SPARK-14667.

* [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations (hyukjinkwon, 2016-04-16, 1 file, -8/+8)

  ## What changes were proposed in this pull request?
  This PR removes:

  - Inappropriate type notations. For example, from

  ```scala
  words.foreachRDD { (rdd: RDD[String], time: Time) =>
  ...
  ```

  to

  ```scala
  words.foreachRDD { (rdd, time) =>
  ...
  ```

  - Extra anonymous closures within functional transformations. For example,

  ```scala
  .map(item => {
    ...
  })
  ```

  which can simply be:

  ```scala
  .map { item =>
    ...
  }
  ```

  It also corrects some obvious style nits.

  ## How was this patch tested?
  This was tested after adding rules in `scalastyle-config.xml`, which did not end up catching every case perfectly. The rules applied were below:

  - For the first correction,

  ```xml
  <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
  </check>
  ```

  ```xml
  <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
  </check>
  ```

  - For the second correction,

  ```xml
  <check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
  </check>
  ```

  **Those rules were not added**

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #12413 from HyukjinKwon/SPARK-style.

* [SPARK-14630][BUILD][CORE][SQL][STREAMING] Code style: public abstract methods should have explicit return types (Liwei Lin, 2016-04-14, 5 files, -14/+14)

  ## What changes were proposed in this pull request?
  Currently many public abstract methods (in abstract classes as well as traits) don't declare return types explicitly, such as in [o.a.s.streaming.dstream.InputDStream](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/InputDStream.scala#L110):

  ```scala
  def start() // should be: def start(): Unit
  def stop()  // should be: def stop(): Unit
  ```

  These methods exist in core, sql, streaming; this PR fixes them.

  ## How was this patch tested?
  N/A

  ## Which piece of scala style rule led to the changes?
  The rule was added separately in https://github.com/apache/spark/pull/12396

  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #12389 from lw-lin/public-abstract-methods.

* [MINOR][SQL] Remove extra anonymous closure within functional transformations (hyukjinkwon, 2016-04-14, 3 files, -8/+8)

  ## What changes were proposed in this pull request?
  This PR removes extra anonymous closures within functional transformations. For example,

  ```scala
  .map(item => {
    ...
  })
  ```

  which can simply be:

  ```scala
  .map { item =>
    ...
  }
  ```

  ## How was this patch tested?
  Related unit tests and `sbt scalastyle`.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #12382 from HyukjinKwon/minor-extra-closers.

* [SPARK-14544][SQL] Improve performance of SQL UI tab (Davies Liu, 2016-04-12, 1 file, -2/+2)

  ## What changes were proposed in this pull request?
  This PR improves the performance of the SQL UI by:
  1) Removing the details column on the all-executions page (the first page in the SQL tab); the details can be checked on the execution page.
  2) Switching from break-all to break-word, since break-all has become super slow in Chrome recently.
  3) Using "display: none" to hide a block.
  4) Using one js closure for all the executions, not one for each.
  5) Removing the height limitation of details, which no longer needs to be scrolled in a tiny window.

  ## How was this patch tested?
  Existing tests.

  ![ui](https://cloud.githubusercontent.com/assets/40902/14445712/68d7b258-0004-11e6-9b48-5d329b05d165.png)

  Author: Davies Liu <davies@databricks.com>
  Closes #12311 from davies/ui_perf.

* [SPARK-14508][BUILD] Add a new ScalaStyle Rule `OmitBracesInCase` (Dongjoon Hyun, 2016-04-12, 7 files, -32/+15)

  ## What changes were proposed in this pull request?
  According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule:

  ```
  case: Always omit braces in case clauses.
  ```

  This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it on the code.

  ## How was this patch tested?
  Pass the Jenkins tests (including Scala style checking).

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12280 from dongjoon-hyun/SPARK-14508.

* [SPARK-14475] Propagate user-defined context from driver to executors (Eric Liang, 2016-04-11, 4 files, -8/+24)

  ## What changes were proposed in this pull request?
  This adds a new API call, `TaskContext.getLocalProperty`, for getting properties set in the driver from executors. These local properties are automatically propagated from the driver to executors. For streaming, the context for streaming tasks will be the initial driver context when ssc.start() is called.

  ## How was this patch tested?
  Unit tests.

  cc JoshRosen

  Author: Eric Liang <ekl@databricks.com>
  Closes #12248 from ericl/sc-2813.

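  A usage sketch of the new API; the property key is arbitrary:

  ```scala
  import org.apache.spark.{SparkContext, TaskContext}

  def propagate(sc: SparkContext): Unit = {
    // Set on the driver...
    sc.setLocalProperty("operation.tag", "nightly-job")
    sc.parallelize(1 to 4).foreach { _ =>
      // ...and read from a task running on an executor
      val tag = TaskContext.get().getLocalProperty("operation.tag")
      assert(tag == "nightly-job")
    }
  }
  ```
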
* [SPARK-14455][STREAMING] Fix NPE in allocatedExecutors when called in a receiver-less scenario (jerryshao, 2016-04-09, 2 files, -4/+31)

  ## What changes were proposed in this pull request?
  When calling `ReceiverTracker#allocatedExecutors` in a receiver-less scenario, an NPE is thrown, since the `ReceiverTracker` is not actually started and its `endpoint` is not created. This happens when using streaming dynamic allocation with direct Kafka.

  ## How was this patch tested?
  Local integration testing was done.

  Author: jerryshao <sshao@hortonworks.com>
  Closes #12236 from jerryshao/SPARK-14455.

* [SPARK-14437][CORE] Use the address that NettyBlockTransferService listens to when creating BlockManagerId (Shixiong Zhu, 2016-04-08, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  Here is why SPARK-14437 happens: BlockManagerId is created using NettyBlockTransferService.hostName, which comes from `customHostname`, and `Executor` sets `customHostname` to the hostname detected by the driver. However, the driver may not be able to detect the correct address in some complicated networks (Netty's Channel.remoteAddress doesn't always return a connectable address). In such cases, `BlockManagerId` is created using a wrong hostname. To fix this issue, this PR uses the `hostname` provided by `SparkEnv.create` to create `NettyBlockTransferService` and sets `NettyBlockTransferService.hostname` to it directly. A bonus of this approach is that NettyBlockTransferService won't bind to `0.0.0.0`, which is much safer.

  ## How was this patch tested?
  Manually checked the bound address using local-cluster.

  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #12240 from zsxwing/SPARK-14437.

* [SPARK-14134][CORE] Change the package name used for shading classes (Marcelo Vanzin, 2016-04-06, 1 file, -1/+1)

  The current package name uses a dash, which is a little weird but seemed to work. That is, until a new test tried to mock a class that references one of those shaded types, and then things started failing. Most changes are just noise to fix the logging configs. For reference, SPARK-8815 also raised this issue, although at the time it did not cause any issues in Spark, so it was not addressed.

  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #11941 from vanzin/SPARK-14134.

* [SPARK-12133][STREAMING] Streaming dynamic allocation (Tathagata Das, 2016-04-06, 5 files, -3/+665)

  ## What changes were proposed in this pull request?
  Added a new Executor Allocation Manager to the Streaming scheduler for doing Streaming Dynamic Allocation.

  ## How was this patch tested?
  Unit tests and cluster tests.

  Author: Tathagata Das <tathagata.das1565@gmail.com>
  Closes #12154 from tdas/streaming-dynamic-allocation.

* [SPARK-13211][STREAMING] StreamingContext throws NoSuchElementException when created from non-existent checkpoint directory (Sean Owen, 2016-04-05, 3 files, -9/+10)

  ## What changes were proposed in this pull request?
  Take 2: avoid a None.get NoSuchElementException in favor of a more descriptive IllegalArgumentException when a non-existent checkpoint dir is used without a SparkContext.

  ## How was this patch tested?
  Jenkins tests, plus a new test for this particular case.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #12174 from srowen/SPARK-13211.

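  A sketch of the fail-fast check; names and message wording are assumed:

  ```scala
  def resolveCheckpoint(checkpoint: Option[String], path: String): String =
    checkpoint.getOrElse(
      // Replaces a bare Option.get, whose NoSuchElementException says
      // nothing about what was actually missing.
      throw new IllegalArgumentException(s"Checkpoint directory does not exist: $path"))
  ```
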
* [SPARK-12425][STREAMING] DStream union optimisation (Guillaume Poulin, 2016-04-05, 2 files, -11/+4)

  Use PartitionerAwareUnionRDD when possible, to optimize shuffling and preserve the partitioner.

  Author: Guillaume Poulin <poulin.guillaume@gmail.com>
  Closes #10382 from gpoulin/dstream_union_optimisation.

* [SPARK-14358] Change SparkListener from a trait to an abstract class (Reynold Xin, 2016-04-04, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  Scala traits are difficult to maintain binary compatibility on, and as a result we had to introduce JavaSparkListener. In Spark 2.0 we can change SparkListener from a trait to an abstract class and then remove JavaSparkListener.

  ## How was this patch tested?
  Updated related unit tests.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #12142 from rxin/SPARK-14358.

* [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code (Dongjoon Hyun, 2016-04-02, 4 files, -20/+24)

  ## What changes were proposed in this pull request?
  This PR converts all Scala-style multiline comments into Java-style multiline comments in Scala code. (All comment-only changes over 77 files: +786 lines, -747 lines.)

  ## How was this patch tested?
  Manual.

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.

* [MINOR] Typo fixes (Jacek Laskowski, 2016-04-02, 9 files, -36/+39)

  ## What changes were proposed in this pull request?
  Typo fixes. No functional changes.

  ## How was this patch tested?
  Built the sources and ran with samples.

  Author: Jacek Laskowski <jacek@japila.pl>
  Closes #11802 from jaceklaskowski/typo-fixes.

* [SPARK-12857][STREAMING] Standardize "records" and "events" on "records" (Liwei Lin, 2016-04-01, 4 files, -40/+41)

  ## What changes were proposed in this pull request?
  Currently the Streaming tab in the web UI uses "records" and "events" interchangeably; this PR standardizes on "records". "records" is chosen over "events" because:
  - "records" is used extensively throughout the streaming documents, code, and comments
  - "events" is used only in Streaming UI related code and comments

  ## How was this patch tested?
  - existing test suites
  - manually checking the Streaming UI tab

  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #12032 from lw-lin/streaming-events-to-records.
