path: root/streaming/src
Commit message · Author · Date · Files · Lines
* [java8API] SPARK-964 Investigate the potential for using JDK 8 lambda expressions for the Java/Scala APIs
  Prashant Sharma · 2014-03-03 · 5 files · -158/+130

  Author: Prashant Sharma <prashant.s@imaginea.com>
  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #17 from ScrapCodes/java8-lambdas and squashes the following commits:
  95850e6 [Patrick Wendell] Some doc improvements and build changes to the Java 8 patch.
  85a954e [Prashant Sharma] Nit. import orderings.
  673f7ac [Prashant Sharma] Added support for -java-home as well
  80a13e8 [Prashant Sharma] Used fake class tag syntax
  26eb3f6 [Prashant Sharma] Patrick's comments on PR.
  35d8d79 [Prashant Sharma] Specified java 8 building in the docs
  31d4cd6 [Prashant Sharma] Maven build to support -Pjava8-tests flag.
  4ab87d3 [Prashant Sharma] Review feedback on the pr
  c33dc2c [Prashant Sharma] SPARK-964, Java 8 API Support.
* SPARK-1158: Fix flaky RateLimitedOutputStreamSuite.
  Reynold Xin · 2014-03-03 · 2 files · -20/+32

  There was actually a problem with the RateLimitedOutputStream implementation: because
  of integer rounding, nothing was written during the first second, so
  RateLimitedOutputStream was overly aggressive in throttling.

  Author: Reynold Xin <rxin@apache.org>

  Closes #55 from rxin/ratelimitest and squashes the following commits:
  52ce1b7 [Reynold Xin] SPARK-1158: Fix flaky RateLimitedOutputStreamSuite.
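  To illustrate the rounding pitfall, a minimal sketch (not Spark's actual
  RateLimitedOutputStream; the class and method names here are hypothetical):

      import java.io.OutputStream

      class ThrottledOutputStream(out: OutputStream, bytesPerSec: Int) extends OutputStream {
        private val start = System.nanoTime()
        private var written = 0L

        override def write(b: Int): Unit = {
          waitToWrite(1)
          out.write(b)
          written += 1
        }

        private def waitToWrite(numBytes: Long): Unit = {
          val elapsedMs = (System.nanoTime() - start) / 1000000
          // Compute the byte budget from elapsed *milliseconds*. Truncating the
          // elapsed time to whole seconds yields a budget of zero for the entire
          // first second, which is the over-throttling described above.
          val allowed = bytesPerSec.toLong * elapsedMs / 1000
          if (written + numBytes > allowed) {
            Thread.sleep((written + numBytes - allowed) * 1000 / bytesPerSec)
          }
        }
      }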
* Ignore RateLimitedOutputStreamSuite for now.
  Reynold Xin · 2014-03-02 · 1 file · -1/+1

  This test has been flaky. We can re-enable it after @tdas has a chance to look at it.

  Author: Reynold Xin <rxin@apache.org>

  Closes #54 from rxin/ratelimit and squashes the following commits:
  1a12198 [Reynold Xin] Ignore RateLimitedOutputStreamSuite for now.
* SPARK 1084.1 (resubmitted)
  Sean Owen · 2014-02-27 · 4 files · -55/+105

  (Ported from https://github.com/apache/incubator-spark/pull/637 )

  Author: Sean Owen <sowen@cloudera.com>

  Closes #31 from srowen/SPARK-1084.1 and squashes the following commits:
  6c4a32c [Sean Owen] Suppress warnings about legitimate unchecked array creations, or change code to avoid it
  f35b833 [Sean Owen] Fix two misc javadoc problems
  254e8ef [Sean Owen] Fix one new style error introduced in scaladoc warning commit
  5b2fce2 [Sean Owen] Fix scaladoc invocation warning, and enable javac warnings properly, with plugin config updates
  007762b [Sean Owen] Remove dead scaladoc links
  b8ff8cb [Sean Owen] Replace deprecated Ant <tasks> with <target>
* Merge pull request #567 from ScrapCodes/style2.
  Prashant Sharma · 2014-02-09 · 18 files · -65/+108

  SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Pt 2

  Continuation of PR #557. With this, all Scala style errors are fixed across the code
  base. The reason for creating a separate PR was to not interrupt an already reviewed
  and ready-to-merge PR. Hope this gets reviewed soon and merged too.

  Author: Prashant Sharma <prashant.s@imaginea.com>

  Closes #567 and squashes the following commits:
  3b1ec30 [Prashant Sharma] scala style fixes
* Merge pull request #557 from ScrapCodes/style. Closes #557.
  Patrick Wendell · 2014-02-09 · 1 file · -7/+10

  SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build.

  Author: Patrick Wendell <pwendell@gmail.com>
  Author: Prashant Sharma <scrapcodes@gmail.com>

  == Merge branch commits ==

  commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4
  Author: Prashant Sharma <scrapcodes@gmail.com>
  Date: Sun Feb 9 17:39:07 2014 +0530

      scala style fixes

  commit f91709887a8e0b608c5c2b282db19b8a44d53a43
  Author: Patrick Wendell <pwendell@gmail.com>
  Date: Fri Jan 24 11:22:53 2014 -0800

      Adding scalastyle snapshot
* Merge pull request #529 from hsaputra/cleanup_right_arrowop_scala
  Henry Saputra · 2014-02-02 · 1 file · -8/+8

  Change the ⇒ character (maybe from scalariform) to => in Scala code for style consistency

  There are some ⇒ Unicode characters (maybe from scalariform) in the Scala code. This
  PR changes them to => to get some consistency in the Scala code. If we wanted ⇒ as
  the default, we could use the sbt plugin scalariform to make sure all Scala code uses
  ⇒ instead of =>. Also removed unused imports found in TwitterInputDStream.scala while
  I was there =)

  Author: Henry Saputra <hsaputra@apache.org>

  == Merge branch commits ==

  commit 29c1771d346dff901b0b778f764e6b4409900234
  Author: Henry Saputra <hsaputra@apache.org>
  Date: Sat Feb 1 22:05:16 2014 -0800

      Change the ⇒ character (maybe from scalariform) to => in Scala code for style consistency.
* Fix ClassCastException in JavaPairRDD.collectAsMap() (SPARK-1040)
  Josh Rosen · 2014-01-25 · 2 files · -3/+3

  This fixes an issue where collectAsMap() could fail when called on a JavaPairRDD that
  was derived by transforming a non-JavaPairRDD. The root problem was that we were
  creating the JavaPairRDD's ClassTag by casting a ClassTag[AnyRef] to a
  ClassTag[Tuple2[K2, V2]]. To fix this, I cast a ClassTag[Tuple2[_, _]] instead, since
  this actually produces a ClassTag of the appropriate type because ClassTags don't
  capture type parameters:

      scala> implicitly[ClassTag[Tuple2[_, _]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
      res8: Boolean = true

      scala> implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[Tuple2[Int, Int]]] == implicitly[ClassTag[Tuple2[Int, Int]]]
      res9: Boolean = false
* Updated java API docs for streaming, along with very minor changes in the code examples.
  Tathagata Das · 2014-01-16 · 7 files · -73/+77
* Made some classes private[streaming] and deprecated a method in JavaStreamingContext.
  Tathagata Das · 2014-01-15 · 5 files · -4/+11
* Merge remote-tracking branch 'apache/master' into filestream-fix
  Tathagata Das · 2014-01-14 · 1 file · -0/+17

* Add missing header files
  Patrick Wendell · 2014-01-14 · 1 file · -0/+17
* Changed SparkConf to not be serializable. Also fixed unit-test log paths in
  log4j.properties of external modules.
  Tathagata Das · 2014-01-14 · 4 files · -13/+53

* Fixed loose ends in docs.
  Tathagata Das · 2014-01-14 · 1 file · -2/+0
* Merge remote-tracking branch 'apache/master' into filestream-fix
  Tathagata Das · 2014-01-13 · 7 files · -33/+186

  Conflicts:
      streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala

* Merge pull request #413 from rxin/scaladoc
  Patrick Wendell · 2014-01-13 · 3 files · -5/+127

  Adjusted visibility of various components and documentation for 0.9.0 release.

* Adjusted visibility of various components.
  Reynold Xin · 2014-01-13 · 3 files · -5/+127
* Merge pull request #409 from tdas/unpersist
  Patrick Wendell · 2014-01-13 · 4 files · -27/+61

  Automatically unpersisting RDDs that have been cleaned up from DStreams

  Earlier, RDDs generated by DStreams were forgotten but not unpersisted; the system
  relied on the natural BlockManager LRU to drop the data. The cleaner.ttl was a hammer
  for cleaning up RDDs, but it has to be set separately and very conservatively (at
  best, a few minutes). This automatic unpersisting lets the system handle cleanup
  itself, which reduces memory usage. As a side effect it also improves GC performance,
  since fewer objects are kept in memory. In fact, for some workloads it may allow RDDs
  to be cached as deserialized objects, which speeds up processing without too much GC
  overhead.

  This is disabled by default. To enable it, set the configuration
  spark.streaming.unpersist to true. In a future release, this will be set to true by
  default.

  Also, reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second.
  From my conversation with Matei, there does not seem to be any good reason for the
  sleep that lets messages be sent out to be so long.
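  Enabling the new behavior is a one-line configuration change; a minimal sketch
  (the application name is a placeholder):

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      // Opt in to automatic unpersisting of RDDs that DStreams have cleaned up.
      // Disabled by default as of this commit.
      val conf = new SparkConf()
        .setAppName("UnpersistExample")
        .set("spark.streaming.unpersist", "true")
      val ssc = new StreamingContext(conf, Seconds(1))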
* Added unpersisting and modified testsuite to better test out metadata cleaning.
  Tathagata Das · 2014-01-13 · 4 files · -27/+61
* Merge pull request #411 from tdas/filestream-fix
  Patrick Wendell · 2014-01-13 · 2 files · -62/+69

  Improved logic of finding new files in FileInputDStream

  Earlier, if HDFS had a hiccup and reported the existence of a new file (mod time T
  sec) only at time T + 1 sec, fileStream could have missed that file. With this
  change, it should be able to find files that are delayed by up to <batch size>
  seconds. That is, even if a file is reported at T + <batch time> sec, the file stream
  should be able to catch it.

  The new logic, at a high level, is as follows. It keeps track of the new files it
  found in the previous interval and the mod time of the oldest of those files (call it
  X). In the current interval, it ignores files that were seen in the previous interval
  and files with mod time older than X. So if a new file gets reported by HDFS in the
  current interval but has a mod time in the previous interval, it will be considered.
  However, if the mod time is earlier than the previous interval (that is, earlier than
  X), the file will be ignored. This is the current limitation, and a future version
  will improve this behavior further.

  Also reduced line lengths in DStream to <=100 chars.
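  A schematic of the selection rule described above (an illustrative sketch with
  hypothetical names, not the FileInputDStream code):

      // Accept a candidate file only if it was not seen in the previous interval
      // and its mod time is no older than X, the oldest mod time seen previously.
      def selectNewFiles(
          candidates: Seq[(String, Long)],  // (path, modification time)
          prevFiles: Set[String],           // files found in the previous interval
          oldestPrevModTime: Long           // X
        ): Seq[(String, Long)] = {
        candidates.filter { case (path, modTime) =>
          // A file reported late by HDFS, but with mod time >= X, is still accepted.
          !prevFiles.contains(path) && modTime >= oldestPrevModTime
        }
      }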
* Removed StreamingContext.registerInputStream and registerOutputStream - they were
  useless as InputDStream has been made to register itself. Also made
  DStream.register() private[streaming] - not useful to expose the confusing function.
  Updated a lot of documentation.
  Tathagata Das · 2014-01-13 · 14 files · -80/+92
* Merge remote-tracking branch 'apache/master' into filestream-fix
  Tathagata Das · 2014-01-13 · 1 file · -1/+1

* Updated JavaStreamingContext to make scaladoc compile.
  Reynold Xin · 2014-01-13 · 1 file · -1/+1

  `sbt/sbt doc` used to fail. This fixed it.

* Improved file input stream further.
  Tathagata Das · 2014-01-13 · 2 files · -62/+69
* Merge pull request #400 from tdas/dstream-move
  Patrick Wendell · 2014-01-13 · 32 files · -45/+76

  Moved DStream and PairDStream to org.apache.spark.streaming.dstream

  Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been
  moved from `org.apache.spark.streaming.DStream` to
  `org.apache.spark.streaming.dstream.DStream`. I know that the package name is a
  little long, but I think it's better to keep it consistent with Spark's structure.

  Also fixed persistence of windowed DStreams. The RDDs generated by a windowed DStream
  are essentially unions of underlying RDDs, and persisting these union RDDs would
  store numerous copies of the underlying data. Instead, setting the persistence level
  on the windowed DStream is made to set the persistence level of the underlying
  DStream.
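  For user code, the move amounts to an import change (sketch):

      // Old location (pre-move): org.apache.spark.streaming.DStream
      // New location, mirroring org.apache.spark.rdd.RDD:
      import org.apache.spark.streaming.dstream.DStream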
* Merge remote-tracking branch 'apache/master' into dstream-move
  Tathagata Das · 2014-01-12 · 5 files · -13/+50

  Conflicts:
      streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala

* Fixed persistence logic of WindowedDStream, and fixed default persistence level of
  input streams.
  Tathagata Das · 2014-01-12 · 5 files · -9/+33

* Merge branch 'error-handling' into dstream-move
  Tathagata Das · 2014-01-12 · 11 files · -70/+110

* Moved DStream, DStreamCheckpointData and PairDStream from org.apache.spark.streaming
  to org.apache.spark.streaming.dstream.
  Tathagata Das · 2014-01-12 · 30 files · -39/+46
* Merge pull request #397 from pwendell/host-port
  Reynold Xin · 2014-01-12 · 5 files · -7/+1

  Remove now un-needed hostPort option

  I noticed this was logging some scary error messages in various places. After I
  looked into it, this is no longer really used. I removed the option and re-wrote the
  one remaining use case (it was unnecessary there anyways).

* Removing mentions in tests
  Patrick Wendell · 2014-01-12 · 5 files · -7/+1
* Merge pull request #395 from hsaputra/remove_simpleredundantreturn_scala
  Patrick Wendell · 2014-01-12 · 6 files · -10/+10

  Remove simple redundant return statements for Scala methods/functions:
  - Only change simple return statements at the end of a method
  - Ignore the complex if-else checks
  - Ignore the ones inside synchronized blocks
  - Add small changes turning var into val where possible, and remove () for simple getters

  This hopefully makes the review simpler =)

  Pass compile and tests.
* Merge branch 'master' into remove_simpleredundantreturn_scala
  Henry Saputra · 2014-01-12 · 11 files · -204/+483
* Remove simple redundant return statements for Scala methods/functions
  Henry Saputra · 2014-01-12 · 6 files · -10/+10

  - Only change simple return statements at the end of a method
  - Ignore the complex if-else checks
  - Ignore the ones inside synchronized blocks
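  The kind of change this applies, as a trivial sketch (the method names are made up):

      // Before: explicit return at the end of the method (redundant in Scala)
      def sumBefore(xs: Seq[Int]): Int = {
        return xs.foldLeft(0)(_ + _)
      }

      // After: the last expression of the block is the result
      def sumAfter(xs: Seq[Int]): Int = {
        xs.foldLeft(0)(_ + _)
      }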
* Merge pull request #394 from tdas/error-handling
  Patrick Wendell · 2014-01-12 · 20 files · -228/+588

  Better error handling in Spark Streaming and more API cleanup

  Earlier, errors in jobs generated by Spark Streaming (or in the generation of jobs)
  could not be caught from the main driver thread (i.e. the thread that called
  StreamingContext.start()), as they would be thrown in different threads. With this
  change, after `ssc.start()` one can call `ssc.awaitTermination()`, which blocks until
  the ssc is closed or there is an exception. This makes it easier to debug.

  This change also adds ssc.stop(<stop-spark-context>), with which you can stop the
  StreamingContext without stopping the SparkContext.

  Also fixes the bug that came up with PRs #393 and #381. The MetadataCleaner default
  value has been changed from 3500 to -1 for a normal SparkContext, and to 3600 when
  creating a StreamingContext. Also updated StreamingListenerBus with changes similar
  to SparkListenerBus in #392, and changed a lot of protected[streaming] to
  private[streaming].
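  The driver-side pattern this enables, as a sketch built from the APIs named in the
  commit message:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setAppName("ErrorHandlingExample")
      val ssc = new StreamingContext(conf, Seconds(1))
      // ... set up DStreams ...
      ssc.start()
      // Blocks until the context is stopped or a job fails, so errors raised on
      // other threads surface in the main driver thread.
      ssc.awaitTermination()

      // Elsewhere: stop streaming but keep the underlying SparkContext alive.
      ssc.stop(false)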
* Changed StreamingContext.stopForWait to awaitTermination.
  Tathagata Das · 2014-01-12 · 4 files · -15/+15

* Fixed bugs to ensure better cleanup of JobScheduler, JobGenerator and
  NetworkInputTracker upon close.
  Tathagata Das · 2014-01-12 · 9 files · -56/+96

* Fixed bugs.
  Tathagata Das · 2014-01-12 · 1 file · -1/+1
* Merge remote-tracking branch 'apache/master' into error-handling
  Tathagata Das · 2014-01-11 · 1 file · -10/+19

* Added waitForStop and stop to JavaStreamingContext.
  Tathagata Das · 2014-01-11 · 2 files · -3/+23
* Converted JobScheduler to use actors for event handling. Changed
  protected[streaming] to private[streaming] in StreamingContext and DStream. Added
  waitForStop to StreamingContext, and StreamingContextSuite.
  Tathagata Das · 2014-01-11 · 16 files · -184/+484
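  A generic sketch of the actor-based event-handling pattern referred to here (the
  event types and names are hypothetical, not the JobScheduler internals):

      import akka.actor.{Actor, ActorSystem, Props}

      sealed trait SchedulerEvent
      case class JobStarted(id: Long) extends SchedulerEvent
      case class JobCompleted(id: Long) extends SchedulerEvent

      class EventHandler extends Actor {
        // Events are processed one at a time from the actor's mailbox,
        // which removes the need for explicit locking in the scheduler.
        def receive = {
          case JobStarted(id)   => println(s"job $id started")
          case JobCompleted(id) => println(s"job $id completed")
        }
      }

      val system = ActorSystem("scheduler")
      val handler = system.actorOf(Props[EventHandler], "events")
      handler ! JobStarted(1)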
* Adding deprecated versions of old code
  Patrick Wendell · 2014-01-12 · 2 files · -8/+45
* Rename DStream.foreach to DStream.foreachRDD
  Patrick Wendell · 2014-01-12 · 4 files · -12/+12

  `foreachRDD` makes it clear that the granularity of this operator is per-RDD. As it
  stands, `foreach` is inconsistent with `map`, `filter`, and the other DStream
  operators, which get pushed down to individual records within each RDD.
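  Usage after the rename (sketch):

      import org.apache.spark.streaming.dstream.DStream

      // foreachRDD hands each batch over as a whole RDD, unlike map/filter,
      // which operate on the individual records inside each RDD.
      def printBatchSizes(stream: DStream[(String, Int)]): Unit = {
        stream.foreachRDD { rdd =>
          println(s"batch size: ${rdd.count()}")
        }
      }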
* Revert "Fix one unit test that was not setting spark.cleaner.ttl"
  Patrick Wendell · 2014-01-11 · 1 file · -1/+1

  This reverts commit 942c80b34c4642de3b0237761bc1aaeb8cbdd546.

* Merge pull request #381 from mateiz/default-ttl
  Matei Zaharia · 2014-01-10 · 1 file · -1/+1

  Fix default TTL for metadata cleaner

  It seems to have been set to 3500 in a previous commit for debugging, but it should
  be off by default.

* Fix one unit test that was not setting spark.cleaner.ttl
  Matei Zaharia · 2014-01-10 · 1 file · -1/+1
* Merge pull request #383 from tdas/driver-test
  Patrick Wendell · 2014-01-10 · 10 files · -193/+463

  API for automatic driver recovery for streaming programs and other bug fixes

  1. Added Scala and Java APIs for automatically loading a checkpoint if it exists in
     the provided checkpoint directory.

     Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new
     StreamingContext>)` returns a StreamingContext.
     Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type
     JavaStreamingContextFactory>)` returns a JavaStreamingContext.

     See RecoverableNetworkWordCount as an example of how to use it.

  2. Refactored the streaming.Checkpoint*** code to fix bugs and make DStream metadata
     checkpoint writing and reading more robust. Specifically, it fixes and improves
     the logic behind backing up and writing metadata checkpoint files. It also ensures
     that spark.driver.* and spark.hostPort are cleared from SparkConf before being
     written to a checkpoint.

  3. Fixed a bug in the cleanup of checkpointed RDDs created by DStreams. Specifically,
     this fix ensures that a checkpointed RDD's files are not prematurely cleaned up,
     thus ensuring reliable recovery.

  4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key).
     This allows clearing of data based on access time (i.e., clearing records that
     were last accessed before a threshold timestamp).

  5. Added caching of file modification times in FileInputDStream using the updated
     TimeStampedHashMap. Without the caching, enumerating the mod times to find new
     files can take seconds if there are thousands of files. This cache is
     automatically cleared.

  This PR is not entirely final, as I may make some minor additions: a Java example,
  and adding StreamingContext.getOrCreate to a unit test. Edit: Java example to be
  added later; unit test added.
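  The Scala recovery pattern from point 1, as a sketch (the checkpoint path is a
  placeholder):

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("RecoverableApp")
        val ssc = new StreamingContext(conf, Seconds(1))
        // ... set up DStreams here ...
        ssc.checkpoint("hdfs:///tmp/checkpoints")  // placeholder path
        ssc
      }

      // Rebuilds the context from the checkpoint if one exists; otherwise
      // constructs a fresh context with the supplied function.
      val ssc = StreamingContext.getOrCreate("hdfs:///tmp/checkpoints", createContext _)
      ssc.start()
      ssc.awaitTermination()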
* Merge remote-tracking branch 'apache/master' into driver-test
  Tathagata Das · 2014-01-10 · 2 files · -8/+8

  Conflicts:
      streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala

* Modified streaming.FailureSuite tests to test StreamingContext.getOrCreate.
  Tathagata Das · 2014-01-10 · 1 file · -21/+34

* Updated docs based on Patrick's comments in PR 383.
  Tathagata Das · 2014-01-10 · 2 files · -11/+16