path: root/streaming
Commit message [Author, Date, Files, Lines -/+]
* Updated JavaStreamingContext to make scaladoc compile. [Reynold Xin, 2014-01-13, 1 file, -1/+1]
| | | | `sbt/sbt doc` used to fail. This fixed it.
* Merge pull request #400 from tdas/dstream-move [Patrick Wendell, 2014-01-13, 32 files, -45/+76]
|\ | | | | | | | | | | | | | | Moved DStream and PairDStream to org.apache.spark.streaming.dstream Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been moved from `org.apache.spark.streaming.DStream` to `org.apache.spark.streaming.dstream.DStream`. I know that the package name is a little long, but I think it's better to keep it consistent with Spark's structure. Also fixed persistence of windowed DStream. The RDDs generated by a windowed DStream are essentially unions of underlying RDDs, and persisting these union RDDs would store numerous copies of the underlying data. Instead, setting the persistence level on the windowed DStream now sets the persistence level of the underlying DStream.
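A minimal sketch of the new import path after this move; the socket word-count pipeline around it is illustrative, not taken from the commit.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
// DStream now lives in the dstream subpackage, mirroring org.apache.spark.rdd.RDD.
import org.apache.spark.streaming.dstream.DStream

object DStreamMoveSketch {
  def main(args: Array[String]) {
    val ssc = new StreamingContext("local[2]", "DStreamMoveSketch", Seconds(1))
    val lines: DStream[String] = ssc.socketTextStream("localhost", 9999)
    // Persisting a windowed stream now sets the storage level of the underlying
    // DStream, so the unioned window RDDs do not store duplicate copies of the data.
    val windowed = lines.window(Seconds(10), Seconds(2)).persist()
    windowed.count().print()
    ssc.start()
  }
}
```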
| * Merge remote-tracking branch 'apache/master' into dstream-move [Tathagata Das, 2014-01-12, 5 files, -13/+50]
| |\ | | | | | | | | | | | | Conflicts: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
| * | Fixed persistence logic of WindowedDStream, and fixed default persistence ↵ [Tathagata Das, 2014-01-12, 5 files, -9/+33]
| | | | | | | | | | | | level of input streams.
| * | Merge branch 'error-handling' into dstream-move [Tathagata Das, 2014-01-12, 11 files, -70/+110]
| |\ \
| * | | Moved DStream, DStreamCheckpointData and PairDStream from ↵ [Tathagata Das, 2014-01-12, 30 files, -39/+46]
| | | | | | | | | | | | | | | | org.apache.spark.streaming to org.apache.spark.streaming.dstream.
* | | | Merge pull request #397 from pwendell/host-port [Reynold Xin, 2014-01-12, 5 files, -7/+1]
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | Remove now un-needed hostPort option I noticed this was logging some scary error messages in various places. After I looked into it, this is no longer really used. I removed the option and re-wrote the one remaining use case (it was unnecessary there anyways).
| * | | | Removing mentions in tests [Patrick Wendell, 2014-01-12, 5 files, -7/+1]
| | | | |
* | | | | Merge pull request #395 from hsaputra/remove_simpleredundantreturn_scala [Patrick Wendell, 2014-01-12, 6 files, -10/+10]
|\ \ \ \ \ \ | |_|_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove simple redundant return statements for Scala methods/functions Remove simple redundant return statements for Scala methods/functions: -) Only change simple return statements at the end of a method -) Ignore the complex if-else checks -) Ignore the ones inside synchronized -) Also make small changes, such as converting var to val where possible and removing () for simple getters This hopefully makes the review simpler =) Passes compile and tests.
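A hedged before/after illustration of the kind of change described; the method itself is made up for the example, not taken from the PR.

```scala
// Before: explicit return at the end of the method body.
def boundedCores(requested: Int, available: Int): Int = {
  return math.min(requested, available)
}

// After: the last expression of a Scala method is its result, so the return is redundant.
def boundedCoresCleaned(requested: Int, available: Int): Int = {
  math.min(requested, available)
}
```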
| * | | | Merge branch 'master' into remove_simpleredundantreturn_scala [Henry Saputra, 2014-01-12, 11 files, -204/+483]
| |\| | |
| * | | | Remove simple redundant return statements for Scala methods/functions: [Henry Saputra, 2014-01-12, 6 files, -10/+10]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | -) Only change simple return statements at the end of a method -) Ignore the complex if-else checks -) Ignore the ones inside synchronized
* | | | | Merge pull request #394 from tdas/error-handling [Patrick Wendell, 2014-01-12, 20 files, -228/+588]
|\ \ \ \ \ | | |_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Better error handling in Spark Streaming and more API cleanup Earlier, errors in jobs generated by Spark Streaming (or in the generation of jobs) could not be caught from the main driver thread (i.e. the thread that called StreamingContext.start()), as they would be thrown in different threads. With this change, after `ssc.start()` one can call `ssc.awaitTermination()`, which will block until the ssc is stopped or an exception is thrown. This makes it easier to debug. This change also adds `ssc.stop(<stop-spark-context>)`, with which you can stop the StreamingContext without stopping the SparkContext. Also fixes the bug that came up with PRs #393 and #381. The MetadataCleaner default value has been changed from 3500 to -1 for a normal SparkContext and 3600 when creating a StreamingContext. Also updated StreamingListenerBus with changes similar to SparkListenerBus in #392, and changed a lot of protected[streaming] to private[streaming].
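A minimal sketch of the new shutdown pattern, assuming a simple socket stream as a placeholder source.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "ErrorHandlingSketch", Seconds(1))
ssc.socketTextStream("localhost", 9999).print()
ssc.start()
// Blocks the driver thread until the context is stopped; exceptions raised in the
// job-generation or execution threads are rethrown here, where they can be caught.
ssc.awaitTermination()
```

To shut streaming down while keeping the SparkContext alive for further use, `ssc.stop(false)` passes false for the stop-spark-context flag.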
| * | | | Changed StreamingContext.stopForWait to awaitTermination. [Tathagata Das, 2014-01-12, 4 files, -15/+15]
| | | | |
| * | | | Fixed bugs to ensure better cleanup of JobScheduler, JobGenerator and ↵ [Tathagata Das, 2014-01-12, 9 files, -56/+96]
| | |_|/ | |/| | | | | | | | | | NetworkInputTracker upon close.
| * | | Fixed bugs. [Tathagata Das, 2014-01-12, 1 file, -1/+1]
| | | |
| * | | Merge remote-tracking branch 'apache/master' into error-handling [Tathagata Das, 2014-01-11, 1 file, -10/+19]
| |\ \ \ | | | |/ | | |/|
| * | | Added waitForStop and stop to JavaStreamingContext. [Tathagata Das, 2014-01-11, 2 files, -3/+23]
| | | |
| * | | Converted JobScheduler to use actors for event handling. Changed ↵ [Tathagata Das, 2014-01-11, 16 files, -184/+484]
| | | | | | | | | | | | | | | | protected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and added StreamingContextSuite.
* | | | Adding deprecated versions of old code [Patrick Wendell, 2014-01-12, 2 files, -8/+45]
| | | |
* | | | Rename DStream.foreach to DStream.foreachRDD [Patrick Wendell, 2014-01-12, 4 files, -12/+12]
| |/ / |/| | | | | | | | | | | | | | `foreachRDD` makes it clear that the granularity of this operator is per-RDD. As it stands, `foreach` is inconsistent with `map`, `filter`, and the other DStream operators, which get pushed down to individual records within each RDD.
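An illustrative use of the renamed operator, assuming `wordCounts` is a `DStream[(String, Long)]` built elsewhere.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// The function runs once per generated RDD (i.e. once per batch interval),
// not once per record, which the new name makes explicit.
wordCounts.foreachRDD { (rdd: RDD[(String, Long)], time: Time) =>
  println("Batch at " + time + " contains " + rdd.count() + " records")
}
```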
* | | Revert "Fix one unit test that was not setting spark.cleaner.ttl" [Patrick Wendell, 2014-01-11, 1 file, -1/+1]
| | | | | | | | | | | | This reverts commit 942c80b34c4642de3b0237761bc1aaeb8cbdd546.
* | | Merge pull request #381 from mateiz/default-ttl [Matei Zaharia, 2014-01-10, 1 file, -1/+1]
|\ \ \ | | | | | | | | | | | | | | | | | | | | Fix default TTL for metadata cleaner It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.
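A small hedged sketch of setting the TTL explicitly when it is wanted (the values are illustrative); with this fix, metadata cleaning stays off unless the property is set.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("TtlSketch")
  // Periodically forget metadata older than one hour; mainly useful for
  // long-running streaming jobs. Leaving this unset keeps the cleaner off.
  .set("spark.cleaner.ttl", "3600")
```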
| * | | Fix one unit test that was not setting spark.cleaner.ttl [Matei Zaharia, 2014-01-10, 1 file, -1/+1]
| | | |
* | | | Merge pull request #383 from tdas/driver-test [Patrick Wendell, 2014-01-10, 10 files, -193/+463]
|\ \ \ \ \ | | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | API for automatic driver recovery for streaming programs and other bug fixes 1. Added Scala and Java API for automatically loading a checkpoint if it exists in the provided checkpoint directory. Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext. Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)` returns a JavaStreamingContext. See the RecoverableNetworkWordCount below as an example of how to use it. 2. Refactored streaming.Checkpoint*** code to fix bugs and make the DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files. Also, it ensures that spark.driver.* and spark.hostPort are cleared from the SparkConf before being written to the checkpoint. 3. Fixed a bug in the cleaning up of checkpointed RDDs created by DStream. Specifically, this fix ensures that a checkpointed RDD's files are not prematurely cleaned up, thus ensuring reliable recovery. 4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clear records that were last accessed before a threshold timestamp). 5. Added caching of file modification times in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are 1000s of files. This cache is automatically cleared. This PR is not entirely final as I may make some minor additions - a Java example, and adding StreamingContext.getOrCreate to a unit test. Edit: Java example to be added later, unit test added.
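A minimal sketch of the new Scala recovery API, with a placeholder checkpoint directory and source.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/streaming-checkpoint"  // placeholder path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext("local[2]", "RecoverySketch", Seconds(1))
  ssc.checkpoint(checkpointDir)
  ssc.socketTextStream("localhost", 9999).print()
  ssc
}

// Rebuilds the context from checkpoint data if the directory holds a valid
// checkpoint, otherwise falls back to createContext() to build a fresh one.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```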
| * | | Merge remote-tracking branch 'apache/master' into driver-test [Tathagata Das, 2014-01-10, 2 files, -8/+8]
| |\| | | | | | | | | | | | | | | | | | Conflicts: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
| * | | Modified streaming.FailureSuite tests to test StreamingContext.getOrCreate. [Tathagata Das, 2014-01-10, 1 file, -21/+34]
| | | |
| * | | Updated docs based on Patrick's comments in PR 383. [Tathagata Das, 2014-01-10, 2 files, -11/+16]
| | | |
| * | | Merge branch 'driver-test' of github.com:tdas/incubator-spark into driver-test [Tathagata Das, 2014-01-10, 2 files, -55/+96]
| |\ \ \
| | * | | Removed spark.hostPort and other settings from SparkConf before saving to ↵ [Tathagata Das, 2014-01-10, 1 file, -18/+6]
| | | | | | | | | | | | | | | | | | | | checkpoint.
| | * | | Merge branch 'driver-test' of github.com:tdas/incubator-spark into driver-test [Tathagata Das, 2014-01-10, 4 files, -5/+5]
| | |\ \ \
| | * | | | Refactored graph checkpoint file reading and writing code to make it cleaner ↵ [Tathagata Das, 2014-01-10, 2 files, -49/+102]
| | | | | | | | | | | | | | | | | | | | | | | | and easily debuggable.
| * | | | Fixed conf/slaves and updated docs. [Tathagata Das, 2014-01-10, 2 files, -5/+15]
| | |/ / / | |/| | |
| * | | | Merge remote-tracking branch 'apache/master' into driver-test [Tathagata Das, 2014-01-09, 4 files, -5/+5]
| |\ \ \ \ | | |/ / / | |/| | / | | | |/ | | |/|
| * | | Fixed bugs in reading of checkpoints. [Tathagata Das, 2014-01-10, 2 files, -17/+20]
| | | |
| * | | Merge branch 'standalone-driver' into driver-test [Tathagata Das, 2014-01-09, 29 files, -1273/+270]
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
| * | | | Changed the way StreamingContext finds and reads checkpoint files, and added ↵ [Tathagata Das, 2014-01-09, 5 files, -109/+210]
| | | | | | | | | | | | | | | | | | | | JavaStreamingContext.getOrCreate.
| * | | | More bug fixes. [Tathagata Das, 2014-01-08, 1 file, -19/+26]
| | | | |
| * | | | Modified checkpoint file clearing policy. [Tathagata Das, 2014-01-08, 7 files, -52/+104]
| | | | |
| * | | | Added a hashmap to cache file mod times. [Tathagata Das, 2014-01-05, 1 file, -6/+24]
| | | | |
| * | | | Merge branch 'filestream-fix' into driver-test [Tathagata Das, 2014-01-06, 7 files, -142/+219]
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
| * | | | | Bug fixes to the DriverRunner and minor changes here and there. [Tathagata Das, 2014-01-06, 2 files, -5/+5]
| | | | | |
| * | | | | Added StreamingContext.getOrCreate for automatic recovery, and added ↵ [Tathagata Das, 2014-01-02, 2 files, -4/+26]
| | | | | | | | | | | | | | | | | | | | | | | | RecoverableNetworkWordCount example to use it.
* | | | | | Merge pull request #377 from andrewor14/master [Patrick Wendell, 2014-01-10, 1 file, -10/+19]
|\ \ \ \ \ \ | |_|_|_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | External Sorting for Aggregator and CoGroupedRDDs (Revisited) (This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving) The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted. The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order. Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
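A deliberately simplified, hypothetical sketch of the spill-to-disk idea described above, not the actual ExternalAppendOnlyMap (which partitions spills by hash, merges with a streamed sort, and uses a memory-based rather than entry-count threshold).

```scala
import java.io._
import scala.collection.mutable

// Toy spilling map: combine values in memory, spill sorted batches to temp
// files once the map grows past a threshold.
class SpillingMapSketch[K, V](mergeValue: (V, V) => V, maxEntries: Int = 10000) {
  private val inMemory = mutable.HashMap[K, V]()
  private val spillFiles = mutable.ArrayBuffer[File]()

  def insert(key: K, value: V) {
    inMemory(key) = inMemory.get(key).map(mergeValue(_, value)).getOrElse(value)
    if (inMemory.size >= maxEntries) spill()
  }

  private def spill() {
    val file = File.createTempFile("spill-", ".bin")
    val out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)))
    // Write entries sorted by key hash so spill files can later be merged in order.
    inMemory.toSeq.sortBy(_._1.hashCode).foreach(pair => out.writeObject(pair))
    out.close()
    spillFiles += file
    inMemory.clear()
  }

  // A complete iterator would k-way merge inMemory with the sorted spill
  // files by key hash; omitted here for brevity.
}
```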
| * | | | | Fix wonky imports from merge [Andrew Or, 2014-01-09, 1 file, -8/+1]
| | | | | |
| * | | | | Merge github.com:apache/incubator-spark [Andrew Or, 2014-01-09, 17 files, -1189/+135]
| |\ \ \ \ \ | | | |_|_|/ | | |/| | | | | | | | | | | | | | | | | | | | | Conflicts: core/src/main/scala/org/apache/spark/SparkEnv.scala streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
| * | | | | Merge remote-tracking branch 'spark/master' [Andrew Or, 2014-01-02, 21 files, -241/+378]
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
| * | | | | | Address Aaron's comments [Andrew Or, 2013-12-29, 1 file, -4/+4]
| | | | | | |
| * | | | | | Fix streaming JavaAPISuite again [Andrew Or, 2013-12-26, 1 file, -8/+12]
| | | | | | |
| * | | | | | Fix streaming JavaAPISuite that depended on order [Aaron Davidson, 2013-12-26, 1 file, -11/+16]
| | |_|/ / / | |/| | | |
* | | | | | Merge pull request #363 from pwendell/streaming-logs [Patrick Wendell, 2014-01-09, 2 files, -5/+5]
|\ \ \ \ \ \ | |_|_|/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Set default logging to WARN for Spark streaming examples. This programmatically sets the log level to WARN by default for the streaming examples. If the user has already specified a log4j.properties file, the user's file will take precedence over this default.
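A hedged sketch of the pattern described, using the log4j 1.x API that Spark used at the time; the helper name is made up for illustration.

```scala
import org.apache.log4j.{Level, Logger}

// Quiet noisy INFO output for example programs, but only when the user has
// not already configured log4j (an existing appender implies a user config).
def setDefaultLogLevel() {
  val log4jConfigured = Logger.getRootLogger.getAllAppenders.hasMoreElements
  if (!log4jConfigured) {
    Logger.getRootLogger.setLevel(Level.WARN)
  }
}
```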