path: root/streaming
Commit message | Author | Age | Files | Lines
* Merge remote-tracking branch 'apache/master' into error-handling | Tathagata Das | 2014-01-11 | 1 | -10/+19
* Revert "Fix one unit test that was not setting spark.cleaner.ttl" | Patrick Wendell | 2014-01-11 | 1 | -1/+1
    This reverts commit 942c80b34c4642de3b0237761bc1aaeb8cbdd546.
* Merge pull request #381 from mateiz/default-ttl | Matei Zaharia | 2014-01-10 | 1 | -1/+1
    Fix default TTL for metadata cleaner. It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.
* Fix one unit test that was not setting spark.cleaner.ttl | Matei Zaharia | 2014-01-10 | 1 | -1/+1
* Merge pull request #383 from tdas/driver-test | Patrick Wendell | 2014-01-10 | 10 | -193/+463
    API for automatic driver recovery for streaming programs and other bug fixes.
    1. Added Scala and Java APIs for automatically loading a checkpoint if one exists in the provided checkpoint directory.
       Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext.
       Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)` returns a JavaStreamingContext.
       See the RecoverableNetworkWordCount example for how to use it.
    2. Refactored the streaming.Checkpoint* code to fix bugs and make DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files, and it ensures that spark.driver.* and spark.hostPort are cleared from SparkConf before being written to the checkpoint.
    3. Fixed a bug in the cleanup of checkpointed RDDs created by DStream. Specifically, this fix ensures that a checkpointed RDD's files are not prematurely cleaned up, making recovery reliable.
    4. TimeStampedHashMap was upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clearing records last accessed before a threshold timestamp).
    5. Added caching of file modification times in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds when there are thousands of files. This cache is cleared automatically.
    Edit: Java example to be added later; unit test added.
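    The getOrCreate recovery pattern in item 1 can be sketched outside Spark as: load serialized state from the checkpoint directory if it exists, otherwise call the user's factory. This is a minimal Python sketch of the idea only (the `Context` class and file layout are hypothetical, not Spark's implementation):

    ```python
    import os
    import pickle
    import tempfile

    class Context:
        """Stand-in for a StreamingContext: just a dict of recoverable state."""
        def __init__(self, state):
            self.state = state

        def checkpoint(self, path):
            # Persist the state so a restarted driver can recover it.
            with open(path, "wb") as f:
                pickle.dump(self.state, f)

    def get_or_create(checkpoint_path, create_fn):
        """Recover a context from an existing checkpoint, or build a
        fresh one with the supplied factory function."""
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path, "rb") as f:
                return Context(pickle.load(f))
        return create_fn()

    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.bin")

    # First run: no checkpoint exists, so the factory is used.
    ctx = get_or_create(ckpt, lambda: Context({"offset": 0}))
    ctx.state["offset"] = 42
    ctx.checkpoint(ckpt)

    # Simulated driver restart: state is recovered, factory is not called.
    ctx2 = get_or_create(ckpt, lambda: Context({"offset": 0}))
    print(ctx2.state["offset"])  # 42
    ```

    The key property, as with the Spark API, is that the same call works on both the first run and after a crash: the factory only runs when no checkpoint is present.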
* Merge pull request #377 from andrewor14/master | Patrick Wendell | 2014-01-10 | 1 | -10/+19
    External Sorting for Aggregator and CoGroupedRDDs (Revisited)
    (This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins/GitHub was misbehaving.)
    The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.
    The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts the buffer before writing it out to disk, and later merges these buffers back in sorted order. Under normal circumstances, in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
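    The spill-and-merge strategy described above (buffer in memory, write sorted runs to disk past a threshold, then k-way merge the runs) can be sketched in a few lines. This is a conceptual Python sketch, not ExternalAppendOnlyMap itself; the size threshold here is a key count rather than a byte estimate:

    ```python
    import heapq
    import os
    import tempfile

    class SpillingMap:
        """Append-only map that spills sorted runs to disk once a size
        threshold is exceeded, then merges all runs in key order."""
        def __init__(self, threshold):
            self.threshold = threshold
            self.current = {}      # in-memory buffer
            self.spills = []       # paths of sorted on-disk runs

        def insert(self, key, value):
            self.current[key] = self.current.get(key, 0) + value
            if len(self.current) > self.threshold:
                self._spill()

        def _spill(self):
            # Sort the buffer before writing, so runs can be merged later.
            fd, path = tempfile.mkstemp()
            with os.fdopen(fd, "w") as f:
                for k in sorted(self.current):
                    f.write(f"{k}\t{self.current[k]}\n")
            self.spills.append(path)
            self.current = {}

        def merged(self):
            """K-way merge of the in-memory buffer and every spilled run,
            combining values for equal keys."""
            runs = [sorted(self.current.items())]
            for path in self.spills:
                with open(path) as f:
                    runs.append([(k, int(v)) for k, v in
                                 (line.split("\t") for line in f)])
            out = []
            for k, v in heapq.merge(*runs):
                if out and out[-1][0] == k:
                    out[-1] = (k, out[-1][1] + v)
                else:
                    out.append((k, v))
            return out

    m = SpillingMap(threshold=2)
    for k in ["a", "b", "c", "a", "b", "a"]:
        m.insert(k, 1)
    print(m.merged())  # [('a', 3), ('b', 2), ('c', 1)]
    ```

    As in the PR's description, the common case (no spill) stays a plain in-memory map; only the over-threshold path pays for disk I/O and merging.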
* Fix wonky imports from merge | Andrew Or | 2014-01-09 | 1 | -8/+1
* Merge github.com:apache/incubator-spark | Andrew Or | 2014-01-09 | 17 | -1189/+135
    Conflicts:
        core/src/main/scala/org/apache/spark/SparkEnv.scala
        streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
* Merge remote-tracking branch 'spark/master' | Andrew Or | 2014-01-02 | 21 | -241/+378
    Conflicts:
        core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
* Address Aaron's comments | Andrew Or | 2013-12-29 | 1 | -4/+4
* Fix streaming JavaAPISuite again | Andrew Or | 2013-12-26 | 1 | -8/+12
* Fix streaming JavaAPISuite that depended on order | Aaron Davidson | 2013-12-26 | 1 | -11/+16
* Added waitForStop and stop to JavaStreamingContext. | Tathagata Das | 2014-01-11 | 2 | -3/+23
* Converted JobScheduler to use actors for event handling | Tathagata Das | 2014-01-11 | 16 | -184/+484
    Changed protected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and StreamingContextSuite.
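    The actor-based event handling this commit introduces can be illustrated with a single-threaded mailbox loop: every event is processed by one thread, so handler state needs no locks, and waitForStop blocks callers until a stop event has been handled. This is a hedged Python sketch of the pattern (names are hypothetical, not the JobScheduler's API):

    ```python
    import queue
    import threading

    class EventLoop:
        """Actor-style event loop: all events pass through one mailbox
        and are handled by a single thread."""
        def __init__(self):
            self._mailbox = queue.Queue()
            self._stopped = threading.Event()
            self.completed_jobs = []
            threading.Thread(target=self._run, daemon=True).start()

        def _run(self):
            while True:
                event, payload = self._mailbox.get()  # FIFO processing
                if event == "job_completed":
                    self.completed_jobs.append(payload)
                elif event == "stop":
                    self._stopped.set()
                    return

        def post(self, event, payload=None):
            """Called from any thread; never touches handler state directly."""
            self._mailbox.put((event, payload))

        def wait_for_stop(self, timeout=None):
            """Block the caller until the loop has processed a stop event."""
            return self._stopped.wait(timeout)

    loop = EventLoop()
    loop.post("job_completed", "batch-0")
    loop.post("job_completed", "batch-1")
    loop.post("stop")
    assert loop.wait_for_stop(timeout=5)
    print(loop.completed_jobs)  # ['batch-0', 'batch-1']
    ```

    Because the mailbox is FIFO, everything posted before "stop" is guaranteed to be handled before wait_for_stop returns.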
* Merge remote-tracking branch 'apache/master' into driver-test | Tathagata Das | 2014-01-10 | 2 | -8/+8
    Conflicts:
        streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
* Merge pull request #363 from pwendell/streaming-logs | Patrick Wendell | 2014-01-09 | 2 | -5/+5
    Set default logging to WARN for Spark streaming examples. This programmatically sets the log level to WARN by default for streaming tests. If the user has already specified a log4j.properties file, the user's file will take precedence over this default.
* Set default logging to WARN for Spark streaming examples. | Patrick Wendell | 2014-01-09 | 2 | -5/+5
    This programmatically sets the log level to WARN by default for streaming tests. If the user has already specified a log4j.properties file, the user's file will take precedence over this default.
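    The precedence rule in this commit (apply a WARN default only when the user has not configured logging, so user settings always win) is a general pattern. A minimal sketch using Python's logging module as a stand-in for log4j:

    ```python
    import logging

    def apply_default_level(name="org.apache.spark.streaming.examples"):
        """Install a WARN default only if this logger is unconfigured,
        so an existing user configuration takes precedence."""
        logger = logging.getLogger(name)
        already_configured = bool(logger.handlers) or logger.level != logging.NOTSET
        if not already_configured:
            logger.setLevel(logging.WARNING)
        return logger

    # Unconfigured logger: the WARN default is applied.
    log = apply_default_level()
    print(logging.getLevelName(log.getEffectiveLevel()))  # WARNING

    # User-configured logger: the user's DEBUG level is left untouched.
    user_log = logging.getLogger("user.app")
    user_log.setLevel(logging.DEBUG)
    apply_default_level("user.app")
    print(logging.getLevelName(user_log.getEffectiveLevel()))  # DEBUG
    ```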
* Modified streaming.FailureSuite tests to test StreamingContext.getOrCreate. | Tathagata Das | 2014-01-10 | 1 | -21/+34
* Updated docs based on Patrick's comments in PR 383. | Tathagata Das | 2014-01-10 | 2 | -11/+16
* Merge branch 'driver-test' of github.com:tdas/incubator-spark into driver-test | Tathagata Das | 2014-01-10 | 2 | -55/+96
* Removed spark.hostPort and other settings from SparkConf before saving to checkpoint. | Tathagata Das | 2014-01-10 | 1 | -18/+6
* Merge branch 'driver-test' of github.com:tdas/incubator-spark into driver-test | Tathagata Das | 2014-01-10 | 4 | -5/+5
* Refactored graph checkpoint file reading and writing code to make it cleaner and easily debuggable. | Tathagata Das | 2014-01-10 | 2 | -49/+102
* Fixed conf/slaves and updated docs. | Tathagata Das | 2014-01-10 | 2 | -5/+15
* Merge remote-tracking branch 'apache/master' into driver-test | Tathagata Das | 2014-01-09 | 4 | -5/+5
* Use typed getters for configuration settings | Matei Zaharia | 2014-01-09 | 4 | -5/+5
* Fixed bugs in reading of checkpoints. | Tathagata Das | 2014-01-10 | 2 | -17/+20
* Merge branch 'standalone-driver' into driver-test | Tathagata Das | 2014-01-09 | 29 | -1273/+270
    Conflicts:
        core/src/main/scala/org/apache/spark/SparkContext.scala
        core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala
        examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java
        streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
        streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
        streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
        streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
* Merge remote-tracking branch 'apache/master' into project-refactor | Tathagata Das | 2014-01-06 | 21 | -234/+377
    Conflicts:
        examples/src/main/java/org/apache/spark/streaming/examples/JavaFlumeEventCount.java
        streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
        streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
        streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
        streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
        streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
* Removing SPARK_EXAMPLES_JAR in the code | Patrick Wendell | 2014-01-05 | 2 | -10/+21
* Merge pull request #297 from tdas/window-improvement | Patrick Wendell | 2014-01-02 | 5 | -15/+45
    Improvements to DStream window ops and refactoring of Spark's CheckpointSuite.
    - Added a new RDD, PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each are unified into a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD is the most common preferred location of the corresponding partitions of the parent RDDs. For example, partition 0 of the unified RDD will be located where most of the parent RDDs' partition 0 are located.
    - Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads. I can go into greater detail if necessary.
    - Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD.
    - Added a mapSideCombine option to combineByKeyAndWindow.
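    The windowed-reduce strategy described above (reduce each batch by key once, then cheaply combine the pre-reduced batches inside each window instead of re-shuffling every record in the window) can be sketched without Spark. A minimal Python sketch using word counts, where each Counter stands in for a per-batch reduceByKey result:

    ```python
    from collections import Counter

    def windowed_counts(batches, window):
        """For each batch index i, return the key counts over the last
        `window` batches, computed by merging per-batch pre-reduced maps."""
        per_batch = [Counter(b) for b in batches]   # per-batch reduceByKey
        results = []
        for i in range(len(per_batch)):
            combined = Counter()
            # Union across the window: merges small reduced maps, so no
            # second shuffle over the raw window records is needed.
            for c in per_batch[max(0, i - window + 1): i + 1]:
                combined.update(c)
            results.append(dict(combined))
        return results

    batches = [["a", "b"], ["b", "b"], ["a"]]
    print(windowed_counts(batches, window=2))
    # [{'a': 1, 'b': 1}, {'a': 1, 'b': 3}, {'b': 2, 'a': 1}]
    ```

    Each batch is reduced exactly once and then reused by every window that covers it, which mirrors the shuffle-avoiding behavior the PR claims for reduceByKeyAndWindow.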
* Removed unnecessary options from WindowedDStream. | Tathagata Das | 2013-12-26 | 1 | -5/+3
* Updated groupByKeyAndWindow to be computed incrementally, and added mapSideCombine to combineByKeyAndWindow. | Tathagata Das | 2013-12-26 | 5 | -12/+34
* Merge branch 'scheduler-update' into window-improvement | Tathagata Das | 2013-12-23 | 4 | -5/+32
* Merge branch 'scheduler-update' into window-improvement | Tathagata Das | 2013-12-19 | 57 | -632/+1024
    Conflicts:
        streaming/src/main/scala/org/apache/spark/streaming/dstream/WindowedDStream.scala
* Added flag in window operation to use partition-aware union. | Tathagata Das | 2013-11-21 | 1 | -1/+3
* Added partitioner aware union, modified DStream.window. | Tathagata Das | 2013-11-21 | 1 | -39/+2
* Added partition aware union to improve reduceByKeyAndWindow | Tathagata Das | 2013-11-20 | 1 | -2/+49
* Miscellaneous fixes from code review. | Matei Zaharia | 2014-01-01 | 3 | -6/+6
    Also replaced SparkConf.getOrElse with just a "get" that takes a default value, and added getInt, getLong, etc. to make code that uses this simpler later on.
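    The configuration pattern this commit describes (a `get` that takes a default value, plus typed helpers that parse string-valued settings) can be sketched briefly. A hedged Python sketch of the idea; the class and method names here are illustrative, not SparkConf's exact API:

    ```python
    class Conf:
        """Configuration map with a defaulting `get` and typed getters.
        Values are stored as strings, as in Spark-style configs."""
        def __init__(self):
            self._settings = {}

        def set(self, key, value):
            self._settings[key] = str(value)
            return self  # allow chaining: Conf().set(...).set(...)

        def get(self, key, default):
            return self._settings.get(key, default)

        def get_int(self, key, default):
            # Parse the stored string; fall back to the caller's default.
            return int(self._settings.get(key, default))

        def get_boolean(self, key, default):
            return self._settings.get(key, str(default)).lower() == "true"

    conf = Conf().set("spark.cleaner.ttl", 3600)
    print(conf.get_int("spark.cleaner.ttl", 0))  # 3600
    print(conf.get_int("spark.missing.key", 7))  # 7
    ```

    The typed getters keep call sites short: callers write one line instead of a get-then-parse-then-default dance at every use.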
* Merge remote-tracking branch 'apache/master' into conf2 | Matei Zaharia | 2014-01-01 | 7 | -14/+0
    Conflicts:
        core/src/main/scala/org/apache/spark/SparkContext.scala
        core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala
        core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
* Merge remote-tracking branch 'apache-github/master' into log4j-fix-2 | Patrick Wendell | 2014-01-01 | 7 | -142/+220
    Conflicts:
        streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
* Removing initLogging entirely | Patrick Wendell | 2013-12-30 | 7 | -13/+0
* Fix two compile errors introduced in merge | Matei Zaharia | 2013-12-31 | 2 | -2/+2
* Merge remote-tracking branch 'apache/master' into conf2 | Matei Zaharia | 2013-12-31 | 7 | -141/+221
    Conflicts:
        core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
        streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
        streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
* Fix a few settings that were being read as system properties after merge | Matei Zaharia | 2013-12-29 | 1 | -2/+2
* Merge remote-tracking branch 'origin/master' into conf2 | Matei Zaharia | 2013-12-29 | 23 | -203/+583
    Conflicts:
        core/src/main/scala/org/apache/spark/SparkContext.scala
        core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
        core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
        core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
        core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala
        core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala
        core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala
        core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
        new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
        streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
        streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
        streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
        streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala
        streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
        streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
        streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
        streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala
* Fix other failing tests | Matei Zaharia | 2013-12-28 | 4 | -18/+22
* Add a StreamingContext constructor that takes a conf object | Matei Zaharia | 2013-12-28 | 1 | -1/+20
* Fix CheckpointSuite test failures | Matei Zaharia | 2013-12-28 | 1 | -4/+3
* Fix test failures due to setting / clearing clock type in Streaming | Matei Zaharia | 2013-12-28 | 4 | -10/+14