Commit messages

Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala

Adjusted visibility of various components and documentation for the 0.9.0 release.

Automatically unpersisting RDDs that have been cleaned up from DStreams

Earlier, RDDs generated by DStreams were forgotten but not unpersisted; the system relied on the BlockManager's natural LRU eviction to drop the data. The cleaner.ttl was a hammer for cleaning up RDDs, but it has to be set separately and very conservatively (a few minutes at best). Automatic unpersisting lets the system handle this itself, which reduces memory usage. As a side effect it also improves GC performance, since fewer objects are kept in memory. In fact, for some workloads it may allow RDDs to be cached deserialized, which speeds up processing without much GC overhead.

This is disabled by default. To enable it, set the configuration spark.streaming.unpersist to true. In a future release, this will be set to true by default.

Also reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second. From my conversation with Matei, there does not seem to be any good reason why the sleep that lets messages be sent out needs to be so long.
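
A minimal sketch of opting in to the new behavior, assuming a SparkConf-based setup; the master and app name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Opt in to automatic unpersisting of RDDs that DStreams have forgotten.
// Disabled by default in this release; a future release will default it to true.
val conf = new SparkConf()
  .setMaster("local[2]")                      // placeholder master
  .setAppName("UnpersistDemo")                // placeholder app name
  .set("spark.streaming.unpersist", "true")

val ssc = new StreamingContext(conf, Seconds(1))
```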

Improved logic for finding new files in FileInputDStream

Earlier, if HDFS had a hiccup and reported the existence of a new file (mod time T sec) at time T + 1 sec, fileStream could have missed that file. With this change, it should be able to find files that are delayed by up to <batch size> seconds. That is, even if a file is reported at T + <batch time> sec, the file stream should be able to catch it.

The new logic, at a high level, is as follows (see the sketch below). It keeps track of the new files found in the previous interval and the mod time of the oldest of those files (let's call it X). In the current interval, it ignores files that were seen in the previous interval and files whose mod time is older than X. So a file newly reported by HDFS in the current interval, but with a mod time in the previous interval, will be considered. However, files whose mod time is earlier than the previous interval (that is, earlier than X) will be ignored. This is the current limitation, and a future version will improve this behavior further.

Also reduced line lengths in DStream to <= 100 chars.
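
A simplified sketch of the selection logic described above; the names (FileInfo, NewFileFilter, selectNew) are illustrative and not the actual FileInputDStream internals:

```scala
case class FileInfo(path: String, modTime: Long)

class NewFileFilter {
  private var prevFiles = Set.empty[String] // files found in the previous interval
  private var oldestPrevModTime = 0L        // mod time of the oldest of those files ("X")

  /** Pick the files in `candidates` that should be treated as new this interval. */
  def selectNew(candidates: Seq[FileInfo]): Seq[FileInfo] = {
    // Ignore files already seen in the previous interval and files older than X.
    val fresh = candidates.filter { f =>
      f.modTime >= oldestPrevModTime && !prevFiles.contains(f.path)
    }
    if (fresh.nonEmpty) {
      prevFiles = fresh.map(_.path).toSet
      oldestPrevModTime = fresh.map(_.modTime).min
    }
    fresh
  }
}
```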

were useless as InputDStream has been made to register itself. Also made DStream.register() private[streaming]; it is not useful to expose this confusing function. Updated a lot of documentation.

`sbt/sbt doc` used to fail. This fixed it.

Moved DStream and PairDStream to org.apache.spark.streaming.dstream

Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been moved from `org.apache.spark.streaming.DStream` to `org.apache.spark.streaming.dstream.DStream`. The package name is a little long, but it is better to keep it consistent with Spark's structure.

Also fixed persistence of windowed DStreams. The RDDs generated by a windowed DStream are essentially unions of the underlying RDDs, and persisting these union RDDs would store numerous copies of the underlying data. Instead, setting the persistence level on the windowed DStream now sets the persistence level of the underlying DStream.
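
After this change, the import mirrors the RDD package layout:

```scala
// Old location (pre-change): import org.apache.spark.streaming.DStream
// New location, consistent with org.apache.spark.rdd.RDD:
import org.apache.spark.streaming.dstream.DStream
```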

Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala

level of input streams.

org.apache.spark.streaming to org.apache.spark.streaming.dstream.

Remove now-unneeded hostPort option

I noticed this was logging some scary error messages in various places. After looking into it, this is no longer really used. I removed the option and rewrote the one remaining use case (it was unnecessary there anyway).

Remove simple redundant return statements for Scala methods/functions

Remove simple redundant return statements for Scala methods/functions:
-) Only change simple return statements at the end of a method
-) Ignore the complex if-else checks
-) Ignore the ones inside synchronized blocks
-) Also make small changes such as var to val where possible, and remove () from simple getters
This hopefully makes the review simpler =)
Passes compile and tests. A before/after sketch follows this list.
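
An illustrative before/after for these rules, on hypothetical code rather than the actual diff:

```scala
object Before {
  def sum(xs: Seq[Int]): Int = {
    var total = 0
    for (x <- xs) total += x
    return total               // simple trailing return: redundant in Scala
  }
  def name(): String = { return "worker" }
}

object After {
  def sum(xs: Seq[Int]): Int = {
    var total = 0
    for (x <- xs) total += x
    total                      // the last expression is already the result
  }
  def name: String = "worker"  // () removed from a simple getter
}
```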

Better error handling in Spark Streaming and more API cleanup

Earlier, errors in jobs generated by Spark Streaming (or in the generation of jobs) could not be caught from the main driver thread (i.e. the thread that called StreamingContext.start()), as they would be thrown in different threads. With this change, after `ssc.start()`, one can call `ssc.awaitTermination()`, which blocks until the ssc is stopped or an exception occurs. This makes it easier to debug.

This change also adds ssc.stop(<stop-spark-context>), with which you can stop the StreamingContext without stopping the SparkContext.

Also fixes the bug that came up with PRs #393 and #381. The MetadataCleaner default value has been changed from 3500 to -1 for a normal SparkContext and 3600 when creating a StreamingContext. Also updated StreamingListenerBus with changes similar to SparkListenerBus in #392.

And changed a lot of protected[streaming] to private[streaming].
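
A minimal sketch of the new shutdown pattern; the master, app name, and socket source are placeholders:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "AwaitDemo", Seconds(1))
ssc.socketTextStream("localhost", 9999).print()  // placeholder pipeline

ssc.start()
try {
  // Blocks the driver thread until the context stops; errors raised in the
  // streaming threads surface here, so they can finally be caught and debugged.
  ssc.awaitTermination()
} finally {
  ssc.stop(false)  // stop the StreamingContext but keep the SparkContext alive
}
```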

NetworkInputTracker upon close.

protected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and StreamingContextSuite.

`foreachRDD` makes it clear that the granularity of this operator is per-RDD. As it stands, `foreach` is inconsistent with `map`, `filter`, and the other DStream operators, which get pushed down to individual records within each RDD.
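
A short sketch of the distinction; the socket source is a placeholder:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "ForeachRDDDemo", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

lines.map(_.toUpperCase)    // map/filter apply to each individual record
lines.foreachRDD { rdd =>   // foreachRDD runs once per generated RDD
  println("records in this batch: " + rdd.count())
}
```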

Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala

Set default logging to WARN for Spark Streaming examples.

This programmatically sets the log level to WARN by default for streaming tests. If the user has already specified a log4j.properties file, the user's file will take precedence over this default.
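
A minimal sketch of this pattern, assuming log4j 1.x; checking the root logger's appenders is one plausible way to detect a user-supplied log4j.properties, not necessarily the exact check used here:

```scala
import org.apache.log4j.{Level, Logger}

// Only lower the default verbosity when log4j has not already been
// configured by a user-provided log4j.properties file.
val log4jConfigured = Logger.getRootLogger.getAllAppenders.hasMoreElements
if (!log4jConfigured) {
  Logger.getRootLogger.setLevel(Level.WARN)
}
```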

checkpoint.

and easily debuggable.

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala
	examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java
	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala

Conflicts:
	examples/src/main/java/org/apache/spark/streaming/examples/JavaFlumeEventCount.java
	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
	streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
	streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala

Improvements to DStream window ops and refactoring of Spark's CheckpointSuite

- Added a new RDD, PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified into a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, the location of partition 0 of the unified RDD will be where most of the parent RDDs' partition 0s are located.

- Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both operations now work by doing a per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads (see the sketch below).

- Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary.

- Added a mapSideCombine option to combineByKeyAndWindow.
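
A user-facing sketch of a windowed count that benefits from this change; the source and durations are placeholders, and the PartitionerAwareUnionRDD work happens inside the window operator:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair DStream operations

val ssc = new StreamingContext("local[2]", "WindowDemo", Seconds(2))
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

val counts = words.map(w => (w, 1)).reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // associative reduce, applied per batch first
  Seconds(30),                // window duration
  Seconds(10))                // slide duration
counts.print()
```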

mapSideCombine to combineByKeyAndWindow.

Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/WindowedDStream.scala