| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Improved logic of finding new files in FileInputDStream
Earlier, if HDFS has a hiccup and reports a existence of a new file (mod time T sec) at time T + 1 sec, then fileStream could have missed that file. With this change, it should be able to find files that are delayed by up to <batch size> seconds. That is, even if file is reported at T + <batch time> sec, file stream should be able to catch it.
The new logic, at a high level, is as follows. It keeps track of the new files it found in the previous interval and mod time of the oldest of those files (lets call it X). Then in the current interval, it will ignore those files that were seen in the previous interval and those which have mod time older than X. So if a new file gets reported by HDFS that in the current interval, but has mod time in the previous interval, it will be considered. However, if the mod time earlier than the previous interval (that is, earlier than X), they will be ignored. This is the current limitation, and future version would improve this behavior further.
Also reduced line lengths in DStream to <=100 chars.
|
| | |
|
|\ \
| |/
|/|
| |
| |
| | |
Updated JavaStreamingContext to make scaladoc compile.
`sbt/sbt doc` used to fail. This fixed it.
|
|/
|
|
| |
`sbt/sbt doc` used to fail. This fixed it.
|
|\
| |
| |
| |
| |
| |
| |
| | |
Moved DStream and PairDSream to org.apache.spark.streaming.dstream
Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been moved from `org.apache.spark.streaming.DStream` to `org.apache.spark.streaming.dstream.DStream`. I know that the package name is a little long, but I think its better to keep it consistent with Spark's structure.
Also fixed persistence of windowed DStream. The RDDs generated generated by windowed DStream are essentially unions of underlying RDDs, and persistent these union RDDs would store numerous copies of the underlying data. Instead setting the persistence level on the windowed DStream is made to set the persistence level of the underlying DStream.
|
| | |
|
| |\
| | |
| | |
| | |
| | | |
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
|
| | |
| | |
| | |
| | | |
level of input streams.
|
| |\ \ |
|
| |\ \ \ |
|
| | | | |
| | | | |
| | | | |
| | | | | |
org.apache.spark.streaming to org.apache.spark.streaming.dstream.
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Remove now un-needed hostPort option
I noticed this was logging some scary error messages in various places. After I looked into it, this is no longer really used. I removed the option and re-wrote the one remaining use case (it was unnecessary there anyways).
|
| | | | | | |
|
| | | | | | |
|
|\ \ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Disable shuffle file consolidation by default
After running various performance tests for the 0.9 release, this still seems to have performance issues even on XFS. So let's keep this off-by-default for 0.9 and users can experiment with it depending on their disk configurations.
|
| |/ / / / / |
|
|\ \ \ \ \ \
| |_|_|_|_|/
|/| | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Remove simple redundant return statements for Scala methods/functions
Remove simple redundant return statements for Scala methods/functions:
-) Only change simple return statements at the end of method
-) Ignore the complex if-else check
-) Ignore the ones inside synchronized
-) Add small changes to making var to val if possible and remove () for simple get
This hopefully makes the review simpler =)
Pass compile and tests.
|
| | | | | | |
|
| | | | | | |
|
| |\| | | | |
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
-) Only change simple return statements at the end of method
-) Ignore the complex if-else check
-) Ignore the ones inside synchronized
|
|\ \ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Better error handling in Spark Streaming and more API cleanup
Earlier errors in jobs generated by Spark Streaming (or in the generation of jobs) could not be caught from the main driver thread (i.e. the thread that called StreamingContext.start()) as it would be thrown in different threads. With this change, after `ssc.start`, one can call `ssc.awaitTermination()` which will be block until the ssc is closed, or there is an exception. This makes it easier to debug.
This change also adds ssc.stop(<stop-spark-context>) where you can stop StreamingContext without stopping the SparkContext.
Also fixes the bug that came up with PRs #393 and #381. MetadataCleaner default value has been changed from 3500 to -1 for normal SparkContext and 3600 when creating a StreamingContext. Also, updated StreamingListenerBus with changes similar to SparkListenerBus in #392.
And changed a lot of protected[streaming] to private[streaming].
|
| |\ \ \ \ \ \
| | |_|_|_|/ /
| |/| | | | /
| | | |_|_|/
| | |/| | | |
|
| | | | | | |
|
| | |_|_|/
| |/| | |
| | | | |
| | | | | |
NetworkInputTracker upon close.
|
| | | | | |
|
| |\ \ \ \
| | | |_|/
| | |/| | |
|
| | | | | |
|
| | | | |
| | | | |
| | | | |
| | | | | |
protected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and StreamingContextSuite.
|
|\ \ \ \ \
| |_|_|/ /
|/| | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Rename DStream.foreach to DStream.foreachRDD
`foreachRDD` makes it clear that the granularity of this operator is per-RDD.
As it stands, `foreach` is inconsistent with with `map`, `filter`, and the other
DStream operators which get pushed down to individual records within each RDD.
|
| | | | | |
|
| | |/ /
| |/| |
| | | |
| | | |
| | | |
| | | | |
`foreachRDD` makes it clear that the granularity of this operator is per-RDD.
As it stands, `foreach` is inconsistent with with `map`, `filter`, and the other
DStream operators which get pushed down to individual records within each RDD.
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Setting load defaults to true in executor
This preserves the behavior in earlier releases. If properties are set for the executors via `spark-env.sh` on the slaves, then they should take precedence over spark defaults. This is useful for if system administrators are setting properties for a standalone cluster, such as shuffle locations.
/cc @andrewor14 who initially reported this issue.
|
| |/ / / |
|
|\ \ \ \
| |/ / /
|/| | |
| | | |
| | | |
| | | | |
Stop SparkListenerBus daemon thread when DAGScheduler is stopped.
Otherwise this leads to hundreds of SparkListenerBus daemon threads in our unit tests (and also problematic if user applications launches multiple SparkContext).
|
| | | | |
|
|\ \ \ \
| | | | |
| | | | |
| | | | | |
Minor update for clone writables and more documentation.
|
| | | | | |
|
| |/ / / |
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Fix UI bug introduced in #244.
The 'duration' field was incorrectly renamed to 'task time' in the table that
lists stages.
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The 'duration' field was incorrectly renamed to 'task time' in the table that
lists stages.
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Revert PR 381
This PR missed a bunch of test cases that require "spark.cleaner.ttl". I think it is what is causing test failures on Jenkins right now (though it's a bit hard to tell because the DNS for cs.berkeley.edu is down).
I'm submitting this to see if it fixes jeknins. I did try just patching various tests but it was taking a really long time because there are a bunch of them, so for now I'm just seeing if a revert works.
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
This reverts commit 669ba4caa95014f4511f842206c3e506f1a41a7a.
|
|/ / / / /
| | | | |
| | | | |
| | | | | |
This reverts commit 942c80b34c4642de3b0237761bc1aaeb8cbdd546.
|
|\ \ \ \ \
| |_|/ / /
|/| | | |
| | | | | |
Fix configure didn't work small problem in ALS
|
| | | | | |
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
We clone hadoop key and values by default and reuse objects if asked to.
We try to clone for most common types of writables and we call WritableUtils.clone otherwise intention is to optimize, for example for NullWritable there is no need and for Long, int and String creating a new object with value set would be faster than doing copy on object hopefully.
There is another way to do this PR where we ask for both key and values whether to clone them or not, but could not think of a use case for it except either of them is actually a NullWritable for which I have already worked around. So thought that would be unnecessary.
|
| | | | | | |
|
| | | | | | |
|
|\ \ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | | |
Upgrade Kafka dependecy to 0.8.0 release version
|