Commit message | Author | Date | Files | Lines
* Merge pull request #408 from pwendell/external-serializers | Patrick Wendell | 2014-01-13 | 4 | -13/+64
|\
| |   Improvements to external sorting
| |
| |   1. Adds the option of compressing outputs.
| |   2. Adds batching to the serialization to prevent OOM on the read side.
| |   3. Slight renaming of config options.
| |   4. Use Spark's buffer size for reads in addition to writes.
| * Fix for Kryo Serializer | Patrick Wendell | 2014-01-13 | 1 | -2/+13
| |
| * Changing option wording per discussion with Andrew | Patrick Wendell | 2014-01-13 | 4 | -8/+8
| |
| * Improvements to external sorting | Patrick Wendell | 2014-01-13 | 4 | -12/+52
| |   1. Adds the option of compressing outputs.
| |   2. Adds batching to the serialization to prevent OOM on the read side.
| |   3. Slight renaming of config options.
| |   4. Use Spark's buffer size for reads in addition to writes.
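
These options are ordinary Spark configuration settings. A minimal sketch of enabling output compression for the external sorter, assuming the renamed key landed as spark.shuffle.spill.compress (the log above does not name the final keys):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: the configuration key below is an assumption, since the
    // commit message says the options were renamed without listing them.
    val conf = new SparkConf()
      .setAppName("external-sort-compression")
      .set("spark.shuffle.spill.compress", "true")
    val sc = new SparkContext(conf)
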
* | Merge pull request #413 from rxin/scaladoc | Patrick Wendell | 2014-01-13 | 21 | -40/+66
|\ \
| | |   Adjusted visibility of various components and documentation for the 0.9.0 release.
| * | Adjusted visibility of various components. | Reynold Xin | 2014-01-13 | 21 | -40/+66
| | |
* | | Merge pull request #401 from andrewor14/master | Patrick Wendell | 2014-01-13 | 11 | -47/+151
|\ \ \
| | | |   External sorting - Add number of bytes spilled to Web UI
| | | |
| | | |   Additionally, update the test suite for external sorting to induce spilling.
| * | | Wording changes per Patrick | Andrew Or | 2014-01-13 | 2 | -7/+7
| | | |
| * | | Report bytes spilled for both memory and disk on Web UI | Andrew Or | 2014-01-12 | 8 | -35/+75
| | | |
| * | | Enable external sorting by default | Andrew Or | 2014-01-12 | 2 | -2/+2
| | | |
| * | | Get rid of spill map in SparkEnv | Andrew Or | 2014-01-12 | 5 | -15/+7
| | | |
| * | | Add number of bytes spilled to Web UI | Andrew Or | 2014-01-10 | 11 | -14/+75
| | | |
| * | | Induce spilling in ExternalAppendOnlyMapSuite | Andrew Or | 2014-01-10 | 1 | -33/+44
| | | |
* | | | | Merge pull request #409 from tdas/unpersist | Patrick Wendell | 2014-01-13 | 5 | -30/+63
|\ \ \ \ \
| | | | |   Automatically unpersisting RDDs that have been cleaned up from DStreams
| | | | |
| | | | |   Earlier, RDDs generated by DStreams were forgotten but not unpersisted. The system
| | | | |   relied on the natural BlockManager LRU to drop the data. The cleaner.ttl was a
| | | | |   hammer to clean up RDDs, but it is something that needs to be set separately and
| | | | |   very conservatively (at best, a few minutes). This automatic unpersisting lets the
| | | | |   system handle cleanup on its own, which reduces memory usage. As a side effect it
| | | | |   will also improve GC performance, as fewer objects are stored in memory. In fact,
| | | | |   for some workloads, it may allow RDDs to be cached as deserialized, which speeds
| | | | |   up processing without too much GC overhead.
| | | | |
| | | | |   This is disabled by default. To enable it, set the configuration
| | | | |   spark.streaming.unpersist to true. In a future release, this will be set to true
| | | | |   by default.
| | | | |
| | | | |   Also, reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to
| | | | |   1 second. From my conversation with Matei, there does not seem to be any good
| | | | |   reason for such a long sleep just to let messages be sent out.
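
The toggle above is set through ordinary configuration. A minimal sketch of enabling it, assuming the 0.9-era StreamingContext(SparkConf, Duration) constructor:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: opt in to automatic unpersisting of old DStream-generated RDDs.
    // spark.streaming.unpersist is the configuration named in the commit message.
    val conf = new SparkConf()
      .setAppName("unpersist-example")
      .set("spark.streaming.unpersist", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
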
| * | | | | Added unpersisting and modified test suite to better test out metadata cleaning. | Tathagata Das | 2014-01-13 | 5 | -30/+63
| | | | |
* | | | | | Merge pull request #412 from harveyfeng/master | Patrick Wendell | 2014-01-13 | 1 | -1/+1
|\ \ \ \ \ \
| | | | | |   Add default value for HadoopRDD's `cloneRecords` constructor arg
| | | | | |
| | | | | |   Small fix to https://github.com/apache/incubator-spark/pull/359/files#diff-1
| | | | | |   for backwards compatibility.
| * | | | | | Add default value for HadoopRDD's `cloneRecords` constructor arg, to maintain backwards compatibility. | Harvey | 2014-01-13 | 1 | -1/+1
| | | | | | |
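
The backwards-compatibility pattern here is a Scala default parameter value. A schematic sketch with hypothetical class names, not the actual HadoopRDD signature:

    // Hypothetical illustration of the pattern: a newly added parameter gets a
    // default value so that existing call sites keep compiling.
    class RecordSourceOld(path: String)
    class RecordSourceNew(path: String, cloneRecords: Boolean = true)

    val legacy = new RecordSourceNew("/data/in")         // old-style call still compiles
    val opted  = new RecordSourceNew("/data/in", false)  // new callers can override
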
* | | | | | Merge pull request #411 from tdas/filestream-fix | Patrick Wendell | 2014-01-13 | 2 | -62/+69
|\ \ \ \ \ \
| |/ / / / /
|/| | | | |
| | | | | |   Improved logic for finding new files in FileInputDStream
| | | | | |
| | | | | |   Earlier, if HDFS had a hiccup and reported the existence of a new file (mod time
| | | | | |   T sec) at time T + 1 sec, fileStream could have missed that file. With this
| | | | | |   change, it should be able to find files that are delayed by up to <batch size>
| | | | | |   seconds. That is, even if a file is reported at T + <batch time> sec, the file
| | | | | |   stream should be able to catch it.
| | | | | |
| | | | | |   The new logic, at a high level, is as follows. It keeps track of the new files
| | | | | |   it found in the previous interval and the mod time of the oldest of those files
| | | | | |   (let's call it X). Then in the current interval, it ignores files that were seen
| | | | | |   in the previous interval and files whose mod time is older than X. So if a new
| | | | | |   file gets reported by HDFS in the current interval but has a mod time within the
| | | | | |   previous interval, it will be considered. However, files with mod times earlier
| | | | | |   than the previous interval (that is, earlier than X) will be ignored. This is
| | | | | |   the current limitation, and a future version will improve this behavior further.
| | | | | |
| | | | | |   Also reduced line lengths in DStream to <= 100 chars.
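
A simplified sketch of the selection rule described above; this is illustrative, not the actual FileInputDStream code, and the names are invented:

    // Model of the rule: remember the files seen in the previous interval and
    // the oldest mod time among them (X); accept a candidate only if it was not
    // seen before and its mod time is not older than X.
    def selectNewFiles(
        candidates: Seq[(String, Long)],   // (path, modification time)
        seenLastInterval: Set[String],
        oldestPrevModTime: Long): Seq[(String, Long)] =
      candidates.filter { case (path, modTime) =>
        !seenLastInterval.contains(path) && modTime >= oldestPrevModTime
      }
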
| * | | | | Improved file input stream further. | Tathagata Das | 2014-01-13 | 2 | -62/+69
| |/ / / /
* | | | | Merge pull request #410 from rxin/scaladoc1 | Reynold Xin | 2014-01-13 | 1 | -1/+1
|\ \ \ \ \
| |/ / / /
|/| | / /
| | |/ /
| |/| |
| | | |   Updated JavaStreamingContext to make scaladoc compile.
| | | |
| | | |   `sbt/sbt doc` used to fail. This fixed it.
| * | | Updated JavaStreamingContext to make scaladoc compile. | Reynold Xin | 2014-01-13 | 1 | -1/+1
|/ / /
| | |   `sbt/sbt doc` used to fail. This fixed it.
* | | Merge pull request #400 from tdas/dstream-move | Patrick Wendell | 2014-01-13 | 40 | -78/+121
|\ \ \
| |_|/
|/| |
| | |   Moved DStream and PairDStream to org.apache.spark.streaming.dstream
| | |
| | |   Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been
| | |   moved from `org.apache.spark.streaming.DStream` to
| | |   `org.apache.spark.streaming.dstream.DStream`. I know that the package name is a
| | |   little long, but I think it's better to keep it consistent with Spark's structure.
| | |
| | |   Also fixed persistence of windowed DStreams. The RDDs generated by a windowed
| | |   DStream are essentially unions of underlying RDDs, and persisting these union RDDs
| | |   would store numerous copies of the underlying data. Instead, setting the persistence
| | |   level on the windowed DStream now sets the persistence level of the underlying
| | |   DStream.
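
For user code, the move amounts to an import change:

    // Before the move:
    //   import org.apache.spark.streaming.DStream
    // After the move, mirroring org.apache.spark.rdd.RDD:
    import org.apache.spark.streaming.dstream.DStream
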
| * | Fixed import formatting. | Tathagata Das | 2014-01-12 | 6 | -6/+6
| | |
| * | Merge remote-tracking branch 'apache/master' into dstream-move | Tathagata Das | 2014-01-12 | 12 | -24/+61
| |\ \
| | | |   Conflicts:
| | | |     streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
| * | | Fixed persistence logic of WindowedDStream, and fixed default persistence level of input streams. | Tathagata Das | 2014-01-12 | 9 | -10/+41
| | | |
| * | | Merge remote-tracking branch 'apache/master' into dstream-move | Tathagata Das | 2014-01-12 | 4 | -6/+18
| |\ \ \
| * \ \ \ Merge branch 'error-handling' into dstream-move | Tathagata Das | 2014-01-12 | 11 | -70/+110
| |\ \ \ \
| * | | | | Moved DStream, DStreamCheckpointData and PairDStream from org.apache.spark.streaming to org.apache.spark.streaming.dstream. | Tathagata Das | 2014-01-12 | 37 | -70/+82
| | | | | |
* | | | | | Merge pull request #397 from pwendell/host-port | Reynold Xin | 2014-01-12 | 16 | -52/+7
|\ \ \ \ \ \
| | | | | | |   Remove now-unneeded hostPort option
| | | | | | |
| | | | | | |   I noticed this was logging some scary error messages in various places. After
| | | | | | |   I looked into it, this option is no longer really used. I removed it and
| | | | | | |   rewrote the one remaining use case (it was unnecessary there anyway).
| * | | | | | Removing mentions in tests | Patrick Wendell | 2014-01-12 | 12 | -18/+2
| | | | | | |
| * | | | | | Remove now-unneeded hostPort option | Patrick Wendell | 2014-01-12 | 4 | -34/+5
| | | | | | |
* | | | | | | Merge pull request #399 from pwendell/consolidate-off | Patrick Wendell | 2014-01-12 | 2 | -2/+2
|\ \ \ \ \ \ \
| | | | | | | |   Disable shuffle file consolidation by default
| | | | | | | |
| | | | | | | |   After running various performance tests for the 0.9 release, this still
| | | | | | | |   seems to have performance issues, even on XFS. So let's keep it off by
| | | | | | | |   default for 0.9; users can experiment with it depending on their disk
| | | | | | | |   configurations.
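
Users who want to experiment can flip the setting back on explicitly. A sketch, assuming the 0.9-era key spark.shuffle.consolidateFiles (the key is not named in this log):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: re-enable shuffle file consolidation for experimentation.
    // The key name is an assumption based on the 0.9-era configuration.
    val conf = new SparkConf()
      .setAppName("consolidation-experiment")
      .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)
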
| * | | | | | | Disable shuffle file consolidation by default | Patrick Wendell | 2014-01-12 | 2 | -2/+2
| |/ / / / / /
* | | | | | | Merge pull request #395 from hsaputra/remove_simpleredundantreturn_scala | Patrick Wendell | 2014-01-12 | 63 | -194/+194
|\ \ \ \ \ \ \
| |_|_|_|_|/ /
|/| | | | | |
| | | | | | |   Remove simple redundant return statements for Scala methods/functions
| | | | | | |
| | | | | | |   Remove simple redundant return statements for Scala methods/functions:
| | | | | | |   -) Only change simple return statements at the end of a method
| | | | | | |   -) Ignore the complex if-else checks
| | | | | | |   -) Ignore the ones inside synchronized blocks
| | | | | | |   -) Also make small changes turning vars into vals where possible, and remove
| | | | | | |      () for simple getters
| | | | | | |
| | | | | | |   This hopefully makes the review simpler =)
| | | | | | |
| | | | | | |   Passes compile and tests.
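
The kind of change the PR describes, as a schematic Scala example (illustrative, not taken from the diff):

    // Before: the trailing `return` is redundant in Scala.
    def mkStringBefore(xs: Seq[Int]): String = {
      return xs.mkString(",")
    }

    // After: the last expression is the method's result.
    def mkStringAfter(xs: Seq[Int]): String = {
      xs.mkString(",")
    }
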
| * | | | | | Address code review concerns and comments. | Henry Saputra | 2014-01-12 | 8 | -19/+18
| | | | | | |
| * | | | | | Fix accidental comment modification. | Henry Saputra | 2014-01-12 | 1 | -1/+1
| | | | | | |
| * | | | | | Merge branch 'master' into remove_simpleredundantreturn_scala | Henry Saputra | 2014-01-12 | 114 | -637/+3809
| |\| | | | |
| * | | | | | Remove simple redundant return statements for Scala methods/functions: | Henry Saputra | 2014-01-12 | 63 | -186/+187
| | | | | | |   -) Only change simple return statements at the end of a method
| | | | | | |   -) Ignore the complex if-else checks
| | | | | | |   -) Ignore the ones inside synchronized blocks
* | | | | | | Merge pull request #394 from tdas/error-handling | Patrick Wendell | 2014-01-12 | 23 | -231/+591
|\ \ \ \ \ \ \
| | | | | | | |   Better error handling in Spark Streaming and more API cleanup
| | | | | | | |
| | | | | | | |   Earlier, errors in jobs generated by Spark Streaming (or in the generation
| | | | | | | |   of jobs) could not be caught from the main driver thread (i.e. the thread
| | | | | | | |   that called StreamingContext.start()), as they would be thrown in different
| | | | | | | |   threads. With this change, after `ssc.start`, one can call
| | | | | | | |   `ssc.awaitTermination()`, which will block until the ssc is closed or there
| | | | | | | |   is an exception. This makes it easier to debug.
| | | | | | | |
| | | | | | | |   This change also adds ssc.stop(<stop-spark-context>), with which you can
| | | | | | | |   stop the StreamingContext without stopping the SparkContext.
| | | | | | | |
| | | | | | | |   Also fixes the bug that came up with PRs #393 and #381. The MetadataCleaner
| | | | | | | |   default value has been changed from 3500 to -1 for a normal SparkContext,
| | | | | | | |   and to 3600 when creating a StreamingContext.
| | | | | | | |
| | | | | | | |   Also, updated StreamingListenerBus with changes similar to SparkListenerBus
| | | | | | | |   in #392, and changed a lot of protected[streaming] to private[streaming].
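
A minimal usage sketch of the API described above; the stream setup is elided, and only start/awaitTermination/stop come from the commit message:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("await-termination-example")
    val ssc = new StreamingContext(conf, Seconds(1))
    // ... define input DStreams and output operations here ...
    ssc.start()
    ssc.awaitTermination()  // blocks until the context stops, or surfaces job errors

    // To stop streaming without tearing down the SparkContext (the
    // <stop-spark-context> flag mentioned above):
    // ssc.stop(false)
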
| * \ \ \ \ \ \ Merge remote-tracking branch 'apache/master' into error-handling | Tathagata Das | 2014-01-12 | 4 | -6/+18
| |\ \ \ \ \ \ \
| | |_|_|_|/ / /
| |/| | | | / /
| | | |_|_|/ /
| | |/| | | |
| * | | | | | Changed StreamingContext.stopForWait to awaitTermination. | Tathagata Das | 2014-01-12 | 4 | -15/+15
| | | | | | |
| * | | | | Fixed bugs to ensure better cleanup of JobScheduler, JobGenerator and NetworkInputTracker upon close. | Tathagata Das | 2014-01-12 | 9 | -56/+96
| | |_|_|/ /
| |/| | | |
| * | | | | Fixed bugs. | Tathagata Das | 2014-01-12 | 3 | -3/+3
| | | | | |
| * | | | | Merge remote-tracking branch 'apache/master' into error-handling | Tathagata Das | 2014-01-11 | 30 | -174/+1347
| |\ \ \ \ \
| | | |_|/ /
| | |/| | |
| * | | | | Added waitForStop and stop to JavaStreamingContext. | Tathagata Das | 2014-01-11 | 2 | -3/+23
| | | | | |
| * | | | | Converted JobScheduler to use actors for event handling. Changed protected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and StreamingContextSuite. | Tathagata Das | 2014-01-11 | 17 | -185/+485
| | | | | |
* | | | | | Merge pull request #398 from pwendell/streaming-api | Patrick Wendell | 2014-01-12 | 10 | -28/+65
|\ \ \ \ \ \
| |_|_|/ / /
|/| | | | |
| | | | | |   Rename DStream.foreach to DStream.foreachRDD
| | | | | |
| | | | | |   `foreachRDD` makes it clear that the granularity of this operator is per-RDD.
| | | | | |   As it stands, `foreach` is inconsistent with `map`, `filter`, and the other
| | | | | |   DStream operators, which get pushed down to individual records within each RDD.
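
A short sketch of the renamed operator; the stream wiring is illustrative:

    import org.apache.spark.streaming.dstream.DStream

    // foreachRDD runs once per batch with the whole RDD in scope, unlike map or
    // filter, which apply to individual records within each RDD.
    def logBatchSizes(stream: DStream[String]): Unit =
      stream.foreachRDD { rdd =>
        println(s"records in this batch: ${rdd.count()}")
      }
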
| * | | | | Adding deprecated versions of old code | Patrick Wendell | 2014-01-12 | 2 | -8/+45
| | | | | |
| * | | | | Rename DStream.foreach to DStream.foreachRDD | Patrick Wendell | 2014-01-12 | 10 | -22/+22
| | |/ / /
| |/| | |
| | | | |   `foreachRDD` makes it clear that the granularity of this operator is per-RDD.
| | | | |   As it stands, `foreach` is inconsistent with `map`, `filter`, and the other
| | | | |   DStream operators, which get pushed down to individual records within each RDD.
* | | | | Merge pull request #396 from pwendell/executor-env | Patrick Wendell | 2014-01-12 | 1 | -1/+1
|\ \ \ \ \
| | | | | |   Setting load defaults to true in executor
| | | | | |
| | | | | |   This preserves the behavior in earlier releases. If properties are set for the
| | | | | |   executors via `spark-env.sh` on the slaves, then they should take precedence
| | | | | |   over Spark defaults. This is useful if system administrators are setting
| | | | | |   properties for a standalone cluster, such as shuffle locations.
| | | | | |
| | | | | |   /cc @andrewor14 who initially reported this issue.
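
An illustrative spark-env.sh entry of the kind described; the specific property is an example, not taken from this log:

    # spark-env.sh on each slave: admin-set properties that should take
    # precedence over Spark defaults. The property below is only an example.
    export SPARK_JAVA_OPTS="-Dspark.local.dir=/mnt/fast-disk/spark"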