| Commit message | Author | Age | Files | Lines |
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala
examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
JavaStreamingContext.getOrCreate.
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
RecoverableNetworkWordCount example to use it.
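The two getOrCreate fragments above refer to recovering a StreamingContext from existing checkpoint data and creating a fresh context only when no checkpoint is found. Below is a minimal Scala sketch of that usage pattern; the checkpoint directory, host, and port are placeholders, and JavaStreamingContext exposes an equivalent getOrCreate for the Java API.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// Placeholder checkpoint location for this sketch.
val checkpointDir = "/tmp/recoverable-wordcount-checkpoint"

// Builds a brand-new context; only invoked when no checkpoint exists yet.
def createContext(): StreamingContext = {
  val ssc = new StreamingContext("local[2]", "RecoverableExample", Seconds(1))
  ssc.checkpoint(checkpointDir)
  val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
  words.map(w => (w, 1)).reduceByKey(_ + _).print()
  ssc
}

// Recover the context from the checkpoint if present, otherwise create it.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

On a restart after a driver failure, the same getOrCreate call finds the checkpoint and rebuilds the context instead of calling createContext again.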
External Sorting for Aggregator and CoGroupedRDDs (Revisited)
(This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving)
The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.
The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order.
Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
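To make the spill-and-merge idea concrete, below is a minimal illustrative sketch in Scala. It is not Spark's ExternalAppendOnlyMap: the class and method names are invented, and a real implementation would stream each spill file and merge-combine duplicate keys across spills in sorted order rather than simply concatenating entries.

```scala
import java.io._
import scala.collection.mutable

// Illustrative only: an in-memory map that spills sorted batches to disk
// once it exceeds a size threshold, then iterates over memory plus spills.
class SpillableMap[K, V](maxInMemoryEntries: Int = 100000) extends Iterable[(K, V)] {
  private val inMemory = mutable.HashMap.empty[K, V]
  private val spillFiles = mutable.ArrayBuffer.empty[File]

  // Insert a record, combining it with any existing value for the same key.
  def insert(key: K, value: V, combine: (V, V) => V): Unit = {
    inMemory(key) = inMemory.get(key).map(combine(_, value)).getOrElse(value)
    if (inMemory.size >= maxInMemoryEntries) spill()
  }

  // Write the in-memory contents to a temp file, sorted by key hash code,
  // so that spilled files could later be merged in sorted order.
  private def spill(): Unit = {
    val file = File.createTempFile("spillable-map-", ".tmp")
    val out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)))
    try {
      out.writeInt(inMemory.size)
      inMemory.toSeq.sortBy(_._1.hashCode).foreach(out.writeObject)
    } finally out.close()
    spillFiles += file
    inMemory.clear()
  }

  // Iterate over everything: entries still in memory plus every spill file.
  override def iterator: Iterator[(K, V)] =
    inMemory.iterator ++ spillFiles.iterator.flatMap(readSpill)

  private def readSpill(file: File): Iterator[(K, V)] = {
    val in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(file)))
    try (1 to in.readInt()).map(_ => in.readObject().asInstanceOf[(K, V)]).iterator
    finally in.close()
  }
}
```

When the threshold is never reached, nothing is spilled and the structure behaves like an ordinary in-memory map, matching the low-overhead behavior described above.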
Conflicts:
core/src/main/scala/org/apache/spark/SparkEnv.scala
streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
Conflicts:
core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
Set default logging to WARN for Spark streaming examples.
This programmatically sets the log level to WARN by default for streaming
tests. If the user has already specified a log4j.properties file,
the user's file will take precedence over this default.
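A minimal sketch of how such a default can be applied without clobbering user configuration, assuming log4j 1.x; the helper object name is invented for this example. If the root logger already has appenders, a user-supplied log4j configuration was loaded and is left alone; otherwise the default level is lowered to WARN.

```scala
import org.apache.log4j.{Level, Logger}

// Hypothetical helper; only the log4j calls themselves are real API.
object ExampleLogging {
  def setDefaultLogLevel(): Unit = {
    // Appenders on the root logger mean the user supplied their own
    // log4j configuration, so we should not override it.
    val userConfigured = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!userConfigured) {
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}
```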
Conflicts:
examples/src/main/java/org/apache/spark/streaming/examples/JavaFlumeEventCount.java
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
Improvements to DStream window ops and refactoring of Spark's CheckpointSuite
- Added a new RDD - PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified to a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, location of partition 0 of the unified RDD will be where most of partition 0 of the parent RDDs are located.
- Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both these operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads (a usage sketch follows this list).
- Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary.
- Added mapSideCombine option to combineByKeyAndWindow.
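For context, here is a small usage sketch of the windowed reduce referenced in the second item. The host, port, and durations are arbitrary placeholders, and the PartitionerAwareUnionRDD optimization is internal, so the user-facing call does not change.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// Arbitrary example: word counts over a sliding window.
val ssc = new StreamingContext("local[2]", "WindowedWordCount", Seconds(2))
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

// Per-batch reduceByKey results are combined across a 30-second window that
// slides every 10 seconds; this is the operation the change above speeds up.
val windowedCounts = words.map(w => (w, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

windowedCounts.print()
ssc.start()
ssc.awaitTermination()
```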
mapSideCombine to combineByKeyAndWindow.
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/dstream/WindowedDStream.scala
Also replaced SparkConf.getOrElse with just a "get" that takes a default
value, and added getInt, getLong, etc. to make code that uses this
simpler later on.
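For illustration, a sketch of how the resulting accessors read at call sites; the configuration keys and default values below are arbitrary examples, not recommended settings.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()

// "get" with an explicit default replaces the old getOrElse.
val master = conf.get("spark.master", "local[*]")

// Typed helpers avoid manual parsing at call sites.
val cores     = conf.getInt("spark.example.cores", 4)            // arbitrary key
val timeoutMs = conf.getLong("spark.example.timeout.ms", 60000L) // arbitrary key
```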
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
Conflicts:
core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
0 partitions).
(FileNotFound exception is still caught and ignored).
due to failures due to FileNotFound exceptions.
Reynold's comments on PR 289.
correctly.
FileSystem objects. Refactored JobGenerator to use an actor so that all updating of DStream's metadata is single-threaded.
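A generic sketch of that single-threading pattern, using Akka actors (which Spark used at the time); the actor, message, and field names below are invented and are not Spark's actual JobGenerator internals.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical message type: "a batch finished, record it".
case class BatchCompleted(batchTimeMs: Long)

// All metadata mutations go through this one actor, so they are applied on a
// single message-processing thread and need no extra locking.
class MetadataActor extends Actor {
  private var lastCompletedBatchMs = 0L
  def receive: Receive = {
    case BatchCompleted(time) =>
      lastCompletedBatchMs = math.max(lastCompletedBatchMs, time)
  }
}

val system = ActorSystem("sketch")
val metadata = system.actorOf(Props[MetadataActor](), "metadata")

// Many threads may send messages concurrently; the actor handles them one at a time.
metadata ! BatchCompleted(System.currentTimeMillis())
```

Because an actor processes one message at a time, concurrent senders never mutate the metadata simultaneously, which is the property the refactoring relies on.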
Conflicts:
core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala