| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Conflicts:
examples/src/main/java/org/apache/spark/streaming/examples/JavaFlumeEventCount.java
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
|
| | |
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Improvements to DStream window ops and refactoring of Spark's CheckpointSuite
- Added a new RDD - PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified to a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, location of partition 0 of the unified RDD will be where most of partition 0 of the parent RDDs are located.
- Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both these operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads.
- Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary.
- Added mapSideCombine option to combineByKeyAndWindow.
|
| | | |
|
| | |
| | |
| | |
| | | |
mapSideCombine to combineByKeyAndWindow.
|
| | |\ |
|
| | |\ \
| | | | |
| | | | |
| | | | |
| | | | | |
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/dstream/WindowedDStream.scala
|
| | | | | |
|
| | | | | |
|
| | | | | |
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Also replaced SparkConf.getOrElse with just a "get" that takes a default
value, and added getInt, getLong, etc to make code that uses this
simpler later on.
|
| |\ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
|
| | |\ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
|
| | | |_|_|/
| | |/| | | |
|
| | | | | | |
|
| |\ \ \ \ \
| | | |/ / /
| | |/| | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Conflicts:
core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
|
| | | | | | |
|
| | | | | | |
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
0 partitions).
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
(FileNotFound exception is still caught and ignored).
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
due to failures due FileNotFound exceptions.
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Reynold's comments on PR 289.
|
| | | | | | |
|
| | | | | | |
|
| | |\| | | |
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
correctly.
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
FileSystem objects. Refactored JobGenerator to use actor so that all updating of DStream's metadata is single threaded.
|
| | |\ \ \ \ |
|
| | |\ \ \ \ \
| | | | |_|_|/
| | | |/| | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Conflicts:
core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
- Made file stream more robust to transient failures.
- Changed Spark.setCheckpointDir API to not have the second
'useExisting' parameter. Spark will always create a unique directory
for checkpointing underneath the directory provide to the funtion.
- Fixed bug wrt local relative paths as checkpoint directory.
- Made DStream and RDD checkpointing use
SparkContext.hadoopConfiguration, so that more HDFS compatible
filesystems are supported for checkpointing.
|
| | | | | | | |
|
| |\ \ \ \ \ \
| | | |_|_|/ /
| | |/| | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala
core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala
core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala
|
| | | | | | | |
|
| | | | | | | |
|
| | | | | | | |
|
| | | | | | | |
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
- Got rid of global SparkContext.globalConf
- Pass SparkConf to serializers and compression codecs
- Made SparkConf public instead of private[spark]
- Improved API of SparkContext and SparkConf
- Switched executor environment vars to be passed through SparkConf
- Fixed some places that were still using system properties
- Fixed some tests, though others are still failing
This still fails several tests in core, repl and streaming, likely due
to properties not being set or cleared correctly (some of the tests run
fine in isolation).
|
| | | | | | | |
|
| | | | | | | |
|
| | | | | | | |
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
repositoris from other poms and sbt.
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
their own self-contained scala API, java API, scala unit tests and java unit tests. Updated examples to use the external projects.
|
| |/ / / / /
|/| | | | | |
|
| | | | | | |
|
| | | | | | |
|
| |_|_|/ /
|/| | | | |
|
| |_|/ /
|/| | | |
|
| | | | |
|
|\| | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Conflicts:
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
streaming/src/main/scala/org/apache/spark/streaming/dstream/ForEachDStream.scala
|
| | | | |
|