author Patrick Wendell <pwendell@gmail.com> 2014-01-02 13:20:54 -0800
committer Patrick Wendell <pwendell@gmail.com> 2014-01-02 13:20:54 -0800
commit 588a1695f4b0b7763ecfa8ea56e371783810dd68
tree 86913efaabe18c9ca8b27afa9f35947a392f2468 /docker
parent 7bafb68d77fa156c0dd7541aeef14626f867726b
parent 5fde4566ea48e5c6d6c50af032a29eaded2d7c43
Merge pull request #297 from tdas/window-improvement

Improvements to DStream window ops and refactoring of Spark's CheckpointSuite

- Added a new RDD - PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified to a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, location of partition 0 of the unified RDD will be where most of partition 0 of the parent RDDs are located.

- Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both these operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads.

- Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary.

- Added mapSideCombine option to combineByKeyAndWindow.
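
The partition-wise union described in the first bullet can be illustrated with a small plain-Python model. This is only a sketch and not Spark's implementation: each "RDD" is assumed to be a list of p partitions, each partition a (preferred_location, records) pair, and the function name partitioner_aware_union and the host names are hypothetical.

```python
from collections import Counter


def partitioner_aware_union(rdds):
    """Unify m modeled RDDs with p partitions each into one with p partitions.

    Hypothetical model, not Spark's API: every input is a list of
    (preferred_location, records) pairs, and all inputs are assumed to be
    partitioned the same way, so partition i of each input holds the same keys.
    """
    if not rdds:
        return []
    p = len(rdds[0])
    if any(len(rdd) != p for rdd in rdds):
        raise ValueError("all inputs must have the same number of partitions")

    unified = []
    for i in range(p):
        parents = [rdd[i] for rdd in rdds]
        # Preferred location: the most common one among the parents' partition i.
        location = Counter(loc for loc, _ in parents).most_common(1)[0][0]
        # Records are concatenated partition by partition; no key moves to
        # another partition, which is why no shuffle is needed.
        records = [rec for _, recs in parents for rec in recs]
        unified.append((location, records))
    return unified


# Three per-batch results (m = 3), each with p = 2 partitions.
batches = [
    [("host-a", [("x", 1)]), ("host-b", [("y", 2)])],
    [("host-a", [("x", 3)]), ("host-c", [("y", 4)])],
    [("host-b", [("x", 5)]), ("host-c", [("y", 6)])],
]
window = partitioner_aware_union(batches)
```

In this model the unified result still has two partitions, and partition 0 prefers host-a because two of its three parent partitions are located there.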
Diffstat (limited to 'docker')
0 files changed, 0 insertions, 0 deletions