Merge pull request #297 from tdas/window-improvement - spark

author	Patrick Wendell <pwendell@gmail.com>	2014-01-02 13:20:54 -0800
committer	Patrick Wendell <pwendell@gmail.com>	2014-01-02 13:20:54 -0800
commit	588a1695f4b0b7763ecfa8ea56e371783810dd68 (patch)
tree	86913efaabe18c9ca8b27afa9f35947a392f2468 /docker
parent	7bafb68d77fa156c0dd7541aeef14626f867726b (diff)
parent	5fde4566ea48e5c6d6c50af032a29eaded2d7c43 (diff)

Merge pull request #297 from tdas/window-improvement

Improvements to DStream window ops and refactoring of Spark's CheckpointSuite

- Added a new RDD - PartitionerAwareUnionRDD. Using this RDD, one can take multiple RDDs partitioned by the same partitioner and unify them into a single RDD while preserving the partitioner. So m RDDs with p partitions each will be unified to a single RDD with p partitions and the same partitioner. The preferred location for each partition of the unified RDD will be the most common preferred location of the corresponding partitions of the parent RDDs. For example, location of partition 0 of the unified RDD will be where most of partition 0 of the parent RDDs are located.
- Improved the performance of DStream's reduceByKeyAndWindow and groupByKeyAndWindow. Both these operations work by doing per-batch reduceByKey/groupByKey and then using PartitionerAwareUnionRDD to union the RDDs across the window. This eliminates a shuffle related to the window operation, which can reduce batch processing time by 30-40% for simple workloads.
- Fixed bugs and simplified Spark's CheckpointSuite. Some of the tests were incorrect and unreliable. Added missing tests for ZippedRDD. I can go into greater detail if necessary.
- Added mapSideCombine option to combineByKeyAndWindow.
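
To illustrate the idea behind the partitioner-aware union, here is a minimal sketch in plain Python. It is not Spark code and does not reflect how PartitionerAwareUnionRDD is implemented. It models each RDD as a list of p partitions and assumes integer keys with a modulo partitioner. All function names (hash_partition, partitioner_aware_union, preferred_location, reduce_partition) are hypothetical and exist only for this example.

```python
from collections import Counter


def hash_partition(pairs, num_partitions):
    """Hypothetical partitioner: place each (key, value) pair by key % p."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[key % num_partitions].append((key, value))
    return partitions


def partitioner_aware_union(rdds):
    """Unify m modeled RDDs with p partitions each into one with p partitions.

    Partition i of the result is the concatenation of partition i of every
    parent, so records never move to a different partition index.
    """
    if not rdds:
        raise ValueError("need at least one RDD")
    num_partitions = len(rdds[0])
    if any(len(rdd) != num_partitions for rdd in rdds):
        raise ValueError("all RDDs must have the same number of partitions")
    return [
        [record for rdd in rdds for record in rdd[i]]
        for i in range(num_partitions)
    ]


def preferred_location(parent_locations):
    """Return the most common location among the parents' partitions.

    parent_locations holds one list of host names per parent RDD, all for
    the same partition index. Returns None if no parent has a preference.
    """
    counts = Counter(loc for locs in parent_locations for loc in locs)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def reduce_partition(partition, func):
    """Reduce values per key inside a single partition, with no shuffle."""
    reduced = {}
    for key, value in partition:
        reduced[key] = func(reduced[key], value) if key in reduced else value
    return sorted(reduced.items())


# Two already-reduced batches in a window, both partitioned the same way.
batch1 = hash_partition([(0, 1), (1, 2), (2, 3)], 2)
batch2 = hash_partition([(0, 10), (1, 20)], 2)

# Union across the window, then reduce within each partition.
window = partitioner_aware_union([batch1, batch2])
result = [reduce_partition(part, lambda a, b: a + b) for part in window]

# Partition 0 of three parents lives on host-a twice and host-b once.
location = preferred_location([["host-a"], ["host-b"], ["host-a"]])
```

Because every batch uses the same partitioner, all values for a given key already sit at the same partition index. The window reduction can therefore run partition by partition, which is the shuffle the commit message says is eliminated.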

Diffstat (limited to 'docker')

0 files changed, 0 insertions, 0 deletions

