diff options
author | Ali Ghodsi <alig@cs.berkeley.edu> | 2013-08-14 19:59:21 -0700 |
---|---|---|
committer | Ali Ghodsi <alig@cs.berkeley.edu> | 2013-08-20 16:13:36 -0700 |
commit | c4d59910b149b8b9bbf729f38e3eef3fb64fc85b (patch) | |
tree | 846a55ca05e5ed172745151ca4a2e91141a40a35 /core | |
parent | 7a2a33e32dede41937570ec77cf1dfad070e963f (diff) | |
download | spark-c4d59910b149b8b9bbf729f38e3eef3fb64fc85b.tar.gz spark-c4d59910b149b8b9bbf729f38e3eef3fb64fc85b.tar.bz2 spark-c4d59910b149b8b9bbf729f38e3eef3fb64fc85b.zip |
added goals inline as comment
Diffstat (limited to 'core')
-rw-r--r-- | core/src/main/scala/spark/rdd/CoalescedRDD.scala | 21 |
1 files changed, 21 insertions, 0 deletions
diff --git a/core/src/main/scala/spark/rdd/CoalescedRDD.scala b/core/src/main/scala/spark/rdd/CoalescedRDD.scala index 61c4d0c004..6af55cd80c 100644 --- a/core/src/main/scala/spark/rdd/CoalescedRDD.scala +++ b/core/src/main/scala/spark/rdd/CoalescedRDD.scala @@ -60,6 +60,27 @@ private[spark] case class CoalescedRDDPartition( * * This transformation is useful when an RDD with many partitions gets filtered into a smaller one, * or to avoid having a large number of small tasks when processing a directory with many files. + * + * If there is no locality information (no preferredLocations) in the parent RDD, then the coalescing + * is very simple: chunk parents that are close in the Array in chunks. + * If there is locality information, it proceeds to pack them with the following three goals in mind: + * + * (1) Balance the groups so they roughly have the same number of parent partitions + * (2) Achieve locality per partition, i.e. there exists one machine which most parent partitions prefer + * (3) Be efficient, i.e. O(n) algorithm for n parent partitions (underlying problem is likely NP-hard) + * (4) Balance preferred machines, i.e. avoid as much as possible picking the same preferred machine + * + * Furthermore, it is assumed that the parent RDD may have many partitions, e.g. 100 000. + * We assume the final number of desired partitions is small, e.g. less than 1000. + * + * The algorithm tries to assign unique preferred machines to each partition. If the number of desired + * partitions is greater than the number of preferred machines (can happen), it needs to start picking + * duplicate preferred machines. This is determined using coupon collector estimation (2n log(n)). + * The load balancing is done using power-of-two randomized bins-balls with one twist: it tries to + * also achieve locality. This is done by allowing a slack (balanceSlack) between two bins. + * If two bins are within the slack in terms of balance, the algorithm will assign partitions + * according to locality. (contact alig for questions) + * */ class CoalescedRDD[T: ClassManifest]( @transient var prev: RDD[T], |