author | Sean Owen <sowen@cloudera.com> | 2015-05-21 19:42:51 +0100 |
---|---|---|
committer | Sean Owen <sowen@cloudera.com> | 2015-05-21 19:42:51 +0100 |
commit | 6e534026963e567f92743c5721de16325645223e | |
tree | a52c4c10cb6ca8c0eec0b3b04fe8880fbd8cb494 /python/pyspark | |
parent | 4b7ff3092c53827817079e0810563cbb0b9d0747 | |
[SPARK-6416] [DOCS] RDD.fold() requires the operator to be commutative
Document current limitation of rdd.fold.
This does not resolve SPARK-6416 but just documents the issue.
CC JoshRosen
Author: Sean Owen <sowen@cloudera.com>
Closes #6231 from srowen/SPARK-6416 and squashes the following commits:
9fef39f [Sean Owen] Add comment to other languages; reword to highlight the difference from non-distributed collections and to not suggest it is a bug that is to be fixed
da40d84 [Sean Owen] Document current limitation of rdd.fold.
Diffstat (limited to 'python/pyspark')
-rw-r--r-- | python/pyspark/rdd.py | 12 |
1 files changed, 10 insertions, 2 deletions
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 70db4bbe4c..98a8ff8606 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -813,13 +813,21 @@ class RDD(object):
     def fold(self, zeroValue, op):
         """
         Aggregate the elements of each partition, and then the results for all
-        the partitions, using a given associative function and a neutral "zero
-        value."
+        the partitions, using a given associative and commutative function and
+        a neutral "zero value."
 
         The function C{op(t1, t2)} is allowed to modify C{t1} and return it
         as its result value to avoid object allocation; however, it should not
         modify C{t2}.
 
+        This behaves somewhat differently from fold operations implemented
+        for non-distributed collections in functional languages like Scala.
+        This fold operation may be applied to partitions individually, and then
+        fold those results into the final result, rather than apply the fold
+        to each element sequentially in some defined ordering. For functions
+        that are not commutative, the result may differ from that of a fold
+        applied to a non-distributed collection.
+
         >>> from operator import add
         >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
         15
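
The behaviour the new docstring describes can be illustrated without a Spark cluster. The following is a minimal pure-Python sketch, not Spark's implementation: `sequential_fold`, `partitioned_fold`, `join`, and the `merge_order` parameter are hypothetical names introduced here. `merge_order` is an assumption that stands in for partial results being combined in something other than partition order.

```python
from functools import reduce
from operator import add


def sequential_fold(items, zero, op):
    """Fold left to right over a single, non-distributed list."""
    return reduce(op, items, zero)


def partitioned_fold(partitions, zero, op, merge_order=None):
    """Fold each partition on its own, then fold the partial results.

    merge_order is a hypothetical knob (not a Spark parameter): the order
    in which the per-partition results are combined.
    """
    # Each partition starts from its own copy of the zero value.
    partials = [reduce(op, part, zero) for part in partitions]
    if merge_order is None:
        merge_order = range(len(partials))
    # The partial results are folded again, starting from the zero value.
    return reduce(op, (partials[i] for i in merge_order), zero)


def join(a, b):
    """String concatenation: associative, but not commutative."""
    return a + b


numbers = [[1, 2], [3], [4, 5]]
letters = [["a", "b"], ["c"], ["d", "e"]]

# Addition is associative and commutative, so the merge order is irrelevant.
sum_in_order = partitioned_fold(numbers, 0, add)
sum_reordered = partitioned_fold(numbers, 0, add, merge_order=[2, 0, 1])

# Concatenation is not commutative, so the merge order changes the result.
text_in_order = partitioned_fold(letters, "", join)
text_reordered = partitioned_fold(letters, "", join, merge_order=[2, 0, 1])
```

In this model both sums equal 15, matching the doctest above, while the two concatenations differ: only the in-order merge agrees with `sequential_fold` over the flattened list. That is the sense in which a non-commutative `op` may give a result different from a fold over a non-distributed collection.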