author | Sandy Ryza <sandy@cloudera.com> | 2015-01-28 12:41:23 -0800
---|---|---
committer | Patrick Wendell <patrick@databricks.com> | 2015-01-28 12:41:23 -0800
commit | 406f6d3070441962222f6a25449ea2c48f52ce88 (patch) |
tree | 13b32a67cdcf1b55423cb1f17ee96ca4a960c7bf /python |
parent | c8e934ef3cd06f02f9a2946e96a1a52293c22490 (diff) |
SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs
Author: Sandy Ryza <sandy@cloudera.com>
Closes #4251 from sryza/sandy-spark-5458 and squashes the following commits:
460827a [Sandy Ryza] Python too
d2dc160 [Sandy Ryza] SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs
Diffstat (limited to 'python')
-rw-r--r-- | python/pyspark/rdd.py | 4
1 file changed, 2 insertions, 2 deletions
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index f4cfe4845d..efd2f35912 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -1634,8 +1634,8 @@ class RDD(object):
         Hash-partitions the resulting RDD with into numPartitions partitions.

         Note: If you are grouping in order to perform an aggregation (such as a
-        sum or average) over each key, using reduceByKey will provide much
-        better performance.
+        sum or average) over each key, using reduceByKey or aggregateByKey will
+        provide much better performance.

         >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
         >>> map((lambda (x,y): (x, list(y))), sorted(x.groupByKey().collect()))
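The doc change above recommends aggregateByKey over groupByKey when the goal is an aggregation such as a sum or average, because aggregateByKey can combine values per key before shuffling. As a rough illustration of that per-key combine (plain Python rather than PySpark; `aggregate_by_key` is a hypothetical helper that only mimics the single-partition behavior, not Spark's distributed execution):

```python
def aggregate_by_key(pairs, zero, seq_op, comb_op):
    # Mimics Spark's aggregateByKey on a single "partition":
    # seq_op folds each value into the per-key accumulator;
    # comb_op (unused here) would merge accumulators from
    # different partitions after the shuffle.
    acc = {}
    for k, v in pairs:
        acc[k] = seq_op(acc.get(k, zero), v)
    return acc

# Track (sum, count) per key so an average can be derived,
# using the same sample data as the docstring example.
pairs = [("a", 1), ("b", 1), ("a", 1)]
sums = aggregate_by_key(
    pairs,
    (0, 0),                                   # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # fold a value into the accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge two accumulators
)
# sums is {"a": (2, 2), "b": (1, 1)}
```

Because only one small accumulator per key crosses the shuffle boundary, this pattern moves far less data than collecting every value per key the way groupByKey does.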