author    Sandy Ryza <sandy@cloudera.com>          2015-01-28 12:41:23 -0800
committer Patrick Wendell <patrick@databricks.com> 2015-01-28 12:41:23 -0800
commit    406f6d3070441962222f6a25449ea2c48f52ce88 (patch)
tree      13b32a67cdcf1b55423cb1f17ee96ca4a960c7bf
parent    c8e934ef3cd06f02f9a2946e96a1a52293c22490 (diff)
SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs
Author: Sandy Ryza <sandy@cloudera.com>

Closes #4251 from sryza/sandy-spark-5458 and squashes the following commits:

460827a [Sandy Ryza] Python too
d2dc160 [Sandy Ryza] SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs
-rw-r--r--  docs/programming-guide.md | 2 +-
-rw-r--r--  python/pyspark/rdd.py     | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 2443fc29b4..6486614e71 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -886,7 +886,7 @@ for details.
<td> <b>groupByKey</b>([<i>numTasks</i>]) </td>
<td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs. <br />
<b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or
- average) over each key, using <code>reduceByKey</code> or <code>combineByKey</code> will yield much better
+ average) over each key, using <code>reduceByKey</code> or <code>aggregateByKey</code> will yield much better
performance.
<br />
<b>Note:</b> By default, the level of parallelism in the output depends on the number of partitions of the parent RDD.
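
For readers skimming the change, a minimal sketch of the pattern this note recommends: computing a per-key average with aggregateByKey, so that only partial (sum, count) accumulators cross the shuffle instead of every raw value. This is illustrative PySpark, not part of the commit; it assumes an active SparkContext named sc, and the sample data is made up.

    # Per-key average with aggregateByKey: values are folded into (sum, count)
    # accumulators map-side, so the shuffle moves one pair per key per partition.
    pairs = sc.parallelize([("a", 1), ("b", 4), ("a", 3)])

    sum_counts = pairs.aggregateByKey(
        (0, 0),                                    # zero value: (sum, count)
        lambda acc, v: (acc[0] + v, acc[1] + 1),   # seqOp: fold one value in
        lambda a, b: (a[0] + b[0], a[1] + b[1]))   # combOp: merge accumulators

    averages = sum_counts.mapValues(lambda p: p[0] / float(p[1]))
    # sorted(averages.collect()) -> [('a', 2.0), ('b', 4.0)]

    # The groupByKey version ships every raw value across the shuffle first:
    slow = pairs.groupByKey().mapValues(list).mapValues(
        lambda vs: sum(vs) / float(len(vs)))
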
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index f4cfe4845d..efd2f35912 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -1634,8 +1634,8 @@ class RDD(object):
Hash-partitions the resulting RDD into numPartitions partitions.
Note: If you are grouping in order to perform an aggregation (such as a
- sum or average) over each key, using reduceByKey will provide much
- better performance.
+ sum or average) over each key, using reduceByKey or aggregateByKey will
+ provide much better performance.
>>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> map((lambda (x,y): (x, list(y))), sorted(x.groupByKey().collect()))
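
And when the aggregation is a plain sum, as in the doctest above, reduceByKey alone covers it; again an illustrative sketch, assuming the same sc:

    # Per-key sum: reduceByKey pre-combines values on each partition,
    # like a combiner in MapReduce, before shuffling the partial sums.
    counts = sc.parallelize([("a", 1), ("b", 1), ("a", 1)]).reduceByKey(
        lambda a, b: a + b)
    # sorted(counts.collect()) -> [('a', 2), ('b', 1)]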