author    Patrick Wendell <pwendell@gmail.com>  2014-05-14 22:24:04 -0700
committer Patrick Wendell <pwendell@gmail.com>  2014-05-14 22:24:04 -0700
commit    21570b463388194877003318317aafd842800cac (patch)
tree      b7645576a7c1317a48ad89968bc563312a1c5804 /python/pyspark
parent    f10de042b8e86adf51b70bae2d8589a5cbf02935 (diff)
Documentation: Encourage use of reduceByKey instead of groupByKey.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #784 from pwendell/group-by-key and squashes the following commits:
9b4505f [Patrick Wendell] Small fix
6347924 [Patrick Wendell] Documentation: Encourage use of reduceByKey instead of groupByKey.
Diffstat (limited to 'python/pyspark')
-rw-r--r--  python/pyspark/rdd.py  4
1 file changed, 4 insertions(+), 0 deletions(-)
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 4f74824ba4..07578b8d93 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -1152,6 +1152,10 @@ class RDD(object):
         Group the values for each key in the RDD into a single sequence.
         Hash-partitions the resulting RDD with into numPartitions partitions.
 
+        Note: If you are grouping in order to perform an aggregation (such as a
+        sum or average) over each key, using reduceByKey will provide much better
+        performance.
+
         >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
         >>> map((lambda (x,y): (x, list(y))), sorted(x.groupByKey().collect()))
         [('a', [1, 1]), ('b', [1])]
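To illustrate the note this commit adds, here is a minimal PySpark sketch (not part of the commit; it assumes an already-initialized SparkContext named sc, as in the doctest above) computing a per-key sum both ways:

# Sketch only: assumes an existing SparkContext `sc`.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# Preferred: reduceByKey merges values within each partition before the
# shuffle, so only one partial sum per key crosses the network.
sums = pairs.reduceByKey(lambda a, b: a + b)
print(sorted(sums.collect()))  # [('a', 2), ('b', 1)]

# Slower equivalent: groupByKey shuffles every individual value across
# the network, and the sum is computed only after grouping.
grouped_sums = pairs.groupByKey().mapValues(sum)
print(sorted(grouped_sums.collect()))  # [('a', 2), ('b', 1)]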