From 21570b463388194877003318317aafd842800cac Mon Sep 17 00:00:00 2001
From: Patrick Wendell
Date: Wed, 14 May 2014 22:24:04 -0700
Subject: Documentation: Encourage use of reduceByKey instead of groupByKey.

Author: Patrick Wendell

Closes #784 from pwendell/group-by-key and squashes the following commits:

9b4505f [Patrick Wendell] Small fix
6347924 [Patrick Wendell] Documentation: Encourage use of reduceByKey instead of groupByKey.
---
 docs/scala-programming-guide.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 3ed86e460c..edaa7d0639 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -196,6 +196,10 @@ The following tables list the transformations and actions currently supported (s
   <td> <b>groupByKey</b>([<i>numTasks</i>]) </td>
   <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. <br />
+<b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or
+  average) over each key, using `reduceByKey` or `combineByKey` will yield much better
+  performance.
+<br />
 <b>Note:</b> By default, if the RDD already has a partitioner, the number of tasks is
 determined by the number of partitions of that partitioner; otherwise it relies on the value
 of <code>spark.default.parallelism</code> if the property is set, or else on the number of
 partitions of the RDD. You can pass an optional <code>numTasks</code> argument to set a
 different number of tasks.
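
For illustration, a minimal sketch of the pattern the note recommends, using the Spark 1.x Scala API this guide covers; the sample data, app name, and local master are made-up, and the extra `SparkContext._` import is only needed on older 1.x releases:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark < 1.3

// Hypothetical local context, just to make the sketch runnable.
val sc = new SparkContext(new SparkConf().setAppName("GroupVsReduce").setMaster("local"))

// Toy (K, V) pairs.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey shuffles every value for a key and materializes the full
// collection per key before we sum it.
val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so only partial sums are
// shuffled. The optional second argument is the numTasks knob described
// in the surrounding doc text.
val sums = pairs.reduceByKey(_ + _, 4)

// combineByKey suits aggregations whose result type differs from the value
// type, e.g. a per-key average tracked as a (sum, count) pair.
val averages = pairs
  .combineByKey(
    (v: Int) => (v.toDouble, 1L),
    (acc: (Double, Long), v: Int) => (acc._1 + v, acc._2 + 1L),
    (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum / count }
```

The win from `reduceByKey` comes from map-side combining shrinking the shuffle; when you genuinely need every value for a key rather than an aggregate, `groupByKey` remains the right tool.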