author | Patrick Wendell <pwendell@gmail.com> | 2014-05-14 22:24:04 -0700 |
---|---|---|
committer | Patrick Wendell <pwendell@gmail.com> | 2014-05-14 22:24:04 -0700 |
commit | 21570b463388194877003318317aafd842800cac | |
tree | b7645576a7c1317a48ad89968bc563312a1c5804 /docs | |
parent | f10de042b8e86adf51b70bae2d8589a5cbf02935 | |
Documentation: Encourage use of reduceByKey instead of groupByKey.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #784 from pwendell/group-by-key and squashes the following commits:
9b4505f [Patrick Wendell] Small fix
6347924 [Patrick Wendell] Documentation: Encourage use of reduceByKey instead of groupByKey.
Diffstat (limited to 'docs')
-rw-r--r-- | docs/scala-programming-guide.md | 4 |
1 files changed, 4 insertions, 0 deletions
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 3ed86e460c..edaa7d0639 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -196,6 +196,10 @@ The following tables list the transformations and actions currently supported (s
 <tr>
 <td> <b>groupByKey</b>([<i>numTasks</i>]) </td>
 <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. <br />
+<b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or
+ average) over each key, using `reduceByKey` or `combineByKey` will yield much better
+ performance.
+<br />
 <b>Note:</b> By default, if the RDD already has a partitioner, the task number is decided by the partition number of the partitioner, or else relies on the value of <code>spark.default.parallelism</code> if the property is set , otherwise depends on the partition number of the RDD. You can pass an optional <code>numTasks</code> argument to set a different number of tasks.
 </td>
 </tr>
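The note added by this patch can be illustrated with a per-key sum written both ways. This is a minimal sketch, not part of the commit: the object name `SumByKeySketch`, the `local[2]` master, the app name, and the sample data are all illustrative assumptions, and it presumes Spark is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Pair-RDD implicits; required on Spark 1.x, redundant on later versions.
import org.apache.spark.SparkContext._

// Hypothetical example object; not part of the patch.
object SumByKeySketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions chosen for illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("sum-by-key-sketch")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

      // Grouping first: every value is sent through the shuffle,
      // and the sum is only computed afterwards.
      val viaGroup = pairs.groupByKey().mapValues(_.sum).collectAsMap()

      // reduceByKey: values are combined within each partition
      // before the shuffle, so less data crosses the network.
      val viaReduce = pairs.reduceByKey(_ + _).collectAsMap()

      // Both approaches produce the same per-key sums.
      assert(viaGroup == viaReduce)
      println(viaReduce)
    } finally {
      sc.stop()
    }
  }
}
```

Both variants return the same sums; the difference is in how much data is shuffled, which is why the note recommends `reduceByKey` or `combineByKey` when the grouping only feeds an aggregation.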