diff options
author | Eric Moyer <eric_moyer@yahoo.com> | 2015-01-08 11:55:23 -0800 |
---|---|---|
committer | Andrew Or <andrew@databricks.com> | 2015-01-08 11:55:23 -0800 |
commit | 538f221627930c8f8a138c0d21d9fa09bc789e67 (patch) | |
tree | 9622c9db88df2da47bcce752e51b3f77e19c1e1e /core | |
parent | 0760787da885187b0c6dcd5c28753f0ab014d5ed (diff) | |
download | spark-538f221627930c8f8a138c0d21d9fa09bc789e67.tar.gz spark-538f221627930c8f8a138c0d21d9fa09bc789e67.tar.bz2 spark-538f221627930c8f8a138c0d21d9fa09bc789e67.zip |
Document that groupByKey will OOM for large keys
This pull request is my own work and I license it under Spark's open-source license.
This contribution is an improvement to the documentation. I documented that the maximum number of values per key for groupByKey is limited by available RAM (see [Datablox][datablox link] and [the spark mailing list][list link]).
Just saying that better performance is available is not sufficient. Sometimes you need to do a group-by - your operation needs all the items available in order to complete. This warning explains the problem.
[datablox link]: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
[list link]: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11466.html
Author: Eric Moyer <eric_moyer@yahoo.com>
Closes #3936 from RadixSeven/better-group-by-docs and squashes the following commits:
5b6f4e9 [Eric Moyer] groupByKey docs naming updates
238e81b [Eric Moyer] Doc that groupByKey will OOM for large keys
Diffstat (limited to 'core')
-rw-r--r-- | core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala | 6 |
1 files changed, 6 insertions, 0 deletions
diff --git a/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala b/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala index f8df5b2a08..38f8f36a4a 100644 --- a/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala +++ b/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala @@ -437,6 +437,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)]) * Note: This operation may be very expensive. If you are grouping in order to perform an * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]] * or [[PairRDDFunctions.reduceByKey]] will provide much better performance. + * + * Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any + * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]]. */ def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = { // groupByKey shouldn't use map side combine because map side combine does not @@ -458,6 +461,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)]) * Note: This operation may be very expensive. If you are grouping in order to perform an * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]] * or [[PairRDDFunctions.reduceByKey]] will provide much better performance. + * + * Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any + * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]]. */ def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = { groupByKey(new HashPartitioner(numPartitions)) |