author: Eric Moyer <eric_moyer@yahoo.com> 2015-01-08 11:55:23 -0800
committer: Andrew Or <andrew@databricks.com> 2015-01-08 11:55:23 -0800
commit: 538f221627930c8f8a138c0d21d9fa09bc789e67
tree: 9622c9db88df2da47bcce752e51b3f77e19c1e1e /bagel/pom.xml
parent: 0760787da885187b0c6dcd5c28753f0ab014d5ed
Document that groupByKey will OOM for large keys
This pull request is my own work and I license it under Spark's open-source license.

This contribution is an improvement to the documentation. I documented that the maximum number of values per key for groupByKey is limited by available RAM (see [the Databricks Spark Knowledge Base][datablox link] and [the Spark mailing list][list link]).

Just saying that better performance is available is not sufficient. Sometimes you need to do a group-by: your operation needs all the items for a key available in order to complete. This warning explains the problem.

[datablox link]: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
[list link]: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11466.html

Author: Eric Moyer <eric_moyer@yahoo.com>

Closes #3936 from RadixSeven/better-group-by-docs and squashes the following commits:

5b6f4e9 [Eric Moyer] groupByKey docs naming updates
238e81b [Eric Moyer] Doc that groupByKey will OOM for large keys
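
The memory concern described in the commit message can be illustrated with a small plain-Python model. This is only a sketch and does not use Spark or reproduce its implementation: the helpers `group_by_key`, `reduce_by_key`, and `skewed_pairs` are hypothetical names introduced here for illustration. It shows that a group-by has to hold every value of a key at once, while a reduce-by-key keeps a single running accumulator per key.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Model of a group-by: every value of a key is held in memory together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(pairs, combine):
    """Model of a reduce-by-key: one running accumulator per key."""
    acc = {}
    for key, value in pairs:
        acc[key] = combine(acc[key], value) if key in acc else value
    return acc


def skewed_pairs(n):
    """Yield n pairs that share one 'hot' key, plus a single pair for a 'rare' key."""
    for i in range(n):
        yield ("hot", i)
    yield ("rare", 1)


# Grouping materializes a list whose length equals the number of values for the key.
grouped = group_by_key(skewed_pairs(100_000))
largest_group = max(len(values) for values in grouped.values())

# Reducing stores only one number per key, however many values arrive.
reduced = reduce_by_key(skewed_pairs(100_000), lambda a, b: a + b)

print("largest group held in memory:", largest_group)
print("accumulators held by the reduce:", len(reduced))
```

In this model the size of the largest group grows with the number of values for the hot key, which is the quantity the commit message says is bounded by available RAM. When an operation genuinely needs all values of a key together, the reduce-based form is not a substitute, which is the case the documentation warning addresses.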
Diffstat (limited to 'bagel/pom.xml')
0 files changed, 0 insertions, 0 deletions