path: root/docs/configuration.md
author	Matei Zaharia <matei@databricks.com>	2014-08-07 18:04:49 -0700
committer	Reynold Xin <rxin@apache.org>	2014-08-07 18:04:49 -0700
commit	6906b69cf568015f20c7d7c77cbcba650e5431a9 (patch)
tree	fbec9283bed82e0b2f0baa8c25834f3a54fac5ba /docs/configuration.md
parent	32096c2aed9978cfb9a904b4f56bb61800d17e9e (diff)
SPARK-2787: Make sort-based shuffle write files directly when there's no sorting/aggregation and # partitions is small
As described in https://issues.apache.org/jira/browse/SPARK-2787, right now sort-based shuffle is more expensive than hash-based for map operations that do no partial aggregation or sorting, such as groupByKey. This is because it has to serialize each data item twice (once when spilling to intermediate files, and then again when merging these files object-by-object). This patch adds a code path to just write separate files directly if the number of output partitions is small, and concatenate them at the end to produce a sorted file.

On the unit test side, I added some tests that force or don't force this bypass path to be used, and checked that our tests for other features (e.g. all the operations) cover both cases.

Author: Matei Zaharia <matei@databricks.com>

Closes #1799 from mateiz/SPARK-2787 and squashes the following commits:

88cf26a [Matei Zaharia] Fix rebase
10233af [Matei Zaharia] Review comments
398cb95 [Matei Zaharia] Fix looking up shuffle manager in conf
ca3efd9 [Matei Zaharia] Add docs for shuffle manager properties, and allow short names for them
d0ae3c5 [Matei Zaharia] Fix some comments
90d084f [Matei Zaharia] Add code path to bypass merge-sort in ExternalSorter, and tests
31e5d7c [Matei Zaharia] Move existing logic for writing partitioned files into ExternalSorter
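For context, a minimal sketch of how the two properties documented by this patch might be set from application code. The property names and defaults come from the diff below; the SparkConf-based setup around them is only an illustrative assumption, not part of the patch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup only: the two shuffle properties are the ones documented
// in this patch; the surrounding application configuration is hypothetical.
val conf = new SparkConf()
  .setAppName("shuffle-config-example")
  // Opt in to the experimental sort-based shuffle manager (default is HASH).
  .set("spark.shuffle.manager", "SORT")
  // With no map-side aggregation and at most this many reduce partitions,
  // the sort-based path writes per-partition files directly instead of
  // merge-sorting spilled data.
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")

val sc = new SparkContext(conf)
```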
Diffstat (limited to 'docs/configuration.md')
-rw-r--r--	docs/configuration.md	| 18
1 file changed, 18 insertions(+), 0 deletions(-)
diff --git a/docs/configuration.md b/docs/configuration.md
index 5e3eb0f087..4d27c5a918 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -281,6 +281,24 @@ Apart from these, the following properties are also available, and may be useful
overhead per reduce task, so keep it small unless you have a large amount of memory.
</td>
</tr>
+<tr>
+ <td><code>spark.shuffle.manager</code></td>
+ <td>HASH</td>
+ <td>
+ Implementation to use for shuffling data. A hash-based shuffle manager is the default, but
+ starting in Spark 1.1 there is an experimental sort-based shuffle manager that is more
+ memory-efficient in environments with small executors, such as YARN. To use that, change
+ this value to <code>SORT</code>.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.shuffle.sort.bypassMergeThreshold</code></td>
+ <td>200</td>
+ <td>
+ (Advanced) In the sort-based shuffle manager, avoid merge-sorting data if there is no
+ map-side aggregation and there are at most this many reduce partitions.
+ </td>
+</tr>
</table>
#### Spark UI
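As a follow-up illustration (not part of the patch), a hypothetical workload where the new bypass path would apply: groupByKey performs no map-side aggregation, so with a reduce-partition count at or below spark.shuffle.sort.bypassMergeThreshold, the ExternalSorter can write per-partition files directly and concatenate them. The job below is an assumed example built on the `sc` from the earlier sketch:

```scala
// Needed in Spark 1.1 for the implicit conversion to PairRDDFunctions.
import org.apache.spark.SparkContext._

// Hypothetical workload: groupByKey does no partial (map-side) aggregation,
// and 100 reduce partitions is below the default bypassMergeThreshold of 200,
// so under the SORT shuffle manager this shuffle can skip merge-sorting and
// write per-partition files directly.
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, i))
val grouped = pairs.groupByKey(100)
grouped.count()
```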