Merge pull request #377 from andrewor14/master

External Sorting for Aggregator and CoGroupedRDDs (Revisited)

(This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving)

The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.

The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order.

Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
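The spill-and-merge idea described above can be illustrated with a small Python sketch. This is not Spark's ExternalAppendOnlyMap: the class name `SpillableMap`, the `max_entries` threshold (a stand-in for the memory threshold), the use of `pickle` for the on-disk format, and sorting by the key itself are all assumptions made for illustration.

```python
import heapq
import itertools
import os
import pickle
import tempfile


class SpillableMap:
    """Toy analogue of spill-and-merge aggregation (hypothetical, not Spark code)."""

    def __init__(self, combine, max_entries):
        self.combine = combine          # merges two values that share a key
        self.max_entries = max_entries  # assumed stand-in for a memory threshold
        self.current = {}               # the purely in-memory map
        self.spill_paths = []           # one sorted file per spill

    def insert(self, key, value):
        if key in self.current:
            self.current[key] = self.combine(self.current[key], value)
            return
        if len(self.current) >= self.max_entries:
            self._spill()               # make room before admitting a new key
        self.current[key] = value

    def _spill(self):
        # Sort the buffer by key, write it out, then start a fresh map.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            for item in sorted(self.current.items()):
                pickle.dump(item, f)
        self.spill_paths.append(path)
        self.current = {}

    @staticmethod
    def _read(path):
        # Stream (key, value) pairs back from one spill file.
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    return

    def items(self):
        # Merge all sorted runs, combining values whose keys are equal.
        streams = [self._read(p) for p in self.spill_paths]
        streams.append(iter(sorted(self.current.items())))
        merged = heapq.merge(*streams, key=lambda kv: kv[0])
        for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
            values = (v for _, v in group)
            result = next(values)
            for v in values:
                result = self.combine(result, v)
            yield key, result

    def close(self):
        for p in self.spill_paths:
            os.remove(p)
        self.spill_paths = []


# Usage: count words while holding at most two keys in memory at a time.
m = SpillableMap(combine=lambda a, b: a + b, max_entries=2)
for word in ["a", "b", "c", "a", "d", "b", "a"]:
    m.insert(word, 1)
spills = len(m.spill_paths)
counts = dict(m.items())
m.close()
```

When the threshold is never reached, `items()` simply iterates over the in-memory map, which mirrors the low-overhead case the message describes. Because each spill is sorted, the final merge only needs to hold one record per run in memory at a time.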
author: Patrick Wendell <pwendell@gmail.com> 2014-01-10 16:25:01 -0800
committer: Patrick Wendell <pwendell@gmail.com> 2014-01-10 16:25:01 -0800
commit: d37408f39ca3fd94f45b50a65f919f4d7007a533 (patch)
tree: 156e7f6639c22f919a932db2a9b90e803d26c94d /docs/configuration.md
parent: 0eaf01c5ed856c9aeb60c0841c3be9305c6da174 (diff)
parent: 2e393cd5fdfbf3a85fced370b5c42315e86dad49 (diff)
1 files changed, 21 insertions, 2 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index b1a0e19167..ad75e06fc7 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -104,14 +104,25 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 <tr>
   <td>spark.storage.memoryFraction</td>
-  <td>0.66</td>
+  <td>0.6</td>
   <td>
     Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old"
-    generation of objects in the JVM, which by default is given 2/3 of the heap, but you can increase
+    generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase
     it if you configure your own old generation size.
   </td>
 </tr>
 <tr>
+  <td>spark.shuffle.memoryFraction</td>
+  <td>0.3</td>
+  <td>
+    Fraction of Java heap to use for aggregation and cogroups during shuffles, if
+    <code>spark.shuffle.externalSorting</code> is enabled. At any given time, the collective size of
+    all in-memory maps used for shuffles is bounded by this limit, beyond which the contents will
+    begin to spill to disk. If spills are often, consider increasing this value at the expense of
+    <code>spark.storage.memoryFraction</code>.
+  </td>
+</tr>
+<tr>
   <td>spark.mesos.coarse</td>
   <td>false</td>
   <td>
@@ -377,6 +388,14 @@ Apart from these, the following properties are also available, and may be useful
   </td>
 </tr>
 <tr>
+  <td>spark.shuffle.externalSorting</td>
+  <td>true</td>
+  <td>
+    If set to "true", limits the amount of memory used during reduces by spilling data out to disk. This spilling
+    threshold is specified by <code>spark.shuffle.memoryFraction</code>.
+  </td>
+</tr>
+<tr>
   <td>spark.speculation</td>
   <td>false</td>
   <td>
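
To see how the two documented defaults relate, the following Python sketch splits a heap size into a cache budget and a shuffle budget using 0.6 for `spark.storage.memoryFraction` and 0.3 for `spark.shuffle.memoryFraction`. The function name `memory_budgets`, the 4 GiB heap, and the check that the fractions do not exceed 1.0 are assumptions for illustration; the patch does not show how Spark itself performs this accounting.

```python
def memory_budgets(heap_bytes, storage_fraction=0.6, shuffle_fraction=0.3):
    """Split a heap into cache and shuffle budgets (illustrative, not Spark code)."""
    if storage_fraction + shuffle_fraction > 1.0:
        # Assumed sanity check; not taken from the patch.
        raise ValueError("fractions must not add up to more than the heap")
    storage = int(heap_bytes * storage_fraction)
    shuffle = int(heap_bytes * shuffle_fraction)
    return {
        "storage": storage,                        # memory cache
        "shuffle": shuffle,                        # in-memory shuffle maps
        "other": heap_bytes - storage - shuffle,   # everything else
    }


# Hypothetical 4 GiB executor heap.
heap = 4 * 1024**3
budgets = memory_budgets(heap)
```

Raising the shuffle fraction to reduce spilling only leaves the remainder unchanged if the storage fraction is lowered by the same amount, which is the trade-off the new documentation text points at.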