path: root/docs/configuration.md
author     Davies Liu <davies@databricks.com>  2014-11-02 00:03:51 -0700
committer  Matei Zaharia <matei@databricks.com>  2014-11-02 00:03:51 -0700
commit     6181577e9935f46b646ba3925b873d031aa3d6ba (patch)
tree       84a704bd54be30393f71e744351e2e74903a311c /docs/configuration.md
parent     23f966f47523f85ba440b4080eee665271f53b5e (diff)
[SPARK-3466] Limit size of results that a driver collects for each action
Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too much data. This PR introduces spark.driver.maxResultSize; once it is set, the driver will abort a job if its result is larger than this limit. By default it is 1g (for backward compatibility in most cases). In local mode the driver and executor share the same JVM, so the default setting cannot protect the JVM from OOM.

cc mateiz

Author: Davies Liu <davies@databricks.com>

Closes #3003 from davies/collect and squashes the following commits:

248ed5e [Davies Liu] fix compile
272522e [Davies Liu] address comments
2c35773 [Davies Liu] add sizes in message of abort()
5d62303 [Davies Liu] address comments
bc3c077 [Davies Liu] Merge branch 'master' of github.com:apache/spark into collect
11f97c5 [Davies Liu] address comments
47b144f [Davies Liu] check the size of result before send and fetch
3d81af2 [Davies Liu] address comments
ca8267d [Davies Liu] limit the size of data by collect
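For illustration only (not part of this commit), a minimal Scala sketch of how the new setting might be applied when constructing a SparkContext; the 4g value and the application name are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Raise the cap on collected results to 4g; a value of "0" would disable the limit.
val conf = new SparkConf()
  .setAppName("ResultSizeDemo")
  .set("spark.driver.maxResultSize", "4g")
val sc = new SparkContext(conf)

// collect() will now abort the job (rather than risk a driver OOM) if the
// serialized results across all partitions exceed the configured limit.
val results = sc.parallelize(1 to 1000000).collect()
```

The same property could presumably also be supplied at submit time (e.g. `spark-submit --conf spark.driver.maxResultSize=4g`) or placed in `conf/spark-defaults.conf`.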
Diffstat (limited to 'docs/configuration.md')
-rw-r--r--  docs/configuration.md  12
1 file changed, 12 insertions, 0 deletions
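The squashed commits mention checking the size of each result before send and fetch; below is a hedged sketch of that driver-side bookkeeping. Names such as ResultSizeTracker and addResult are illustrative and are not Spark's actual internals.

```scala
// Illustrative only: accumulate serialized task-result sizes on the driver and
// signal an abort once the configured spark.driver.maxResultSize cap is exceeded.
class ResultSizeTracker(maxResultSize: Long) {
  private var totalResultSize = 0L

  /** Returns true if the result still fits; false means the job should be aborted. */
  def addResult(sizeInBytes: Long): Boolean = {
    totalResultSize += sizeInBytes
    maxResultSize <= 0 || totalResultSize <= maxResultSize
  }
}
```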
diff --git a/docs/configuration.md b/docs/configuration.md
index 3007706a25..099972ca1a 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -112,6 +112,18 @@ of the most common options to set are:
</td>
</tr>
<tr>
+ <td><code>spark.driver.maxResultSize</code></td>
+ <td>1g</td>
+ <td>
+ Limit of total size of serialized results of all partitions for each Spark action (e.g. collect).
+ Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size
+ is above this limit.
+ Having a high limit may cause out-of-memory errors in the driver (depending on spark.driver.memory
+ and the memory overhead of objects in the JVM). Setting a proper limit can protect the driver from
+ out-of-memory errors.
+ </td>
+</tr>
+<tr>
<td><code>spark.serializer</code></td>
<td>org.apache.spark.serializer.<br />JavaSerializer</td>
<td>