[SPARK-16764][SQL] Recommend disabling vectorized parquet reader on OutOfMemoryError

## What changes were proposed in this pull request?

We currently don't bound or manage the size of the data arrays used by column vectors in the vectorized reader (they are bounded only by `Integer.MAX_VALUE`), which may lead to OOMs while reading data. As a short-term fix, this patch intercepts the `OutOfMemoryError` and suggests that the user disable the vectorized Parquet reader.

## How was this patch tested?

Existing tests.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14387 from sameeragarwal/oom.
author: Sameer Agarwal <sameerag@cs.berkeley.edu> 2016-07-28 13:04:19 -0700
committer: Reynold Xin <rxin@databricks.com> 2016-07-28 13:04:19 -0700
commit: 3fd39b87bda77f3c3a4622d854f23d4234683571 (patch)
tree: b8cd04695535ac24e1e644e02d8e5fafa687273a
parent: 1178d61ede816bf1c8d5bb3dbb3b965c9b944407 (diff)
download: spark-3fd39b87bda77f3c3a4622d854f23d4234683571.tar.gz
spark-3fd39b87bda77f3c3a4622d854f23d4234683571.tar.bz2
spark-3fd39b87bda77f3c3a4622d854f23d4234683571.zip
1 files changed, 19 insertions, 5 deletions
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
index bbbb796aca..59173d253b 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
@@ -282,16 +282,30 @@ public abstract class ColumnVector implements AutoCloseable {
     if (requiredCapacity > capacity) {
       int newCapacity = (int) Math.min(MAX_CAPACITY, requiredCapacity * 2L);
       if (requiredCapacity <= newCapacity) {
-        reserveInternal(newCapacity);
+        try {
+          reserveInternal(newCapacity);
+        } catch (OutOfMemoryError outOfMemoryError) {
+          throwUnsupportedException(newCapacity, requiredCapacity, outOfMemoryError);
+        }
       } else {
-        throw new RuntimeException("Cannot reserve more than " + newCapacity +
-            " bytes in the vectorized reader (requested = " + requiredCapacity + " bytes). As a " +
-            "workaround, you can disable the vectorized reader by setting "
-            + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + " to false.");
+        throwUnsupportedException(newCapacity, requiredCapacity, null);
       }
     }
   }
 
+  private void throwUnsupportedException(int newCapacity, int requiredCapacity, Throwable cause) {
+    String message = "Cannot reserve more than " + newCapacity +
+        " bytes in the vectorized reader (requested = " + requiredCapacity + " bytes). As a" +
+        " workaround, you can disable the vectorized reader by setting "
+        + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + " to false.";
+
+    if (cause != null) {
+      throw new RuntimeException(message, cause);
+    } else {
+      throw new RuntimeException(message);
+    }
+  }
+
   /**
    * Ensures that there is enough storage to store capcity elements. That is, the put() APIs
    * must work for all rowIds < capcity.
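
The control flow introduced by the hunk above can be tried in isolation with a minimal, self-contained sketch. This is not the Spark implementation: `ReserveSketch`, the `Allocator` hook, and the `reserve` signature are hypothetical names made up for illustration, and the configuration key is written out as a string literal (`spark.sql.parquet.enableVectorizedReader`) rather than read from `SQLConf`. The sketch only shows the pattern of doubling the requested capacity, catching an `OutOfMemoryError` from the allocation, and rethrowing it as a `RuntimeException` that carries the workaround hint and keeps the original error as its cause.

```java
public class ReserveSketch {
    // Assumption: mirrors the MAX_CAPACITY bound referenced in the diff.
    static final int MAX_CAPACITY = Integer.MAX_VALUE;

    // Assumption: key of SQLConf.PARQUET_VECTORIZED_READER_ENABLED, inlined as a literal.
    static final String VECTORIZED_READER_KEY = "spark.sql.parquet.enableVectorizedReader";

    /** Hypothetical hook standing in for reserveInternal, so a failure can be simulated. */
    interface Allocator {
        void allocate(int capacity);
    }

    static String message(int newCapacity, int requiredCapacity) {
        return "Cannot reserve more than " + newCapacity
            + " bytes in the vectorized reader (requested = " + requiredCapacity + " bytes). As a"
            + " workaround, you can disable the vectorized reader by setting "
            + VECTORIZED_READER_KEY + " to false.";
    }

    /** Returns the capacity after the call; grows to twice the requested size when needed. */
    static int reserve(int capacity, int requiredCapacity, Allocator allocator) {
        if (requiredCapacity <= capacity) {
            return capacity; // already large enough, nothing to allocate
        }
        // Use long arithmetic for the doubling so it cannot overflow an int.
        int newCapacity = (int) Math.min(MAX_CAPACITY, requiredCapacity * 2L);
        if (requiredCapacity > newCapacity) {
            throw new RuntimeException(message(newCapacity, requiredCapacity));
        }
        try {
            allocator.allocate(newCapacity);
        } catch (OutOfMemoryError oom) {
            // Keep the original error as the cause so the stack trace is not lost.
            throw new RuntimeException(message(newCapacity, requiredCapacity), oom);
        }
        return newCapacity;
    }

    public static void main(String[] args) {
        // Successful growth: 20 elements requested, capacity doubles to 40.
        System.out.println(reserve(16, 20, c -> { }));

        // Simulated allocation failure: the OutOfMemoryError is wrapped, not swallowed.
        try {
            reserve(16, 20, c -> { throw new OutOfMemoryError("simulated"); });
        } catch (RuntimeException e) {
            System.out.println(e.getCause() instanceof OutOfMemoryError);
            System.out.println(e.getMessage());
        }
    }
}
```

Passing the allocation step in as a lambda is purely a convenience for the sketch: it lets the failure path run without actually exhausting the heap.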