[SPARK-9407] [SQL] Relaxes Parquet ValidTypeMap to allow ENUM predicates to be pushed down

This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or higher.

In Parquet, not all types of columns can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions, this limitation is too strict, and doesn't allow predicates on `BINARY (ENUM)` columns to be pushed down. On the other hand, `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`.

This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` column is recognized as one involving a string field instead, and the query optimizer treats it as eligible for push-down. Such predicates are actually perfectly legal, except that they fail the `ValidTypeMap` check.

The workaround added here is relaxing `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little bit while adding a regression test.

Author: Cheng Lian <lian@databricks.com>

Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.
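The type check described above can be illustrated with a small, simplified model. This is only a sketch under stated assumptions: the names `VALID_TYPES`, `can_push_down`, and `relax_for_enum` are hypothetical, the map contents are illustrative, and none of this is parquet-mr's or Spark's actual code. It only shows why a predicate on a `BINARY (ENUM)` column is rejected by a strict map and accepted once the map is relaxed.

```python
# Simplified, hypothetical model of a push-down type check.
# This is NOT parquet-mr's ValidTypeMap; names and contents are assumptions.
# Keys: the value class of the filter column.
# Values: allowed (primitive type, original/logical type) pairs.
VALID_TYPES = {
    "Binary": {("BINARY", None), ("BINARY", "UTF8")},
    "Integer": {("INT32", None)},
}


def can_push_down(valid_types, value_class, primitive, original):
    """Return True if a predicate on this column passes the type check."""
    return (primitive, original) in valid_types.get(value_class, set())


def relax_for_enum(valid_types):
    """Return a copy of the map that also accepts BINARY (ENUM) for binary values."""
    relaxed = {cls: set(pairs) for cls, pairs in valid_types.items()}
    relaxed.setdefault("Binary", set()).add(("BINARY", "ENUM"))
    return relaxed


# A predicate such as suit = 'SPADES' reaches the check as a binary comparison,
# because the enum column is read as a string.
before = can_push_down(VALID_TYPES, "Binary", "BINARY", "ENUM")
after = can_push_down(relax_for_enum(VALID_TYPES), "Binary", "BINARY", "ENUM")
print(before, after)
```

In this model the original map is left untouched and a relaxed copy is returned, which keeps the change easy to revert once the underlying library accepts `BINARY (ENUM)` on its own.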
author: Cheng Lian <lian@databricks.com> 2015-08-12 20:01:34 +0800
committer: Cheng Lian <lian@databricks.com> 2015-08-12 20:01:34 +0800
commit: 3ecb3794302dc12d0989f8d725483b2cc37762cf (patch)
tree: b8c3a132482fe273a71f3f9bb2235bebc395744e /sql/core/src/test/avro
parent: 9d0822455ddc8d765440d58c463367a4d67ef456 (diff)
2 files changed, 24 insertions, 2 deletions
diff --git a/sql/core/src/test/avro/parquet-compat.avdl b/sql/core/src/test/avro/parquet-compat.avdl
index 24729f6143..8070d0a917 100644
--- a/sql/core/src/test/avro/parquet-compat.avdl
+++ b/sql/core/src/test/avro/parquet-compat.avdl
@@ -16,8 +16,19 @@
  */
 
 // This is a test protocol for testing parquet-avro compatibility.
-@namespace("org.apache.spark.sql.parquet.test.avro")
+@namespace("org.apache.spark.sql.execution.datasources.parquet.test.avro")
 protocol CompatibilityTest {
+    enum Suit {
+        SPADES,
+        HEARTS,
+        DIAMONDS,
+        CLUBS
+    }
+
+    record ParquetEnum {
+        Suit suit;
+    }
+
     record Nested {
         array<int> nested_ints_column;
         string nested_string_column;
diff --git a/sql/core/src/test/avro/parquet-compat.avpr b/sql/core/src/test/avro/parquet-compat.avpr
index a83b7c990d..0603917650 100644
--- a/sql/core/src/test/avro/parquet-compat.avpr
+++ b/sql/core/src/test/avro/parquet-compat.avpr
@@ -1,7 +1,18 @@
 {
   "protocol" : "CompatibilityTest",
-  "namespace" : "org.apache.spark.sql.parquet.test.avro",
+  "namespace" : "org.apache.spark.sql.execution.datasources.parquet.test.avro",
   "types" : [ {
+    "type" : "enum",
+    "name" : "Suit",
+    "symbols" : [ "SPADES", "HEARTS", "DIAMONDS", "CLUBS" ]
+  }, {
+    "type" : "record",
+    "name" : "ParquetEnum",
+    "fields" : [ {
+      "name" : "suit",
+      "type" : "Suit"
+    } ]
+  }, {
     "type" : "record",
     "name" : "Nested",
     "fields" : [ {