author    | Cheng Lian <lian@databricks.com> | 2015-08-12 20:01:34 +0800
committer | Cheng Lian <lian@databricks.com> | 2015-08-12 20:01:34 +0800
commit    | 3ecb3794302dc12d0989f8d725483b2cc37762cf (patch)
tree      | b8c3a132482fe273a71f3f9bb2235bebc395744e /sql/core/src/test/README.md
parent    | 9d0822455ddc8d765440d58c463367a4d67ef456 (diff)
[SPARK-9407] [SQL] Relaxes Parquet ValidTypeMap to allow ENUM predicates to be pushed down
This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or higher versions.
In Parquet, not all types of columns can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions, this limitation is too strict, and doesn't allow `BINARY (ENUM)` columns to be pushed down. On the other hand, `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`.
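For context, here is a minimal, hypothetical sketch of how such a file comes about: `parquet-avro` writing records whose Avro `enum` field is stored as a `BINARY (ENUM)` Parquet column. The schema, field names, and output path are illustrative only, and the single-argument writer constructor is the parquet-mr 1.7.0-era API (later versions use a builder):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Illustrative Avro schema with an enum field. parquet-avro stores the
// `suit` field as a Parquet BINARY column annotated with ENUM.
val schema = new Schema.Parser().parse(
  """{
    |  "type": "record", "name": "Card",
    |  "fields": [{
    |    "name": "suit",
    |    "type": {"type": "enum", "name": "Suit",
    |             "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}
    |  }]
    |}""".stripMargin)

val writer = new AvroParquetWriter[GenericRecord](new Path("/tmp/cards.parquet"), schema)
val record = new GenericData.Record(schema)
record.put("suit", new GenericData.EnumSymbol(schema.getField("suit").schema, "SPADES"))
writer.write(record)
writer.close()
```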
This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` column is recognized as one involving a string field instead, and can be pushed down by the query optimizer. Such predicates are actually perfectly legal, except that they fail the `ValidTypeMap` check.
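A hedged illustration of the mismatch, using the Spark 1.5-era `SQLContext` API (`sqlContext` is assumed to be in scope; the path and schema output are illustrative):

```scala
// Reading the file above with Spark SQL: the BINARY (ENUM) column
// surfaces as a plain Catalyst StringType.
val df = sqlContext.read.parquet("/tmp/cards.parquet")
df.printSchema()
// root
//  |-- suit: string (nullable = true)

// An ordinary string predicate. Because `suit` looks like a string field,
// the optimizer translates the comparison into a Parquet filter and tries
// to push it down -- which is exactly where parquet-mr 1.7.0's
// ValidTypeMap rejects the BINARY (ENUM) column.
df.filter(df("suit") === "SPADES").show()
```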
The workaround added here relaxes `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little bit when adding the regression test.
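The shape of the hack, sketched with reflection. This is a reconstruction, not the verbatim patch: the private static `add` method and the nested `FullTypeDescriptor` class, including its constructor arguments, are assumptions about parquet-mr 1.7.0 internals.

```scala
import org.apache.parquet.io.api.Binary
import org.apache.parquet.schema.{OriginalType, PrimitiveType}

// !! HACK ALERT !! Reflectively whitelist BINARY (ENUM) in parquet-mr's
// private ValidTypeMap so that predicates over such columns pass the
// push-down type check. Member names below are assumptions about
// parquet-mr 1.7.0 internals; drop this once PARQUET-201 is fixed
// (parquet-mr 1.8.1 or higher).
def relaxParquetValidTypeMap(): Unit = {
  val mapClass = Class.forName("org.apache.parquet.filter2.predicate.ValidTypeMap")
  val descriptorClass = Class.forName(
    "org.apache.parquet.filter2.predicate.ValidTypeMap$FullTypeDescriptor")

  // FullTypeDescriptor has no public constructor; force one open and
  // build a descriptor for BINARY columns annotated with ENUM.
  val ctor = descriptorClass.getDeclaredConstructors.head
  ctor.setAccessible(true)
  val enumDescriptor = ctor.newInstance(
    PrimitiveType.PrimitiveTypeName.BINARY, OriginalType.ENUM)

  // Invoke the private static ValidTypeMap.add(Class, FullTypeDescriptor).
  val addMethod = mapClass.getDeclaredMethod("add", classOf[Class[_]], descriptorClass)
  addMethod.setAccessible(true)
  addMethod.invoke(null, classOf[Binary], enumDescriptor)
}
```

A call like this would need to run once, before any Parquet filter predicates are constructed, so that the relaxed map is in place when push-down is attempted.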
Author: Cheng Lian <lian@databricks.com>
Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.
Diffstat (limited to 'sql/core/src/test/README.md')
-rw-r--r-- | sql/core/src/test/README.md | 16
1 file changed, 6 insertions, 10 deletions
diff --git a/sql/core/src/test/README.md b/sql/core/src/test/README.md
index 3dd9861b48..421c2ea4f7 100644
--- a/sql/core/src/test/README.md
+++ b/sql/core/src/test/README.md
@@ -6,23 +6,19 @@ The following directories and files are used for Parquet compatibility tests:
 .
 ├── README.md                  # This file
 ├── avro
-│   ├── parquet-compat.avdl    # Testing Avro IDL
-│   └── parquet-compat.avpr    # !! NO TOUCH !! Protocol file generated from parquet-compat.avdl
+│   ├── *.avdl                 # Testing Avro IDL(s)
+│   └── *.avpr                 # !! NO TOUCH !! Protocol files generated from Avro IDL(s)
 ├── gen-java                   # !! NO TOUCH !! Generated Java code
 ├── scripts
-│   └── gen-code.sh            # Script used to generate Java code for Thrift and Avro
+│   ├── gen-avro.sh            # Script used to generate Java code for Avro
+│   └── gen-thrift.sh          # Script used to generate Java code for Thrift
 └── thrift
-    └── parquet-compat.thrift  # Testing Thrift schema
+    └── *.thrift               # Testing Thrift schema(s)
 ```
 
-Generated Java code are used in the following test suites:
-
-- `org.apache.spark.sql.parquet.ParquetAvroCompatibilitySuite`
-- `org.apache.spark.sql.parquet.ParquetThriftCompatibilitySuite`
-
 To avoid code generation during build time, Java code generated from testing Thrift schema and Avro IDL are also checked in.
 
-When updating the testing Thrift schema and Avro IDL, please run `gen-code.sh` to update all the generated Java code.
+When updating the testing Thrift schema and Avro IDL, please run `gen-avro.sh` and `gen-thrift.sh` accordingly to update generated Java code.
 
 ## Prerequisites