author | Cheng Lian <lian@databricks.com> | 2015-08-12 20:01:34 +0800 |
---|---|---|
committer | Cheng Lian <lian@databricks.com> | 2015-08-12 20:01:34 +0800 |
commit | 3ecb3794302dc12d0989f8d725483b2cc37762cf (patch) | |
tree | b8c3a132482fe273a71f3f9bb2235bebc395744e /sql/core/src/test/thrift | |
parent | 9d0822455ddc8d765440d58c463367a4d67ef456 (diff) | |
[SPARK-9407] [SQL] Relaxes Parquet ValidTypeMap to allow ENUM predicates to be pushed down
This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or later.
In Parquet, not all column types can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions, this check is too strict and doesn't allow predicates on `BINARY (ENUM)` columns to be pushed down. On the other hand, `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`.
This restriction is problematic for Spark SQL, because Spark SQL has no type that maps directly to Parquet `BINARY (ENUM)` and always converts `BINARY (ENUM)` to Catalyst `StringType`. A predicate on a `BINARY (ENUM)` column is therefore treated as a predicate on a string field, and the query optimizer may push it down. Such predicates are perfectly legal, but they fail the `ValidTypeMap` check.
The workaround added here relaxes `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little while adding a regression test.
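To make the check described above concrete, here is a minimal standalone sketch in Java. It is a simplified model, not the parquet-mr `ValidTypeMap` class or the code in this PR: the class `ValidTypeMapSketch`, its enums, and its methods are all hypothetical names chosen for illustration. It shows a map from a predicate's value class to the column types it may be compared against, and how adding one `BINARY (ENUM)` entry lets a string predicate pass.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of a "valid type map" check.
// It does NOT use or reproduce the real parquet-mr API.
public class ValidTypeMapSketch {
    // Stand-ins for Parquet physical and logical (original) type names.
    public enum Primitive { BINARY, INT32 }
    public enum Logical { NONE, UTF8, ENUM }

    // Predicate value class -> column types it is allowed to filter on.
    private final Map<Class<?>, Set<String>> valid = new HashMap<>();

    public void allow(Class<?> valueClass, Primitive p, Logical l) {
        valid.computeIfAbsent(valueClass, k -> new HashSet<>()).add(key(p, l));
    }

    public boolean isValid(Class<?> valueClass, Primitive p, Logical l) {
        Set<String> types = valid.get(valueClass);
        return types != null && types.contains(key(p, l));
    }

    private static String key(Primitive p, Logical l) {
        return p + "/" + l;
    }

    // Assumed strict configuration: string values may only target
    // plain BINARY or BINARY (UTF8) columns.
    public static ValidTypeMapSketch strict() {
        ValidTypeMapSketch m = new ValidTypeMapSketch();
        m.allow(String.class, Primitive.BINARY, Logical.NONE);
        m.allow(String.class, Primitive.BINARY, Logical.UTF8);
        m.allow(Integer.class, Primitive.INT32, Logical.NONE);
        return m;
    }

    // Relaxed configuration: additionally accept BINARY (ENUM).
    public static ValidTypeMapSketch relaxed() {
        ValidTypeMapSketch m = strict();
        m.allow(String.class, Primitive.BINARY, Logical.ENUM);
        return m;
    }

    public static void main(String[] args) {
        // A string predicate against a BINARY (ENUM) column.
        System.out.println(strict().isValid(String.class, Primitive.BINARY, Logical.ENUM));
        System.out.println(relaxed().isValid(String.class, Primitive.BINARY, Logical.ENUM));
    }
}
```

In this model the strict map rejects the predicate and the relaxed map accepts it, which mirrors the effect the workaround is meant to have on push-down for enum columns.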
Author: Cheng Lian <lian@databricks.com>
Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.
Diffstat (limited to 'sql/core/src/test/thrift')
-rw-r--r-- | sql/core/src/test/thrift/parquet-compat.thrift | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/sql/core/src/test/thrift/parquet-compat.thrift b/sql/core/src/test/thrift/parquet-compat.thrift
index fa5ed8c623..98bf778aec 100644
--- a/sql/core/src/test/thrift/parquet-compat.thrift
+++ b/sql/core/src/test/thrift/parquet-compat.thrift
@@ -15,7 +15,7 @@
  * limitations under the License.
  */
 
-namespace java org.apache.spark.sql.parquet.test.thrift
+namespace java org.apache.spark.sql.execution.datasources.parquet.test.thrift
 
 enum Suit {
     SPADES,