aboutsummaryrefslogtreecommitdiff
path: root/sql/core/src/test/scripts
diff options
context:
space:
mode:
authorCheng Lian <lian@databricks.com>2015-08-12 20:01:34 +0800
committerCheng Lian <lian@databricks.com>2015-08-12 20:01:34 +0800
commit3ecb3794302dc12d0989f8d725483b2cc37762cf (patch)
treeb8c3a132482fe273a71f3f9bb2235bebc395744e /sql/core/src/test/scripts
parent9d0822455ddc8d765440d58c463367a4d67ef456 (diff)
downloadspark-3ecb3794302dc12d0989f8d725483b2cc37762cf.tar.gz
spark-3ecb3794302dc12d0989f8d725483b2cc37762cf.tar.bz2
spark-3ecb3794302dc12d0989f8d725483b2cc37762cf.zip
[SPARK-9407] [SQL] Relaxes Parquet ValidTypeMap to allow ENUM predicates to be pushed down
This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or higher versions. In Parquet, not all types of columns can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions, this limitation is too strict, and doesn't allow `BINARY (ENUM)` columns to be pushed down. On the other hand, `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`. This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` is recognized as one involving a string field instead and can be pushed down by the query optimizer. Such predicates are actually perfectly legal except that it fails the `ValidTypeMap` check. The workaround added here is relaxing `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little bit when adding regression test. Author: Cheng Lian <lian@databricks.com> Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.
Diffstat (limited to 'sql/core/src/test/scripts')
-rwxr-xr-xsql/core/src/test/scripts/gen-avro.sh (renamed from sql/core/src/test/scripts/gen-code.sh)13
-rwxr-xr-xsql/core/src/test/scripts/gen-thrift.sh27
2 files changed, 33 insertions, 7 deletions
diff --git a/sql/core/src/test/scripts/gen-code.sh b/sql/core/src/test/scripts/gen-avro.sh
index 5d8d8ad085..48174b287f 100755
--- a/sql/core/src/test/scripts/gen-code.sh
+++ b/sql/core/src/test/scripts/gen-avro.sh
@@ -22,10 +22,9 @@ cd -
rm -rf $BASEDIR/gen-java
mkdir -p $BASEDIR/gen-java
-thrift\
- --gen java\
- -out $BASEDIR/gen-java\
- $BASEDIR/thrift/parquet-compat.thrift
-
-avro-tools idl $BASEDIR/avro/parquet-compat.avdl > $BASEDIR/avro/parquet-compat.avpr
-avro-tools compile -string protocol $BASEDIR/avro/parquet-compat.avpr $BASEDIR/gen-java
+for input in `ls $BASEDIR/avro/*.avdl`; do
+ filename=$(basename "$input")
+ filename="${filename%.*}"
+ avro-tools idl $input> $BASEDIR/avro/${filename}.avpr
+ avro-tools compile -string protocol $BASEDIR/avro/${filename}.avpr $BASEDIR/gen-java
+done
diff --git a/sql/core/src/test/scripts/gen-thrift.sh b/sql/core/src/test/scripts/gen-thrift.sh
new file mode 100755
index 0000000000..ada432c68a
--- /dev/null
+++ b/sql/core/src/test/scripts/gen-thrift.sh
@@ -0,0 +1,27 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+cd $(dirname $0)/..
+BASEDIR=`pwd`
+cd -
+
+rm -rf $BASEDIR/gen-java
+mkdir -p $BASEDIR/gen-java
+
+for input in `ls $BASEDIR/thrift/*.thrift`; do
+ thrift --gen java -out $BASEDIR/gen-java $input
+done