[SPARK-6777] [SQL] Implements backwards compatibility rules in CatalystSchemaConverter

This PR introduces `CatalystSchemaConverter` for converting Parquet schemas to Spark SQL schemas and vice versa. The original conversion code in `ParquetTypesConverter` is removed. Benefits of the new version are:

1. When converting Spark SQL schemas, it generates standard Parquet schemas conforming to [the most updated Parquet format spec] [1]. Converting to old style Parquet schemas is also supported via the feature flag `spark.sql.parquet.followParquetFormatSpec` (which is set to `false` for now, and should be set to `true` after both read and write paths are fixed).

   Note that although this version of the Parquet format spec hasn't been officially released yet, Parquet MR 1.7.0 already sticks to it, so it should be safe to follow.

1. It implements the backwards-compatibility rules described in the most updated Parquet format spec, and thus can recognize more schema patterns generated by other/legacy systems/tools.

1. Code organization follows the convention used in [parquet-mr] [2], which is easier to follow. (The structure of `CatalystSchemaConverter` is similar to `AvroSchemaConverter`.)

To fully implement the backwards-compatibility rules in both read and write paths, we also need to update `CatalystRowConverter` (which is responsible for converting Parquet records to `Row`s), `RowReadSupport`, and `RowWriteSupport`. These will be done in follow-up PRs.

TODO

- [x] More schema conversion test cases for legacy schema patterns.

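As an illustration only (not part of this patch), the feature flag described above could be switched on per session roughly as follows. This is a minimal sketch assuming a Spark 1.x application with a local master; the object name, app name, and master URL are hypothetical, and only the flag name comes from the description above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver program; names here are placeholders for illustration.
object FollowParquetFormatSpecSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("parquet-flag-sketch") // assumed app name
      .setMaster("local[*]")             // assumed local master
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Opt in to schemas that follow the Parquet format spec.
    // Per the description above, the flag is `false` by default for now.
    sqlContext.setConf("spark.sql.parquet.followParquetFormatSpec", "true")

    // Read the value back to confirm what the session will use.
    println(sqlContext.getConf("spark.sql.parquet.followParquetFormatSpec"))

    sc.stop()
  }
}
```

Since the read and write paths are not yet fully updated, leaving the flag at its default is likely the safer choice until the follow-up PRs land.
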
[1]: https://github.com/apache/parquet-format/blob/ea095226597fdbecd60c2419d96b54b2fdb4ae6c/LogicalTypes.md
[2]: https://github.com/apache/parquet-mr/

Author: Cheng Lian <lian@databricks.com>

Closes #6617 from liancheng/spark-6777 and squashes the following commits:

2a2062d [Cheng Lian] Don't convert decimals without precision information
b60979b [Cheng Lian] Adds a constructor which accepts a Configuration, and fixes default value of assumeBinaryIsString
743730f [Cheng Lian] Decimal scale shouldn't be larger than precision
a104a9e [Cheng Lian] Fixes Scala style issue
1f71d8d [Cheng Lian] Adds feature flag to allow falling back to old style Parquet schema conversion
ba84f4b [Cheng Lian] Fixes MapType schema conversion bug
13cb8d5 [Cheng Lian] Fixes MiMa failure
81de5b0 [Cheng Lian] Fixes UDT, workaround read path, and add tests
28ef95b [Cheng Lian] More AnalysisExceptions
b10c322 [Cheng Lian] Replaces require() with analysisRequire() which throws AnalysisException
cceaf3f [Cheng Lian] Implements backwards compatibility rules in CatalystSchemaConverter
author: Cheng Lian <lian@databricks.com> 2015-06-24 15:03:43 -0700
committer: Cheng Lian <lian@databricks.com> 2015-06-24 15:03:43 -0700
commit: 8ab50765cd793169091d983b50d87a391f6ac1f4 (patch)
tree: 7aa7a58b10a2786b8ab0979bd4632e8ca64e78ee /project
parent: fb32c388985ce65c1083cb435cf1f7479fecbaac (diff)
download: spark-8ab50765cd793169091d983b50d87a391f6ac1f4.tar.gz
spark-8ab50765cd793169091d983b50d87a391f6ac1f4.tar.bz2
spark-8ab50765cd793169091d983b50d87a391f6ac1f4.zip
1 files changed, 6 insertions, 1 deletions
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index f678c69a6d..6f86a505b3 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -69,7 +69,12 @@ object MimaExcludes {
             ProblemFilters.exclude[MissingClassProblem](
               "org.apache.spark.sql.parquet.CatalystTimestampConverter"),
             ProblemFilters.exclude[MissingClassProblem](
-              "org.apache.spark.sql.parquet.CatalystTimestampConverter$")
+              "org.apache.spark.sql.parquet.CatalystTimestampConverter$"),
+            // SPARK-6777 Implements backwards compatibility rules in CatalystSchemaConverter
+            ProblemFilters.exclude[MissingClassProblem](
+              "org.apache.spark.sql.parquet.ParquetTypeInfo"),
+            ProblemFilters.exclude[MissingClassProblem](
+              "org.apache.spark.sql.parquet.ParquetTypeInfo$")
           )
         case v if v.startsWith("1.4") =>
           Seq(