author Cheng Lian <lian@databricks.com> 2015-06-24 15:03:43 -0700
committer Cheng Lian <lian@databricks.com> 2015-06-24 15:03:43 -0700
commit 8ab50765cd793169091d983b50d87a391f6ac1f4
tree 7aa7a58b10a2786b8ab0979bd4632e8ca64e78ee /project
parent fb32c388985ce65c1083cb435cf1f7479fecbaac
[SPARK-6777] [SQL] Implements backwards compatibility rules in CatalystSchemaConverter
This PR introduces `CatalystSchemaConverter` for converting Parquet schema to Spark SQL schema and vice versa. Original conversion code in `ParquetTypesConverter` is removed. Benefits of the new version are:

1. When converting Spark SQL schemas, it generates standard Parquet schemas conforming to [the most updated Parquet format spec] [1]. Converting to old style Parquet schemas is also supported via feature flag `spark.sql.parquet.followParquetFormatSpec` (which is set to `false` for now, and should be set to `true` after both read and write paths are fixed).

   Note that although this version of the Parquet format spec hasn't been officially released yet, Parquet MR 1.7.0 already sticks to it, so it should be safe to follow.

2. It implements the backwards-compatibility rules described in the most updated Parquet format spec, and can thus recognize more schema patterns generated by other/legacy systems/tools.

3. Code organization follows the convention used in [parquet-mr] [2], which is easier to follow. (Structure of `CatalystSchemaConverter` is similar to `AvroSchemaConverter`.)

To fully implement backwards-compatibility rules in both read and write paths, we also need to update `CatalystRowConverter` (which is responsible for converting Parquet records to `Row`s), `RowReadSupport`, and `RowWriteSupport`. These will be done in follow-up PRs.

TODO

- [x] More schema conversion test cases for legacy schema patterns.

[1]: https://github.com/apache/parquet-format/blob/ea095226597fdbecd60c2419d96b54b2fdb4ae6c/LogicalTypes.md
[2]: https://github.com/apache/parquet-mr/

Author: Cheng Lian <lian@databricks.com>

Closes #6617 from liancheng/spark-6777 and squashes the following commits:

2a2062d [Cheng Lian] Don't convert decimals without precision information
b60979b [Cheng Lian] Adds a constructor which accepts a Configuration, and fixes default value of assumeBinaryIsString
743730f [Cheng Lian] Decimal scale shouldn't be larger than precision
a104a9e [Cheng Lian] Fixes Scala style issue
1f71d8d [Cheng Lian] Adds feature flag to allow falling back to old style Parquet schema conversion
ba84f4b [Cheng Lian] Fixes MapType schema conversion bug
13cb8d5 [Cheng Lian] Fixes MiMa failure
81de5b0 [Cheng Lian] Fixes UDT, workaround read path, and add tests
28ef95b [Cheng Lian] More AnalysisExceptions
b10c322 [Cheng Lian] Replaces require() with analysisRequire() which throws AnalysisException
cceaf3f [Cheng Lian] Implements backwards compatibility rules in CatalystSchemaConverter
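
For readers unfamiliar with the backwards-compatibility rules mentioned in the commit message, the following is a rough sketch of the `LIST` rule as we understand it from the Parquet format spec: given the repeated field found inside a `LIST`-annotated group, decide whether that field is the element itself (legacy 2-level layout) or only a wrapper around the element (standard 3-level layout). The types `ParquetType`, `Primitive`, `Group` and the object `ListCompat` are hypothetical stand-ins made up for this illustration; they are neither the parquet-mr API nor the code added in this PR.

```scala
// Hypothetical, simplified stand-ins for Parquet schema nodes.
// These are NOT the parquet-mr classes; they only carry what the rule needs.
sealed trait ParquetType { def name: String }
case class Primitive(name: String, typeName: String) extends ParquetType
case class Group(name: String, fields: List[ParquetType]) extends ParquetType

object ListCompat {
  /**
   * Returns true when the repeated field inside a LIST-annotated group
   * should be treated as the element type itself, and false when it is
   * only the wrapper layer of the standard 3-level layout.
   *
   * `listName` is the name of the enclosing LIST-annotated group.
   */
  def isElementType(repeated: ParquetType, listName: String): Boolean =
    repeated match {
      // Rule 1: a repeated non-group field is the element.
      case _: Primitive => true
      // Rule 2: a repeated group with several fields is a struct element.
      case Group(_, fields) if fields.size > 1 => true
      // Rule 3: a single-field group named "array" or "<list>_tuple"
      // comes from legacy writers and is the element.
      case Group(n, List(_)) if n == "array" || n == listName + "_tuple" => true
      // Rule 4: anything else is the standard wrapper around the element.
      case _ => false
    }
}
```

With this model, a repeated group named `list` holding a single `element` field is treated as a wrapper, while a repeated group named `array` with one field is treated as the element type.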
Diffstat (limited to 'project')
-rw-r--r-- project/MimaExcludes.scala | 7
1 files changed, 6 insertions, 1 deletions
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index f678c69a6d..6f86a505b3 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -69,7 +69,12 @@ object MimaExcludes {
ProblemFilters.exclude[MissingClassProblem](
"org.apache.spark.sql.parquet.CatalystTimestampConverter"),
ProblemFilters.exclude[MissingClassProblem](
- "org.apache.spark.sql.parquet.CatalystTimestampConverter$")
+ "org.apache.spark.sql.parquet.CatalystTimestampConverter$"),
+ // SPARK-6777 Implements backwards compatibility rules in CatalystSchemaConverter
+ ProblemFilters.exclude[MissingClassProblem](
+ "org.apache.spark.sql.parquet.ParquetTypeInfo"),
+ ProblemFilters.exclude[MissingClassProblem](
+ "org.apache.spark.sql.parquet.ParquetTypeInfo$")
)
case v if v.startsWith("1.4") =>
Seq(