author | Damian Guy <damian.guy@gmail.com> | 2015-08-11 12:46:33 +0800 |
---|---|---|
committer | Cheng Lian <lian@databricks.com> | 2015-08-11 12:46:33 +0800 |
commit | 071bbad5db1096a548c886762b611a8484a52753 (patch) | |
tree | 5ef7be83e9fa717f01a04d9ccfdb5dfb5d9938c1 /.rat-excludes | |
parent | 3c9802d9400bea802984456683b2736a450ee17e (diff) | |
download | spark-071bbad5db1096a548c886762b611a8484a52753.tar.gz spark-071bbad5db1096a548c886762b611a8484a52753.tar.bz2 spark-071bbad5db1096a548c886762b611a8484a52753.zip |
[SPARK-9340] [SQL] Fixes converting unannotated Parquet lists
This PR is inspired by #8063, authored by dguy. In particular, the Parquet test files added here are all taken from that PR.
**The committer who merges this PR should attribute it to "Damian Guy <damian.guy@gmail.com>".**
----
SPARK-6776 and SPARK-6777 followed `parquet-avro` to implement the backwards-compatibility rules defined in the `parquet-format` spec. However, both Spark SQL and `parquet-avro` neglected the following statement in `parquet-format`:
> This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field.
One consequence is that Parquet files generated by `parquet-protobuf` that contain unannotated repeated fields are not correctly converted to Catalyst arrays.
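The quoted rule can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not the Spark implementation: the `Field` class, `is_unannotated_repeated`, `convert`, and the dictionary representation of Catalyst types are all hypothetical names introduced here. Conversion of `LIST`- and `MAP`-annotated groups is out of scope; the sketch only shows when a repeated field should become a required array of required elements.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, simplified model of a Parquet schema field."""
    name: str
    repetition: str                    # "required", "optional", or "repeated"
    type_name: str = "group"           # primitive name such as "int32", or "group"
    annotation: Optional[str] = None   # "LIST", "MAP", or None
    children: List["Field"] = field(default_factory=list)


def is_unannotated_repeated(f, parent_annotation=None):
    """True when the quoted parquet-format rule applies to field f."""
    annotated = f.annotation in ("LIST", "MAP")
    inside_container = parent_annotation in ("LIST", "MAP")
    return f.repetition == "repeated" and not annotated and not inside_container


def value_type(f):
    """Type of a single value of f, ignoring its repetition."""
    if f.type_name != "group":
        return f.type_name
    # Assumption: annotated groups are not handled by this sketch.
    return {"struct": {c.name: convert(c, f.annotation) for c in f.children}}


def convert(f, parent_annotation=None):
    """Map a field to a dictionary standing in for a Catalyst type."""
    if is_unannotated_repeated(f, parent_annotation):
        # Required list of required elements: neither the array nor its
        # elements may be null.
        return {"array": value_type(f), "contains_null": False, "nullable": False}
    return {"type": value_type(f), "nullable": f.repetition == "optional"}


# A protobuf-style schema fragment: `repeated int32 ids`.
ids = Field("ids", "repeated", "int32")
print(convert(ids))
```

A repeated field nested directly inside a `LIST`-annotated group is excluded by the predicate, because the existing backwards-compatibility rules already cover that case.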
This PR fixes this issue by
1. Handling unannotated repeated fields in `CatalystSchemaConverter`.
2. Converting this kind of special repeated fields to Catalyst arrays in `CatalystRowConverter`.
Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate the actual conversion work to a child `elementConverter` and accumulate elements in an `ArrayBuffer`.
Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater` so that they can be used to initialize new `ArrayBuffer`s for unannotated repeated fields and to propagate converted array values upstream.
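The interplay between `start()`, `end()`, and the element buffer can be sketched as follows. This is a minimal Python stand-in, not the Scala code from this PR: a Python list plays the role of the `ArrayBuffer`, and `RowSlotUpdater`, `RepeatedFieldUpdater`, and `read_record` are hypothetical names chosen for illustration. Only the `start()`/`end()`/`set()` hooks on `ParentContainerUpdater` come from the description above.

```python
class ParentContainerUpdater:
    """Base updater; all hooks are no-ops by default."""

    def start(self):
        pass

    def end(self):
        pass

    def set(self, value):
        pass


class RowSlotUpdater(ParentContainerUpdater):
    """Hypothetical updater that writes a value straight into a row slot."""

    def __init__(self, row, key):
        self.row = row
        self.key = key

    def set(self, value):
        self.row[self.key] = value


class RepeatedFieldUpdater(ParentContainerUpdater):
    """Hypothetical updater for an unannotated repeated field."""

    def __init__(self, parent):
        self.parent = parent
        self.buffer = []

    def start(self):
        # A fresh buffer for every record, so elements never leak across rows.
        self.buffer = []

    def set(self, value):
        # Called once per repeated element by the child element converter.
        self.buffer.append(value)

    def end(self):
        # Propagate the finished array upstream.
        self.parent.set(list(self.buffer))


def read_record(field_updaters, events):
    """Simulate one record: start all updaters, feed values, end all updaters."""
    for updater in field_updaters.values():
        updater.start()
    for name, value in events:
        field_updaters[name].set(value)
    for updater in field_updaters.values():
        updater.end()


row = {}
updaters = {
    "id": RowSlotUpdater(row, "id"),
    "tags": RepeatedFieldUpdater(RowSlotUpdater(row, "tags")),
}
read_record(updaters, [("id", 1), ("tags", "a"), ("tags", "b")])
print(row)
```

Because `start()` resets the buffer and `end()` always pushes it upstream, a record with no occurrences of the repeated field yields an empty array rather than a missing value, which matches the "required list" reading of the spec.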
Author: Cheng Lian <lian@databricks.com>
Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits:
ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite
f1c7bfd [Cheng Lian] Updates .rat-excludes
420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists
Diffstat (limited to '.rat-excludes')
-rw-r--r-- | .rat-excludes | 1 |
1 files changed, 1 insertions, 0 deletions
diff --git a/.rat-excludes b/.rat-excludes
index 7277146584..9165872b9f 100644
--- a/.rat-excludes
+++ b/.rat-excludes
@@ -94,3 +94,4 @@ INDEX
 gen-java.*
 .*avpr
 org.apache.spark.sql.sources.DataSourceRegister
+.*parquet