author | Damian Guy <damian.guy@gmail.com> | 2015-08-11 12:46:33 +0800 |
---|---|---|
committer | Cheng Lian <lian@databricks.com> | 2015-08-11 12:46:33 +0800 |
commit | 071bbad5db1096a548c886762b611a8484a52753 (patch) | |
tree | 5ef7be83e9fa717f01a04d9ccfdb5dfb5d9938c1 /sql/core/src/test/resources | |
parent | 3c9802d9400bea802984456683b2736a450ee17e (diff) | |
[SPARK-9340] [SQL] Fixes converting unannotated Parquet lists
This PR is inspired by #8063, authored by dguy. In particular, the Parquet test files added here are all taken from that PR.
**The committer who merges this PR should attribute it to "Damian Guy <damian.guy@gmail.com>".**
----
SPARK-6776 and SPARK-6777 followed `parquet-avro` to implement the backwards-compatibility rules defined in the `parquet-format` spec. However, both Spark SQL and `parquet-avro` neglected the following statement in `parquet-format`:
> This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field.
One consequence is that Parquet files generated by `parquet-protobuf` that contain unannotated repeated fields are not correctly converted to Catalyst arrays.
This PR fixes this issue by:
1. Handling unannotated repeated fields in `CatalystSchemaConverter`.
2. Converting these special repeated fields to Catalyst arrays in `CatalystRowConverter`.
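The schema side of the rule quoted above can be sketched with a toy model. This is not the code in `CatalystSchemaConverter`: every type and method name below (`ParquetField`, `CatalystType`, `UnannotatedListSketch`, `convertField`) is hypothetical, and the model has no `LIST` or `MAP` annotations at all, so every repeated field in it counts as unannotated.

```scala
// Toy model only: these are NOT Spark's or Parquet's real classes.
sealed trait Repetition
case object Required extends Repetition
case object Optional extends Repetition
case object Repeated extends Repetition

sealed trait ParquetField {
  def name: String
  def repetition: Repetition
}
case class Primitive(name: String, repetition: Repetition, typeName: String)
  extends ParquetField
case class Group(name: String, repetition: Repetition, fields: Seq[ParquetField])
  extends ParquetField

sealed trait CatalystType
case class Atomic(typeName: String) extends CatalystType
case class ArrayOf(element: CatalystType, containsNull: Boolean) extends CatalystType
case class StructOf(fields: Seq[(String, CatalystType, Boolean)]) extends CatalystType

object UnannotatedListSketch {
  // Type of a single value of the field, ignoring the field's repetition.
  private def elementType(field: ParquetField): CatalystType = field match {
    case Primitive(_, _, typeName) => Atomic(typeName)
    case Group(_, _, fields)       => StructOf(fields.map(convertField))
  }

  // Returns (name, type, nullable) for one field.
  def convertField(field: ParquetField): (String, CatalystType, Boolean) =
    field.repetition match {
      // Unannotated repeated field: a required list of required elements
      // whose element type is the type of the field itself.
      case Repeated =>
        (field.name, ArrayOf(elementType(field), containsNull = false), false)
      case Optional => (field.name, elementType(field), true)
      case Required => (field.name, elementType(field), false)
    }
}
```

In this model a field such as `Primitive("values", Repeated, "int32")` maps to a non-nullable array of non-null `int32` elements, and a repeated `Group` maps to a non-nullable array of structs.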
Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate the actual conversion work to a child `elementConverter` and accumulate elements in an `ArrayBuffer`.
Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater` so that they can be used to initialize new `ArrayBuffer`s for unannotated repeated fields and to propagate converted array values upstream.
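The interplay between the element converter, the `ArrayBuffer`, and the `start()`/`end()` hooks can be illustrated with a heavily simplified sketch. None of the names below are Spark's actual classes: `ParentUpdater`, `ElementConverter`, `RepeatedFieldSketch`, and `RepeatedFieldDemo` are hypothetical stand-ins. The sketch also assumes that the enclosing record converter calls `start()` before a record's values arrive and `end()` once they have all been read.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for an updater with start()/end() hooks.
trait ParentUpdater {
  def start(): Unit = ()
  def end(): Unit = ()
  def set(value: Any): Unit
}

// Hypothetical child converter: forwards each converted value to its updater.
class ElementConverter(updater: ParentUpdater) {
  def add(value: Any): Unit = updater.set(value)
}

// Sketch of a converter for one unannotated repeated field.
class RepeatedFieldSketch(parent: ParentUpdater) {
  private var buffer: ArrayBuffer[Any] = _

  val updater: ParentUpdater = new ParentUpdater {
    // A fresh buffer for every record, so values never leak between records.
    override def start(): Unit = { buffer = ArrayBuffer.empty[Any] }
    // Hand the accumulated elements to the parent as one array value.
    override def end(): Unit = { parent.set(buffer.toList) }
    // Each repeated value is appended to the current buffer.
    override def set(value: Any): Unit = { buffer += value }
  }

  val elementConverter = new ElementConverter(updater)
}

object RepeatedFieldDemo {
  // Simulates reading one record whose repeated field holds `values`.
  def readRecord(values: Seq[Int]): Any = {
    var result: Any = null
    val parent = new ParentUpdater {
      def set(value: Any): Unit = { result = value }
    }
    val repeated = new RepeatedFieldSketch(parent)
    repeated.updater.start()
    values.foreach(v => repeated.elementConverter.add(v))
    repeated.updater.end()
    result
  }
}
```

Because the buffer is created in `start()`, a record with no occurrences of the repeated field still produces an empty array rather than a stale or missing value.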
Author: Cheng Lian <lian@databricks.com>
Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits:
ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite
f1c7bfd [Cheng Lian] Updates .rat-excludes
420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists
Diffstat (limited to 'sql/core/src/test/resources')
-rw-r--r-- | sql/core/src/test/resources/nested-array-struct.parquet | bin | 0 -> 775 bytes |
-rw-r--r-- | sql/core/src/test/resources/old-repeated-int.parquet | bin | 0 -> 389 bytes |
-rw-r--r-- | sql/core/src/test/resources/old-repeated-message.parquet | bin | 0 -> 600 bytes |
-rw-r--r-- | sql/core/src/test/resources/old-repeated.parquet | bin | 0 -> 432 bytes |
-rw-r--r--[-rwxr-xr-x] | sql/core/src/test/resources/parquet-thrift-compat.snappy.parquet | bin | 10550 -> 10550 bytes |
-rw-r--r-- | sql/core/src/test/resources/proto-repeated-string.parquet | bin | 0 -> 411 bytes |
-rw-r--r-- | sql/core/src/test/resources/proto-repeated-struct.parquet | bin | 0 -> 608 bytes |
-rw-r--r-- | sql/core/src/test/resources/proto-struct-with-array-many.parquet | bin | 0 -> 802 bytes |
-rw-r--r-- | sql/core/src/test/resources/proto-struct-with-array.parquet | bin | 0 -> 1576 bytes |
9 files changed, 0 insertions, 0 deletions
diff --git a/sql/core/src/test/resources/nested-array-struct.parquet b/sql/core/src/test/resources/nested-array-struct.parquet
new file mode 100644
index 0000000000..41a43fa35d
--- /dev/null
+++ b/sql/core/src/test/resources/nested-array-struct.parquet
Binary files differ

diff --git a/sql/core/src/test/resources/old-repeated-int.parquet b/sql/core/src/test/resources/old-repeated-int.parquet
new file mode 100644
index 0000000000..520922f73e
--- /dev/null
+++ b/sql/core/src/test/resources/old-repeated-int.parquet
Binary files differ

diff --git a/sql/core/src/test/resources/old-repeated-message.parquet b/sql/core/src/test/resources/old-repeated-message.parquet
new file mode 100644
index 0000000000..548db99162
--- /dev/null
+++ b/sql/core/src/test/resources/old-repeated-message.parquet
Binary files differ

diff --git a/sql/core/src/test/resources/old-repeated.parquet b/sql/core/src/test/resources/old-repeated.parquet
new file mode 100644
index 0000000000..213f1a9029
--- /dev/null
+++ b/sql/core/src/test/resources/old-repeated.parquet
Binary files differ

diff --git a/sql/core/src/test/resources/parquet-thrift-compat.snappy.parquet b/sql/core/src/test/resources/parquet-thrift-compat.snappy.parquet
index 837e4876ee..837e4876ee 100755..100644
--- a/sql/core/src/test/resources/parquet-thrift-compat.snappy.parquet
+++ b/sql/core/src/test/resources/parquet-thrift-compat.snappy.parquet
Binary files differ

diff --git a/sql/core/src/test/resources/proto-repeated-string.parquet b/sql/core/src/test/resources/proto-repeated-string.parquet
new file mode 100644
index 0000000000..8a7eea601d
--- /dev/null
+++ b/sql/core/src/test/resources/proto-repeated-string.parquet
Binary files differ

diff --git a/sql/core/src/test/resources/proto-repeated-struct.parquet b/sql/core/src/test/resources/proto-repeated-struct.parquet
new file mode 100644
index 0000000000..c29eee35c3
--- /dev/null
+++ b/sql/core/src/test/resources/proto-repeated-struct.parquet
Binary files differ

diff --git a/sql/core/src/test/resources/proto-struct-with-array-many.parquet b/sql/core/src/test/resources/proto-struct-with-array-many.parquet
new file mode 100644
index 0000000000..ff9809675f
--- /dev/null
+++ b/sql/core/src/test/resources/proto-struct-with-array-many.parquet
Binary files differ

diff --git a/sql/core/src/test/resources/proto-struct-with-array.parquet b/sql/core/src/test/resources/proto-struct-with-array.parquet
new file mode 100644
index 0000000000..325a8370ad
--- /dev/null
+++ b/sql/core/src/test/resources/proto-struct-with-array.parquet
Binary files differ