path: root/.rat-excludes
author Damian Guy <damian.guy@gmail.com> 2015-08-11 12:46:33 +0800
committer Cheng Lian <lian@databricks.com> 2015-08-11 12:46:33 +0800
commit 071bbad5db1096a548c886762b611a8484a52753
tree 5ef7be83e9fa717f01a04d9ccfdb5dfb5d9938c1 /.rat-excludes
parent 3c9802d9400bea802984456683b2736a450ee17e
[SPARK-9340] [SQL] Fixes converting unannotated Parquet lists
This PR is inspired by #8063, authored by dguy. In particular, the Parquet test files added here are all taken from that PR.

**The committer who merges this PR should attribute it to "Damian Guy <damian.guy@gmail.com>".**

----

SPARK-6776 and SPARK-6777 followed `parquet-avro` to implement the backwards-compatibility rules defined in the `parquet-format` spec. However, both Spark SQL and `parquet-avro` neglected the following statement in `parquet-format`:

> This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field.

One consequence is that Parquet files generated by `parquet-protobuf` that contain unannotated repeated fields are not correctly converted to Catalyst arrays.

This PR fixes the issue by:

1. Handling unannotated repeated fields in `CatalystSchemaConverter`.
2. Converting these special repeated fields to Catalyst arrays in `CatalystRowConverter`.

   Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate the actual conversion work to a child `elementConverter` and accumulate elements in an `ArrayBuffer`.

   Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater` so that they can be used to initialize new `ArrayBuffer`s for unannotated repeated fields and to propagate converted array values upstream.

Author: Cheng Lian <lian@databricks.com>

Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits:

ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite
f1c7bfd [Cheng Lian] Updates .rat-excludes
420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists
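
The accumulate-and-propagate behavior that the commit message describes for unannotated repeated fields can be illustrated with a small sketch. This is not Spark's code: it is a Python analogue written for illustration only, and every class and function name below (`RowSlotUpdater`, `RepeatedFieldUpdater`, `convert_record`) is hypothetical. It assumes that `start()` is called once before a record's values arrive and `end()` once after them, which is how the message describes the two new hooks being used.

```python
from typing import Any, Iterable, List


class ParentContainerUpdater:
    """Hypothetical stand-in: receives converted values plus start/end hooks."""

    def start(self) -> None:
        pass

    def end(self) -> None:
        pass

    def set(self, value: Any) -> None:
        raise NotImplementedError


class RowSlotUpdater(ParentContainerUpdater):
    """Writes a finished value into one slot of a row (a plain list here)."""

    def __init__(self, row: List[Any], index: int) -> None:
        self.row = row
        self.index = index

    def set(self, value: Any) -> None:
        self.row[self.index] = value


class RepeatedFieldUpdater(ParentContainerUpdater):
    """Collects the values of one unannotated repeated field for one record."""

    def __init__(self, parent: ParentContainerUpdater) -> None:
        self.parent = parent
        self.buffer: List[Any] = []

    def start(self) -> None:
        # New record: use a fresh buffer so arrays are never shared across rows.
        self.buffer = []

    def set(self, value: Any) -> None:
        # Each occurrence of the repeated field appends one element.
        self.buffer.append(value)

    def end(self) -> None:
        # Record finished: propagate the accumulated array to the parent.
        self.parent.set(self.buffer)


def convert_record(values: Iterable[Any], updater: ParentContainerUpdater) -> None:
    """Simulates a reader feeding one record's repeated values to the updater."""
    updater.start()
    for value in values:
        updater.set(value)
    updater.end()
```

With `row = [None]` and `updater = RepeatedFieldUpdater(RowSlotUpdater(row, 0))`, calling `convert_record([1, 2, 3], updater)` leaves `row` as `[[1, 2, 3]]`. A record with no occurrences of the field yields an empty list rather than a null, which matches the "required list of required elements" reading quoted from `parquet-format`.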
Diffstat (limited to '.rat-excludes')
-rw-r--r-- .rat-excludes 1
1 files changed, 1 insertions, 0 deletions
diff --git a/.rat-excludes b/.rat-excludes
index 7277146584..9165872b9f 100644
--- a/.rat-excludes
+++ b/.rat-excludes
@@ -94,3 +94,4 @@ INDEX
gen-java.*
.*avpr
org.apache.spark.sql.sources.DataSourceRegister
+.*parquet
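
To show what the added `.*parquet` line is meant to cover, here is a minimal sketch that treats each exclude entry as a regular expression. It assumes the license checker matches each pattern against the whole file name; the exact matching rules of the tool (full versus partial match, name versus path) are not stated in this commit, so that is an assumption. The helper `is_excluded` and the sample file names are hypothetical.

```python
import re

# Patterns copied from the hunk above; the last entry is the line this commit adds.
EXCLUDES = [
    "gen-java.*",
    ".*avpr",
    "org.apache.spark.sql.sources.DataSourceRegister",
    ".*parquet",
]


def is_excluded(file_name: str, patterns=EXCLUDES) -> bool:
    """Return True if any pattern matches the whole file name (assumed semantics)."""
    return any(re.fullmatch(pattern, file_name) for pattern in patterns)
```

Under this assumption, `is_excluded("example.parquet")` is true while `is_excluded("Example.scala")` is false, so the binary Parquet test files would be skipped by the license header check.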