[SPARK-6538][SQL] Add missing nullable Metastore fields when merging a Parquet schema - spark

author	Adam Budde <budde@amazon.com>	2015-03-28 09:14:09 +0800
committer	Cheng Lian <lian@databricks.com>	2015-03-28 09:14:09 +0800
commit	5909f0973de15f685836c2828e6d4c38f57d2c19 (patch)
tree	665f653c6811b1e5ce0064a0e28d728575061abc /sbin
parent	3af7334304341fba091aa39ce2efbdfd167c697b (diff)

[SPARK-6538][SQL] Add missing nullable Metastore fields when merging a Parquet schema

Opening to replace #5188.

When Spark SQL infers a schema for a DataFrame, it will take the union of all field types present in the structured source data (e.g. an RDD of JSON data). When the source data for a row doesn't define a particular field on the DataFrame's schema, a null value will simply be assumed for this field. This workflow makes it very easy to construct tables and query over a set of structured data with a nonuniform schema. However, this behavior is not consistent in some cases when dealing with Parquet files and an external table managed by an external Hive metastore.

In our particular use case, we use Spark Streaming to parse and transform our input data and then apply a window function to save an arbitrary-sized batch of data as a Parquet file, which itself will be added as a partition to an external Hive table via an *"ALTER TABLE... ADD PARTITION..."* statement. Since our input data is nonuniform, it is expected that not every partition batch will contain every field present in the table's schema obtained from the Hive metastore. As such, we expect that the schema of some of our Parquet files may not contain the same set of fields present in the full metastore schema.

In such cases, it seems natural that Spark SQL would simply assume null values for any missing fields in the partition's Parquet file, assuming these fields are specified as nullable by the metastore schema. This is not the case in the current implementation of ParquetRelation2. The **mergeMetastoreParquetSchema()** method used to reconcile differences between a Parquet file's schema and a schema retrieved from the Hive metastore will raise an exception if the Parquet file doesn't match the same set of fields specified by the metastore.

This pull request alters the behavior of **mergeMetastoreParquetSchema()** by having it first add any nullable fields from the metastore schema to the Parquet file schema if they aren't already present there.

Author: Adam Budde <budde@amazon.com>

Closes #5214 from budde/nullable-fields and squashes the following commits:

a52d378 [Adam Budde] Refactor ParquetSchemaSuite.scala for cases now permitted by SPARK-6471 and SPARK-6538
9041bfa [Adam Budde] Add missing nullable Metastore fields when merging a Parquet schema
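
The behavior change described above can be illustrated with a minimal, self-contained sketch. This is not the Spark implementation: it is plain Python, and `Field`, `add_missing_nullable_fields` and `merge_schemas` are hypothetical names chosen for illustration. It also assumes exact field-name matching and models a schema as a simple list of fields. The point is only to show the order of operations: pad the Parquet schema with nullable metastore fields first, then require that both schemas cover the same set of fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (not a Spark class)."""
    name: str
    data_type: str
    nullable: bool = True


def add_missing_nullable_fields(metastore: List[Field], parquet: List[Field]) -> List[Field]:
    """Append every nullable metastore field that the Parquet schema lacks."""
    present = {f.name for f in parquet}
    missing = [f for f in metastore if f.nullable and f.name not in present]
    return list(parquet) + missing


def merge_schemas(metastore: List[Field], parquet: List[Field]) -> List[Field]:
    """Pad the Parquet schema, then require both schemas to cover the same fields."""
    padded = add_missing_nullable_fields(metastore, parquet)
    by_name = {f.name: f for f in padded}
    if set(by_name) != {f.name for f in metastore}:
        # Still reached when a non-nullable field is missing from the file,
        # or when the file has a field the metastore does not know about.
        raise ValueError("Parquet schema does not match metastore schema")
    # Return the fields in metastore order.
    return [by_name[f.name] for f in metastore]


# Usage: a partition file that never saw the nullable "email" field.
metastore_schema = [
    Field("id", "int", nullable=False),
    Field("name", "string"),
    Field("email", "string"),
]
parquet_schema = [Field("id", "int", nullable=False), Field("name", "string")]

merged = merge_schemas(metastore_schema, parquet_schema)
print([f.name for f in merged])  # ['id', 'name', 'email']
```

In this sketch a missing non-nullable field still raises an error, which mirrors the condition stated in the description that only fields marked nullable by the metastore schema are filled in.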

