author    Marcelo Vanzin <vanzin@cloudera.com>  2016-07-20 13:00:22 +0800
committer Cheng Lian <lian@databricks.com>      2016-07-20 13:00:22 +0800
commit    75146be6ba5e9f559f5f15430310bb476ee0812c
tree      727adf7fb3d8edeaffb41839678a09771d36b399
parent    fc23263623d5dcd1167fa93c094fe41ace77c326
[SPARK-16632][SQL] Respect Hive schema when merging parquet schema.
When Hive (or at least certain versions of Hive) creates parquet files
containing tinyint or smallint columns, it stores them as int32, but
doesn't annotate the parquet field as containing the corresponding
int8 / int16 data. When Spark reads those files using the vectorized
reader, it follows the parquet schema for these fields, but when
actually reading the data it tries to use the type fetched from
the metastore, and then fails because the data has been loaded into the
wrong fields in OnHeapColumnVector.
So instead of blindly trusting the parquet schema, check whether the
Catalyst-provided schema disagrees with it, and adjust the types so
that the necessary metadata is present when loading the data into
the ColumnVector instance.
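The reconciliation idea can be sketched as follows. This is a simplified, hypothetical illustration of the type-adjustment logic described above, not Spark's actual implementation; the type names and the `reconcile` helper are invented for the example. When the parquet file reports a plain int32 but the Catalyst schema (from the metastore) says byte or short, the Catalyst type wins so the column vector is populated with the right width:

```scala
// Simplified stand-ins for Catalyst data types (hypothetical, for illustration).
sealed trait CatalystType
case object ByteType  extends CatalystType
case object ShortType extends CatalystType
case object IntType   extends CatalystType

// Parquet physical type as read from the file footer.
sealed trait ParquetType
case object ParquetInt32 extends ParquetType

object SchemaReconciler {
  // Choose the type to use when loading data into the ColumnVector.
  // Hive may write tinyint/smallint as un-annotated int32; in that case,
  // trust the Catalyst-provided schema instead of the parquet schema.
  def reconcile(parquet: ParquetType, catalyst: CatalystType): CatalystType =
    (parquet, catalyst) match {
      case (ParquetInt32, ByteType)  => ByteType   // Hive tinyint stored as int32
      case (ParquetInt32, ShortType) => ShortType  // Hive smallint stored as int32
      case (_, other)                => other      // schemas already agree
    }
}
```

With this adjustment, a column Hive declared as `tinyint` is read as bytes even though the parquet footer only says int32, avoiding the mismatch in OnHeapColumnVector.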
Tested with unit tests and with tests that create byte / short columns
in Hive and try to read them from Spark.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #14272 from vanzin/SPARK-16632.