author Marcelo Vanzin <vanzin@cloudera.com> 2016-07-20 13:00:22 +0800
committer Cheng Lian <lian@databricks.com> 2016-07-20 13:00:22 +0800
commit 75146be6ba5e9f559f5f15430310bb476ee0812c
tree 727adf7fb3d8edeaffb41839678a09771d36b399 /mllib/src
parent fc23263623d5dcd1167fa93c094fe41ace77c326
[SPARK-16632][SQL] Respect Hive schema when merging parquet schema.

When Hive (or at least certain versions of Hive) creates parquet files containing tinyint or smallint columns, it stores them as int32, but doesn't annotate the parquet field as containing the corresponding int8 / int16 data. When Spark reads those files using the vectorized reader, it follows the parquet schema for these fields, but when actually reading the data it tries to use the type fetched from the metastore, and then fails because data has been loaded into the wrong fields in OnHeapColumnVector.

So instead of blindly trusting the parquet schema, check whether the Catalyst-provided schema disagrees with it, and adjust the types so that the necessary metadata is present when loading the data into the ColumnVector instance.

Tested with unit tests and with tests that create byte / short columns in Hive and try to read them from Spark.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14272 from vanzin/SPARK-16632.
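
The reconciliation described above can be illustrated with a small standalone sketch. This is not the code from the patch and does not use Spark's or Parquet's APIs: `ParquetField`, `NARROW_INT_ANNOTATIONS`, and `reconcile` are hypothetical names, and schemas are modelled as plain Python values. The sketch assumes the only mismatch of interest is an unannotated int32 field whose Catalyst type is `ByteType` or `ShortType`.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


# Hypothetical stand-in for a Parquet primitive field; not a real Parquet class.
@dataclass(frozen=True)
class ParquetField:
    name: str
    physical_type: str                # e.g. "int32"
    annotation: Optional[str] = None  # e.g. "INT_8", "INT_16", or None


# Catalyst type name -> annotation an int32 Parquet field should carry.
NARROW_INT_ANNOTATIONS: Dict[str, str] = {
    "ByteType": "INT_8",
    "ShortType": "INT_16",
}


def reconcile(parquet_fields: List[ParquetField],
              catalyst_schema: Dict[str, str]) -> List[ParquetField]:
    """Add the missing int8/int16 annotation where the Catalyst schema
    disagrees with an unannotated int32 Parquet field."""
    adjusted = []
    for field in parquet_fields:
        wanted = NARROW_INT_ANNOTATIONS.get(catalyst_schema.get(field.name))
        if (field.physical_type == "int32"
                and field.annotation is None
                and wanted is not None):
            # Trust the Catalyst (metastore) type over the bare Parquet type.
            field = replace(field, annotation=wanted)
        adjusted.append(field)
    return adjusted


# Fields as Hive might write them: tinyint/smallint stored as bare int32.
hive_written = [
    ParquetField("b", "int32"),
    ParquetField("s", "int32"),
    ParquetField("i", "int32"),
]
metastore = {"b": "ByteType", "s": "ShortType", "i": "IntegerType"}
merged = reconcile(hive_written, metastore)
```

In this sketch only `b` and `s` gain an annotation; `i` is left untouched because the two schemas already agree on a plain 32-bit integer.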
Diffstat (limited to 'mllib/src')
0 files changed, 0 insertions, 0 deletions