[SPARK-16847][SQL] Prevent to potentially read corrupt statstics on binary in Parquet vectorized reader - spark

diff options

author	hyukjinkwon <gurwls223@gmail.com>	2016-08-06 04:40:24 +0100
committer	Sean Owen <sowen@cloudera.com>	2016-08-06 04:40:24 +0100
commit	55d6dad6f21dd4d50d168f6392242aa8e24b774a (patch)
tree	37d9c7bf55fc57c74b19dc7010e58d2290407987 /python/docs
parent	e679bc3c1cd418ef0025d2ecbc547c9660cac433 (diff)
download	spark-55d6dad6f21dd4d50d168f6392242aa8e24b774a.tar.gz spark-55d6dad6f21dd4d50d168f6392242aa8e24b774a.tar.bz2 spark-55d6dad6f21dd4d50d168f6392242aa8e24b774a.zip

[SPARK-16847][SQL] Prevent to potentially read corrupt statstics on binary in Parquet vectorized reader

## What changes were proposed in this pull request? This problem was found in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251) and we disabled filter pushdown on binary columns in Spark before. We enabled this after upgrading Parquet but it seems there is potential incompatibility for Parquet files written in lower Spark versions. Currently, this does not happen in normal Parquet reader. However, In Spark, we implemented a vectorized reader, separately with Parquet's standard API. For normal Parquet reader this is being handled but not in the vectorized reader. It is okay to just pass `FileMetaData`. This is being handled in parquet-mr (See https://github.com/apache/parquet-mr/commit/e3b95020f777eb5e0651977f654c1662e3ea1f29). This will prevent loading corrupt statistics in each page in Parquet. This PR replaces the deprecated usage of constructor. ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes #14450 from HyukjinKwon/SPARK-16847.

Diffstat (limited to 'python/docs')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: