[SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error - spark


author	Sameer Agarwal <sameerag@cs.berkeley.edu>	2016-09-02 15:16:16 -0700
committer	Davies Liu <davies.liu@gmail.com>	2016-09-02 15:16:16 -0700
commit	a2c9acb0e54b2e38cb8ee6431f1ea0e0b4cd959a (patch)
tree	2f7b50a97fd5aa7143576b679745b796687af30c /python
parent	ed9c884dcf925500ceb388b06b33bd2c95cd2ada (diff)

[SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error

## What changes were proposed in this pull request?

This patch fixes a bug in the vectorized parquet reader that's caused by re-using the same dictionary column vector while reading consecutive row groups. Specifically, this issue manifests for a certain distribution of dictionary/plain encoded data while we read/populate the underlying bit packed dictionary data into a column-vector based data structure.

## How was this patch tested?

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14941 from sameeragarwal/parquet-exception-2.
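
The general hazard described in the commit message, state left over in a reused column vector when consecutive row groups use different encodings, can be illustrated with a toy model. This is only a sketch and not the actual patch: `ToyColumnVector`, `read_row_group`, and the dictionary-based row-group layout are hypothetical names invented for illustration, and the real reader's data structures are more involved.

```python
class ToyColumnVector:
    """Hypothetical stand-in for a reusable column vector (not Spark's class)."""

    def __init__(self, capacity):
        self.values = [None] * capacity
        self.dictionary = None       # set while dictionary-encoded data is loaded
        self.dictionary_ids = None   # lazily allocated buffer of dictionary indices

    def reserve_dictionary_ids(self, capacity):
        # Allocate (or grow) the id buffer so it can hold `capacity` entries.
        if self.dictionary_ids is None or len(self.dictionary_ids) < capacity:
            self.dictionary_ids = [0] * capacity
        return self.dictionary_ids

    def reset(self):
        # Drop the dictionary reference so plain-encoded data is not
        # decoded through a dictionary left over from an earlier row group.
        self.dictionary = None

    def get(self, row):
        if self.dictionary is not None:
            return self.dictionary[self.dictionary_ids[row]]
        return self.values[row]


def read_row_group(vector, group, reset_first):
    """Load one toy row group into `vector` and return the decoded rows.

    `group` is either {"dictionary": [...], "ids": [...]} or {"plain": [...]}.
    """
    if reset_first:
        vector.reset()
    if "dictionary" in group:
        n = len(group["ids"])
        ids = vector.reserve_dictionary_ids(n)
        ids[:n] = group["ids"]
        vector.dictionary = group["dictionary"]
    else:
        n = len(group["plain"])
        vector.values[:n] = group["plain"]
    return [vector.get(i) for i in range(n)]


dict_group = {"dictionary": ["a", "b"], "ids": [0, 1, 1]}
plain_group = {"plain": ["x", "y", "z"]}

# Reusing the vector without resetting: the second (plain) row group is
# still decoded through the first row group's dictionary.
stale = ToyColumnVector(4)
read_row_group(stale, dict_group, reset_first=False)
stale_result = read_row_group(stale, plain_group, reset_first=False)

# Resetting before each row group gives the expected values.
clean = ToyColumnVector(4)
read_row_group(clean, dict_group, reset_first=True)
clean_result = read_row_group(clean, plain_group, reset_first=True)
```

In this toy model the stale case silently returns wrong values; in a real reader the same kind of leftover state could just as well surface as an exception, for example when old indices point outside a smaller dictionary.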

Diffstat (limited to 'python')

0 files changed, 0 insertions, 0 deletions

