[SPARK-14217] [SQL] Fix bug if parquet data has columns that use dictionary encoding for some of the data - spark

author	Nong Li <nong@databricks.com>	2016-04-09 17:45:10 -0700
committer	Davies Liu <davies.liu@gmail.com>	2016-04-09 17:45:10 -0700
commit	5989c85b535f7f623392d6456d8b37052487f24b (patch)
tree	48fa983954f6e631b06954ec4177c45c7fcd84a6 /python
parent	5cb5edaf9c5054e42d41f20b2dd92dafcccbf0d6 (diff)

[SPARK-14217] [SQL] Fix bug if parquet data has columns that use dictionary encoding for some of the data

## What changes were proposed in this pull request?

This PR is based on #12017.

Currently, a Parquet column that uses dictionary encoding for only some of its data produces batches in which some values are dictionary encoded and some are not. The non-dictionary-encoded values cause us to remove the dictionary from the batch, so the first values in the batch return garbage.

This patch fixes the issue by first decoding the values that are already dictionary encoded before switching to the non-dictionary encoding. A similar thing is done for the reverse case, where the initial values are not dictionary encoded.

## How was this patch tested?

This is difficult to test, but the bug was replicated on a test cluster using a large TPC-DS data set.

Author: Nong Li <nong@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #12279 from davies/fix_dict.
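
The fix described above can be illustrated with a small Python sketch. This is not the Spark reader code. `ColumnBatch` and its methods are hypothetical names chosen for illustration, and the sketch assumes a column chunk has a single dictionary. It only shows the idea of decoding the values already in the batch before the encoding changes, so that earlier rows stay readable.

```python
class ColumnBatch:
    """Toy column batch (hypothetical, not Spark's API).

    A batch is in one of two states: it holds dictionary ids plus a
    dictionary (lazy decoding), or it holds fully decoded values.
    """

    def __init__(self):
        self.values = []        # decoded values
        self.dict_ids = []      # dictionary ids that are not decoded yet
        self.dictionary = None  # dictionary in use, if any

    def append_dictionary_page(self, dictionary, ids):
        if self.values:
            # Reverse case: earlier values are plain, so decode this page
            # eagerly instead of attaching a dictionary to the batch.
            self.values.extend(dictionary[i] for i in ids)
            return
        # Assumption: every dictionary page in the chunk shares one dictionary.
        self.dictionary = dictionary
        self.dict_ids.extend(ids)

    def append_plain_page(self, page_values):
        if self.dictionary is not None:
            # Decode the ids already in the batch BEFORE dropping the
            # dictionary; skipping this step is what left garbage behind.
            self.values.extend(self.dictionary[i] for i in self.dict_ids)
            self.dict_ids.clear()
            self.dictionary = None
        self.values.extend(page_values)

    def get(self, row):
        if self.dictionary is not None:
            return self.dictionary[self.dict_ids[row]]
        return self.values[row]
```

For example, appending a dictionary page with ids `[1, 0, 1]` over the dictionary `["a", "b"]` and then a plain page `["c"]` leaves all four rows readable through `get`, because the first three ids are decoded before the dictionary is dropped.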

Diffstat (limited to 'python')

0 files changed, 0 insertions, 0 deletions

