[SPARK-18352][SQL] Support parsing multiline json files - spark


author	Nathan Howell <nhowell@godaddy.com>	2017-02-16 20:51:19 -0800
committer	Wenchen Fan <wenchen@databricks.com>	2017-02-16 20:51:19 -0800
commit	21fde57f15db974b710e7b00e72c744da7c1ac3c (patch)
tree	e51d0ab5ad405ff66c6459738186406a597a8f1c /README.md
parent	dcc2d540a53f0bd04baead43fdee1c170ef2b9f3 (diff)

[SPARK-18352][SQL] Support parsing multiline json files

## What changes were proposed in this pull request?

If a new option `wholeFile` is set to `true`, the JSON reader will parse each file (instead of a single line) as a value. This is done with Jackson streaming and it should be capable of parsing very large documents, assuming the row will fit in memory.

Because the file is not buffered in memory, the corrupt record handling is also slightly different when `wholeFile` is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column and byte offsets) to the output if desired.

These changes have allowed types other than `String` to be parsed. Support for `UTF8String` and `Text` has been added (alongside `String` and `InputFormat`), and these types no longer require a conversion to `String` just for parsing.

I've also included a few other changes that generate slightly better bytecode and (imo) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits; let me know if they should be flattened into this PR or moved to a new one.

## How was this patch tested?

New and existing unit tests. No performance or load tests have been run.

Author: Nathan Howell <nhowell@godaddy.com>

Closes #16386 from NathanHowell/SPARK-18352.
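
To make the difference between the two modes concrete, here is a minimal sketch in plain Python. It is not Spark code and not the implementation in this patch: the function names and the `_corrupt_record` key are hypothetical, chosen only for illustration. Unlike the Jackson streaming approach described above, `json.load` reads the whole file into memory. The sketch shows only how the unit of parsing changes (line versus file) and why the corrupt record holds the literal text in one mode and the filename in the other.

```python
import json

# Hypothetical name for the corrupt-record column used in this sketch.
CORRUPT = "_corrupt_record"


def parse_lines(path):
    """Line-delimited mode: every non-empty line is parsed as one JSON value."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = line.strip()
            if not text:
                continue
            try:
                rows.append(json.loads(text))
            except json.JSONDecodeError:
                # The offending line is already in hand, so keep it verbatim.
                rows.append({CORRUPT: text})
    return rows


def parse_whole_file(path):
    """Whole-file mode: the entire file is parsed as a single JSON value."""
    try:
        with open(path, encoding="utf-8") as f:
            return [json.load(f)]
    except json.JSONDecodeError:
        # Record only the file name, mirroring the behaviour described above.
        return [{CORRUPT: path}]
```

For a pretty-printed document spread over several lines, `parse_lines` yields one corrupt record per line, while `parse_whole_file` yields a single row containing the parsed object.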

Diffstat (limited to 'README.md')

0 files changed, 0 insertions, 0 deletions

