[SPARK-14480][SQL] Remove meaningless StringIteratorReader for CSV data source. - spark

author	hyukjinkwon <gurwls223@gmail.com>	2016-06-29 11:42:51 -0700
committer	Reynold Xin <rxin@databricks.com>	2016-06-29 11:42:51 -0700
commit	cb1b9d34f37a5574de43f61e7036c4b8b81defbf (patch)
tree	4729d676c34ba492f804e1b79e44d132a66f60d3 /docs/img/structured-streaming-late-data.png
parent	39f2eb1da34f26bf68c535c8e6b796d71a37a651 (diff)
download	spark-cb1b9d34f37a5574de43f61e7036c4b8b81defbf.tar.gz spark-cb1b9d34f37a5574de43f61e7036c4b8b81defbf.tar.bz2 spark-cb1b9d34f37a5574de43f61e7036c4b8b81defbf.zip

[SPARK-14480][SQL] Remove meaningless StringIteratorReader for CSV data source.

## What changes were proposed in this pull request?

This PR removes the meaningless `StringIteratorReader` for the CSV data source.

In `CSVParser.scala`, there is a `Reader` wrapping an `Iterator`, but this causes two problems.

Firstly, it was not actually faster than processing line by line with the `Iterator`, because of the additional logic needed to wrap the `Iterator` in a `Reader`.
Secondly, it added some complexity, because it needs additional logic to allow every line to be read byte by byte. As a result, it was quite difficult to track down parsing issues (e.g. SPARK-14103).

A benchmark was performed manually, with the results below:

- Original code with `Reader` wrapping `Iterator`

  | End-to-end (ns) | Parse Time (ns) |
  |-----------------|-----------------|
  | 14116265034     | 2008277960      |

- New code with `Iterator`

  | End-to-end (ns) | Parse Time (ns) |
  |-----------------|-----------------|
  | 13451699644     | 1549050564      |

For details of the environment, dataset and methods, please refer to the JIRA ticket.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13808 from HyukjinKwon/SPARK-14480-small.
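
To make the two approaches in the description concrete, here is a minimal sketch in Scala. It is not the removed Spark code: `LineIteratorReader`, `readAll` and `splitLine` are hypothetical names invented for illustration, and `splitLine` is a naive placeholder that ignores quoting and escaping. The sketch only shows why wrapping an `Iterator[String]` in a `java.io.Reader` needs extra bookkeeping (a current line, a position within it, and re-inserted line terminators) that per-line parsing avoids.

```scala
import java.io.Reader
import java.util.regex.Pattern

// Hypothetical, simplified stand-in for a Reader that wraps an iterator of lines.
// It must track the current line and a position inside it, and it re-appends
// "\n" to every line so a downstream parser can see record boundaries again.
class LineIteratorReader(lines: Iterator[String]) extends Reader {
  private var current: String = ""
  private var pos: Int = 0

  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    if (len == 0) return 0
    // Refill from the iterator until there are unread characters, or stop at the end.
    while (pos >= current.length) {
      if (!lines.hasNext) return -1
      current = lines.next() + "\n"
      pos = 0
    }
    val n = math.min(len, current.length - pos)
    current.getChars(pos, pos + n, cbuf, off)
    pos += n
    n
  }

  override def close(): Unit = ()
}

object CsvLineSketch {
  // Drains a Reader through a deliberately small buffer to exercise partial reads.
  def readAll(reader: Reader): String = {
    val sb = new StringBuilder
    val buf = new Array[Char](4)
    var n = reader.read(buf, 0, buf.length)
    while (n != -1) {
      sb.append(new String(buf, 0, n))
      n = reader.read(buf, 0, buf.length)
    }
    sb.toString
  }

  // Line-by-line alternative: each line is parsed on its own, with no shared
  // reader state. Naive split only; a real CSV parser handles quotes and escapes.
  def splitLine(line: String, delimiter: Char = ','): Array[String] =
    line.split(Pattern.quote(delimiter.toString), -1)
}
```

With the iterator approach, a caller can simply write `lines.map(line => CsvLineSketch.splitLine(line))`, so a malformed line is isolated to a single call, which is in the spirit of the debugging benefit the description mentions.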

Diffstat (limited to 'docs/img/structured-streaming-late-data.png')

0 files changed, 0 insertions, 0 deletions