[SPARK-18943][SQL] Avoid per-record type dispatch in CSV when reading - spark

author	hyukjinkwon <gurwls223@gmail.com>	2016-12-24 23:28:34 +0800
committer	Wenchen Fan <wenchen@databricks.com>	2016-12-24 23:28:34 +0800
commit	d6cbec7598b7aea33f588849e6e2e324b8820340 (patch)
tree	0c9478e2f061175ced461015080f2a9d8ab0d149 /core/src
parent	f2ceb2abe9357942a51bd643683850efd1fc9df7 (diff)

[SPARK-18943][SQL] Avoid per-record type dispatch in CSV when reading

## What changes were proposed in this pull request?

`CSVRelation.csvParser` performs type dispatch for each value in each row. This can be avoided because the schema is already kept in `CSVRelation`.

This PR therefore proposes creating the converters first, according to the schema, and then applying them to each value.

I ran a small benchmark, shown below, that mirrors the logic in https://github.com/apache/spark/blob/7c33b0fd050f3d2b08c1cfd7efbff8166832c1af/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L170-L178 to test the updated logic.

```scala
test("Benchmark for CSV converter") {
  var numMalformedRecords = 0
  val N = 500 << 12
  val schema = StructType(
    StructField("a", StringType) ::
    StructField("b", StringType) ::
    StructField("c", StringType) ::
    StructField("d", StringType) :: Nil)
  val row = Array("1.0", "test", "2015-08-20 14:57:00", "FALSE")
  val data = spark.sparkContext.parallelize(List.fill(N)(row))
  val parser = CSVRelation.csvParser(schema, schema.fieldNames, CSVOptions())

  val benchmark = new Benchmark("CSV converter", N)
  benchmark.addCase("cast CSV string tokens", 10) { _ =>
    data.flatMap { recordTokens =>
      parser(recordTokens, numMalformedRecords)
    }.collect()
  }
  benchmark.run()
}
```

**Before**

```
CSV converter:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
cast CSV string tokens                        1061 / 1130          1.9         517.9       1.0X
```

**After**

```
CSV converter:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
cast CSV string tokens                         940 / 1011          2.2         459.2       1.0X
```

## How was this patch tested?

Tests in `CSVTypeCastSuite` and `CSVRelation`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16351 from HyukjinKwon/type-dispatch.
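
For readers unfamiliar with the pattern, the difference between per-record type dispatch and converters built up front can be sketched roughly as follows. This is a minimal, Spark-free illustration and not the code in this patch: `ConverterSketch`, `FieldType` and its cases, `castPerValue`, `makeConverter`, and `buildParser` are all hypothetical names introduced here, and the real change works against Spark's `StructType` and `CSVOptions`.

```scala
// Hypothetical, simplified stand-ins for Spark's data types; not the Spark API.
object ConverterSketch {
  sealed trait FieldType
  case object IntField extends FieldType
  case object DoubleField extends FieldType
  case object BoolField extends FieldType
  case object TextField extends FieldType

  // Per-record dispatch: the pattern match runs for every value of every row.
  def castPerValue(token: String, fieldType: FieldType): Any = fieldType match {
    case IntField    => token.toInt
    case DoubleField => token.toDouble
    case BoolField   => token.toBoolean
    case TextField   => token
  }

  def parseWithDispatch(schema: Seq[FieldType])(tokens: Array[String]): Seq[Any] =
    tokens.toSeq.zip(schema).map { case (token, fieldType) => castPerValue(token, fieldType) }

  // Up-front dispatch: the match runs once per field, when the parser is built.
  def makeConverter(fieldType: FieldType): String => Any = fieldType match {
    case IntField    => (token: String) => token.toInt
    case DoubleField => (token: String) => token.toDouble
    case BoolField   => (token: String) => token.toBoolean
    case TextField   => (token: String) => token
  }

  def buildParser(schema: Seq[FieldType]): Array[String] => Seq[Any] = {
    // One converter per column, resolved from the schema a single time.
    val converters = schema.map(makeConverter).toArray
    // Each row only applies the already-selected functions.
    tokens => tokens.indices.map(i => converters(i)(tokens(i)))
  }
}
```

Under these assumptions, `ConverterSketch.buildParser(schema)` is called once per schema and the returned function is reused for every row, so both variants produce the same values while the second avoids repeating the type match per value.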

Diffstat (limited to 'core/src')

0 files changed, 0 insertions, 0 deletions

