field | value | date |
---|---|---|
author | hyukjinkwon <gurwls223@gmail.com> | 2016-12-24 23:28:34 +0800 |
committer | Wenchen Fan <wenchen@databricks.com> | 2016-12-24 23:28:34 +0800 |
commit | d6cbec7598b7aea33f588849e6e2e324b8820340 (patch) | |
tree | 0c9478e2f061175ced461015080f2a9d8ab0d149 /sbin/start-slave.sh | |
parent | f2ceb2abe9357942a51bd643683850efd1fc9df7 (diff) | |
[SPARK-18943][SQL] Avoid per-record type dispatch in CSV when reading
## What changes were proposed in this pull request?
`CSVRelation.csvParser` performs type dispatch for every value in every row. This is avoidable because the schema is already available in `CSVRelation`.
So, this PR proposes creating the converters up front according to the schema and then applying them to each value.
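The idea can be illustrated with a minimal sketch that does not depend on Spark. The names `FieldType`, `makeConverter`, and `makeRowParser` are hypothetical stand-ins chosen for illustration, not the identifiers used in this patch; the point is only that the `match` on the type runs once per column when the parser is built, not once per value.

```scala
// Hypothetical, simplified stand-ins for Spark's data types (not the real API).
sealed trait FieldType
case object DoubleField extends FieldType
case object BooleanField extends FieldType
case object StringField extends FieldType

// Type dispatch happens here, once per column, when the converter is built.
def makeConverter(fieldType: FieldType): String => Any = fieldType match {
  case DoubleField  => (s: String) => s.toDouble
  case BooleanField => (s: String) => s.toBoolean
  case StringField  => (s: String) => s
}

// Build one converter per schema field up front and return a per-record
// function that only indexes into the prepared converters.
def makeRowParser(schema: Seq[FieldType]): Array[String] => Array[Any] = {
  val converters: Array[String => Any] = schema.map(makeConverter).toArray
  (tokens: Array[String]) => {
    val out = new Array[Any](converters.length)
    var i = 0
    while (i < converters.length) {
      out(i) = converters(i)(tokens(i))
      i += 1
    }
    out
  }
}

// Usage: the parser is created once and reused for every record.
val parse = makeRowParser(Seq(DoubleField, StringField, BooleanField))
val parsed = parse(Array("1.0", "test", "FALSE"))
```

This sketch assumes each record has exactly as many tokens as the schema has fields; handling of malformed records is left out.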
I ran a small benchmark, shown below, that replicates the logic in https://github.com/apache/spark/blob/7c33b0fd050f3d2b08c1cfd7efbff8166832c1af/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L170-L178 to test the updated code path.
```scala
test("Benchmark for CSV converter") {
  var numMalformedRecords = 0
  val N = 500 << 12
  val schema = StructType(
    StructField("a", StringType) ::
    StructField("b", StringType) ::
    StructField("c", StringType) ::
    StructField("d", StringType) :: Nil)
  // One sample record, repeated N times.
  val row = Array("1.0", "test", "2015-08-20 14:57:00", "FALSE")
  val data = spark.sparkContext.parallelize(List.fill(N)(row))
  val parser = CSVRelation.csvParser(schema, schema.fieldNames, CSVOptions())
  val benchmark = new Benchmark("CSV converter", N)
  benchmark.addCase("cast CSV string tokens", 10) { _ =>
    data.flatMap { recordTokens =>
      parser(recordTokens, numMalformedRecords)
    }.collect()
  }
  benchmark.run()
}
```
**Before**
```
CSV converter:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
cast CSV string tokens                        1061 / 1130          1.9         517.9       1.0X
```
**After**
```
CSV converter:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
cast CSV string tokens                         940 / 1011          2.2         459.2       1.0X
```
## How was this patch tested?
Tests in `CSVTypeCastSuite` and `CSVRelation`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #16351 from HyukjinKwon/type-dispatch.