author | Takeshi Yamamuro <yamamuro@apache.org> | 2017-02-23 12:09:36 -0800 |
---|---|---|
committer | Wenchen Fan <wenchen@databricks.com> | 2017-02-23 12:09:36 -0800 |
commit | 09ed6e7711d0758c24944516a263b8bd4e1728fc (patch) | |
tree | 14decfedc993886ff382f9313f042053dc564f48 /README.md | |
parent | 9bf4e2baad0e2851da554d85223ffaa029cfa490 (diff) | |
[SPARK-18699][SQL] Put malformed tokens into a new field when parsing CSV data
## What changes were proposed in this pull request?
This PR adds logic to put malformed tokens into a new field when parsing CSV data in permissive mode. In the current master, if the CSV parser hits a malformed token, it throws the exception below (and the job then fails):
```
Caused by: java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
at
```
When users load large CSV-formatted data, such a job failure is confusing. So, this fix sets NULL for the original columns and puts the malformed tokens in a new field.
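The behavior described above can be illustrated with a minimal, Spark-free sketch. This is not the code from this patch: the function `parse_permissive`, the field name `_corrupt_record`, and the choice to store the raw line (rather than individual tokens) are assumptions made for illustration only.

```python
import csv
from datetime import date

# Hypothetical name for the extra field; an assumption, not taken from the patch.
CORRUPT_FIELD = "_corrupt_record"


def parse_permissive(text, schema):
    """Parse CSV text permissively.

    `schema` is a list of (column name, converter) pairs. A line that cannot
    be converted yields None for every typed column and keeps the raw line in
    CORRUPT_FIELD instead of raising.
    """
    names = [name for name, _ in schema]
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tokens = next(csv.reader([line]))
        row = {name: None for name in names}
        row[CORRUPT_FIELD] = None
        try:
            if len(tokens) != len(schema):
                raise ValueError("unexpected number of tokens")
            converted = [conv(tok) for (_, conv), tok in zip(schema, tokens)]
        except ValueError:
            # Malformed input: typed columns stay None, raw line is preserved.
            row[CORRUPT_FIELD] = line
        else:
            row.update(zip(names, converted))
        rows.append(row)
    return rows


schema = [("id", int), ("day", date.fromisoformat)]
rows = parse_permissive("1,2017-02-23\n2,not-a-date\n", schema)
```

In this sketch the second line fails date conversion, so its `id` and `day` are `None` and the raw text `2,not-a-date` ends up in the extra field, while the first line is parsed normally and the parse as a whole does not fail.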
## How was this patch tested?
Added tests in `CSVSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes #16928 from maropu/SPARK-18699-2.