author    | Liwei Lin <lwlin7@gmail.com> | 2016-09-18 19:25:58 +0100
committer | Sean Owen <sowen@cloudera.com> | 2016-09-18 19:25:58 +0100
commit    | 1dbb725dbef30bf7633584ce8efdb573f2d92bca (patch)
tree      | ca63691ee0b6e70ed661c95743c4c140126bb0e2 /python
parent    | 7151011b38a841d9d4bc2e453b9a7cfe42f74f8f (diff)
[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly
## Problem
CSV in Spark 2.0.0:
- does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, and `DateType` -- this is a regression compared with 1.6;
- does not read empty values (specified by `options.nullValue`) back as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems such as SPARK-16903.
## What changes were proposed in this pull request?
This patch changes the CSV reader to read all empty values (as specified by `options.nullValue`) back as `null`s, for all supported data types including `StringType`.
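The intended behavior can be sketched in plain Python (no Spark required; `parse_field` and the type-name strings are hypothetical, for illustration only -- the real logic lives in Spark's CSV type-casting code):

```python
def parse_field(raw, dtype, null_value=""):
    """Cast one raw CSV field, treating nullValue matches as null.

    After this patch, a field equal to nullValue is read back as None
    for every supported type, including strings (previously strings
    kept the literal value, and some types mis-handled it entirely).
    """
    if raw == null_value:
        return None
    if dtype == "boolean":
        return raw.lower() == "true"
    if dtype == "int":
        return int(raw)
    return raw  # string and other types pass through


# With the default nullValue of "", empty fields become None uniformly:
row = [parse_field(v, t) for v, t in
       zip(["", "true", "", "7"], ["string", "boolean", "boolean", "int"])]
```

The key point is that the `null_value` check runs before any type-specific casting, so the string type is no longer a special case.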
## How was this patch tested?
New test cases.
Author: Liwei Lin <lwlin7@gmail.com>
Closes #14118 from lw-lin/csv-cast-null.
Diffstat (limited to 'python')
-rw-r--r-- | python/pyspark/sql/readwriter.py | 3 ++-
-rw-r--r-- | python/pyspark/sql/streaming.py  | 3 ++-

2 files changed, 4 insertions(+), 2 deletions(-)
```diff
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 3d79e0cccc..a6860efa89 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -329,7 +329,8 @@ class DataFrameReader(OptionUtils):
                          being read should be skipped. If None is set, it uses
                          the default value, ``false``.
         :param nullValue: sets the string representation of a null value. If None is set, it uses
-                          the default value, empty string.
+                          the default value, empty string. Since 2.0.1, this ``nullValue`` param
+                          applies to all supported types including the string type.
         :param nanValue: sets the string representation of a non-number value. If None is set, it uses
                          the default value, ``NaN``.
         :param positiveInf: sets the string representation of a positive infinity value. If None
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index 67375f6b5f..01364517ed 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -497,7 +497,8 @@ class DataStreamReader(OptionUtils):
                          being read should be skipped. If None is set, it uses
                          the default value, ``false``.
         :param nullValue: sets the string representation of a null value. If None is set, it uses
-                          the default value, empty string.
+                          the default value, empty string. Since 2.0.1, this ``nullValue`` param
+                          applies to all supported types including the string type.
         :param nanValue: sets the string representation of a non-number value. If None is set, it uses
                          the default value, ``NaN``.
         :param positiveInf: sets the string representation of a positive infinity value. If None
```