author Liwei Lin <lwlin7@gmail.com> 2016-09-18 19:25:58 +0100
committer Sean Owen <sowen@cloudera.com> 2016-09-18 19:25:58 +0100
commit 1dbb725dbef30bf7633584ce8efdb573f2d92bca
tree ca63691ee0b6e70ed661c95743c4c140126bb0e2 /python/pyspark
parent 7151011b38a841d9d4bc2e453b9a7cfe42f74f8f
[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly
## Problem

CSV in Spark 2.0.0:
- does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType`; this is a regression compared to 1.6.
- does not read empty values (specified by `options.nullValue`) as `null`s for `StringType`; this is compatible with 1.6 but leads to problems like SPARK-16903.

## What changes were proposed in this pull request?

This patch makes changes to read all empty values back as `null`s.

## How was this patch tested?

New test cases.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14118 from lw-lin/csv-cast-null.
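The behavior described in the commit message can be illustrated with a small plain-Python sketch. It is not Spark's implementation: `cast_token`, its `type_name` strings, and the accepted date and timestamp formats are hypothetical names and choices made for illustration. The only point it shows is that the `nullValue` check runs before any type-specific conversion, so it applies to every type, including strings.

```python
from datetime import date, datetime


# Hypothetical helper for illustration only; this is not Spark's CSV cast code.
def cast_token(token, type_name, null_value=""):
    """Convert one raw CSV field to a Python value, or None for the null marker."""
    # The null check comes first and applies to every type, strings included.
    if token is None or token == null_value:
        return None
    if type_name == "string":
        return token
    if type_name == "boolean":
        lowered = token.strip().lower()
        if lowered not in ("true", "false"):
            raise ValueError("not a boolean: %r" % token)
        return lowered == "true"
    if type_name == "int":
        return int(token)
    if type_name == "date":
        # Assumes ISO dates (YYYY-MM-DD) for the sake of the example.
        return date.fromisoformat(token)
    if type_name == "timestamp":
        # Assumes a fixed "YYYY-MM-DD HH:MM:SS" layout for the sake of the example.
        return datetime.strptime(token, "%Y-%m-%d %H:%M:%S")
    raise ValueError("unsupported type: %r" % type_name)


# With the default null marker (empty string), empty fields become None for all types.
print(cast_token("", "boolean"))                   # None
print(cast_token("", "string"))                    # None
# With a custom marker, only that marker is treated as null.
print(cast_token("NA", "date", null_value="NA"))   # None
print(cast_token("", "string", null_value="NA"))   # prints an empty line (empty string kept)
```

Placing the null check ahead of the per-type branches is what keeps the behavior uniform: no individual conversion has to decide what an empty or marker value means.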
Diffstat (limited to 'python/pyspark')
-rw-r--r-- python/pyspark/sql/readwriter.py | 3
-rw-r--r-- python/pyspark/sql/streaming.py | 3
2 files changed, 4 insertions, 2 deletions
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 3d79e0cccc..a6860efa89 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -329,7 +329,8 @@ class DataFrameReader(OptionUtils):
                                          being read should be skipped. If None is set, it uses
                                          the default value, ``false``.
         :param nullValue: sets the string representation of a null value. If None is set, it uses
-                          the default value, empty string.
+                          the default value, empty string. Since 2.0.1, this ``nullValue`` param
+                          applies to all supported types including the string type.
         :param nanValue: sets the string representation of a non-number value. If None is set, it
                          uses the default value, ``NaN``.
         :param positiveInf: sets the string representation of a positive infinity value. If None
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index 67375f6b5f..01364517ed 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -497,7 +497,8 @@ class DataStreamReader(OptionUtils):
                                          being read should be skipped. If None is set, it uses
                                          the default value, ``false``.
         :param nullValue: sets the string representation of a null value. If None is set, it uses
-                          the default value, empty string.
+                          the default value, empty string. Since 2.0.1, this ``nullValue`` param
+                          applies to all supported types including the string type.
         :param nanValue: sets the string representation of a non-number value. If None is set, it
                          uses the default value, ``NaN``.
         :param positiveInf: sets the string representation of a positive infinity value. If None