[SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values properly

## Problem

CSV in Spark 2.0.0:
- does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression compared to 1.6;
- does not read empty values (specified by `options.nullValue`) as `null`s for `StringType` -- this is compatible with 1.6 but leads to problems like SPARK-16903.

## What changes were proposed in this pull request?

This patch makes changes to read all empty values back as `null`s.

## How was this patch tested?

New test cases.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14118 from lw-lin/csv-cast-null.
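
The behaviour this patch aims for can be illustrated with a small plain-Python sketch. It is not Spark's implementation: `cast_field` and `read_rows` are hypothetical helpers, the Python types `str`, `int` and `bool` stand in for Spark's column types, and the boolean parsing rule is an assumption made for the example. The only point it shows is that a cell equal to `nullValue` becomes null for every type, including strings.

```python
import csv
import io


def cast_field(raw, target_type, null_value=""):
    """Hypothetical helper: map a raw CSV cell to a typed value.

    A cell equal to null_value becomes None for every target type,
    including str, which mirrors the behaviour described in this patch.
    """
    if raw == null_value:
        return None
    if target_type is bool:
        # Assumed parsing rule for the sketch, not Spark's actual rule.
        return raw.strip().lower() == "true"
    return target_type(raw)


def read_rows(text, schema, null_value=""):
    """Hypothetical helper: parse CSV text into tuples typed by schema."""
    reader = csv.reader(io.StringIO(text))
    return [
        tuple(cast_field(cell, typ, null_value) for cell, typ in zip(row, schema))
        for row in reader
    ]


# Default null_value is the empty string, matching the documented default.
rows = read_rows("alice,,true\n,30,\n", (str, int, bool))

# A custom null marker is applied to the string column as well.
rows_na = read_rows("NA,1,false\n", (str, int, bool), null_value="NA")
```

In PySpark the corresponding knob is the `nullValue` parameter of `DataFrameReader.csv` and `DataStreamReader.csv`, whose docstrings are updated in the diff below.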
author: Liwei Lin <lwlin7@gmail.com> 2016-09-18 19:25:58 +0100
committer: Sean Owen <sowen@cloudera.com> 2016-09-18 19:25:58 +0100
commit: 1dbb725dbef30bf7633584ce8efdb573f2d92bca (patch)
tree: ca63691ee0b6e70ed661c95743c4c140126bb0e2 /python/pyspark
parent: 7151011b38a841d9d4bc2e453b9a7cfe42f74f8f (diff)
download: spark-1dbb725dbef30bf7633584ce8efdb573f2d92bca.tar.gz
spark-1dbb725dbef30bf7633584ce8efdb573f2d92bca.tar.bz2
spark-1dbb725dbef30bf7633584ce8efdb573f2d92bca.zip
2 files changed, 4 insertions, 2 deletions
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 3d79e0cccc..a6860efa89 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -329,7 +329,8 @@ class DataFrameReader(OptionUtils):
                                          being read should be skipped. If None is set, it uses
                                          the default value, ``false``.
         :param nullValue: sets the string representation of a null value. If None is set, it uses
-                          the default value, empty string.
+                          the default value, empty string. Since 2.0.1, this ``nullValue`` param
+                          applies to all supported types including the string type.
         :param nanValue: sets the string representation of a non-number value. If None is set, it
                          uses the default value, ``NaN``.
         :param positiveInf: sets the string representation of a positive infinity value. If None
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index 67375f6b5f..01364517ed 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -497,7 +497,8 @@ class DataStreamReader(OptionUtils):
                                          being read should be skipped. If None is set, it uses
                                          the default value, ``false``.
         :param nullValue: sets the string representation of a null value. If None is set, it uses
-                          the default value, empty string.
+                          the default value, empty string. Since 2.0.1, this ``nullValue`` param
+                          applies to all supported types including the string type.
         :param nanValue: sets the string representation of a non-number value. If None is set, it
                          uses the default value, ``NaN``.
         :param positiveInf: sets the string representation of a positive infinity value. If None