author     Reynold Xin <rxin@databricks.com>  2016-06-20 21:46:12 -0700
committer  Reynold Xin <rxin@databricks.com>  2016-06-20 21:46:12 -0700
commit     c775bf09e0c3540f76de3f15d3fd35112a4912c1
tree       7a778d6821ffc5555779598fa9dae0c812229f5e
parent     217db56ba11fcdf9e3a81946667d1d99ad7344ee
[SPARK-13792][SQL] Limit logging of bad records in CSV data source
## What changes were proposed in this pull request?

This pull request adds a new option (`maxMalformedLogPerPartition`) to the CSV reader to limit the maximum number of log messages Spark generates per partition for malformed records. The error log looks something like this:

```
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
```

Closes #12173

## How was this patch tested?

Manually tested.

Author: Reynold Xin <rxin@databricks.com>

Closes #13795 from rxin/SPARK-13792.
Diffstat (limited to 'python')
-rw-r--r--  python/pyspark/sql/readwriter.py | 4
1 file changed, 4 insertions, 0 deletions
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 72fd184d58..89506ca02f 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -392,6 +392,10 @@ class DataFrameReader(ReaderUtils):
:param maxCharsPerColumn: defines the maximum number of characters allowed for any given
value being read. If None is set, it uses the default value,
``1000000``.
+ :param maxMalformedLogPerPartition: sets the maximum number of malformed rows Spark will
+ log for each partition. Malformed records beyond this
+ number will be ignored. If None is set, it
+ uses the default value, ``10``.
:param mode: allows a mode for dealing with corrupt records during parsing. If None is
set, it uses the default value, ``PERMISSIVE``.
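
For context, here is a minimal PySpark sketch of how the new option can be used; the SparkSession setup and file path are illustrative and not part of this patch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-malformed-log-demo").getOrCreate()

# Drop rows that fail to parse, and log at most 10 of them per partition
# before the "will not be logged" cutoff message shown above.
df = (spark.read
      .option("mode", "DROPMALFORMED")
      .option("maxMalformedLogPerPartition", 10)
      .csv("/path/to/data.csv"))  # placeholder path

df.show()
```

Equivalently, `maxMalformedLogPerPartition=10` can be passed as a parameter to `DataFrameReader.csv()`, which is what the docstring change above documents.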