author     Reynold Xin <rxin@databricks.com>  2016-06-20 21:46:12 -0700
committer  Reynold Xin <rxin@databricks.com>  2016-06-20 21:46:12 -0700
commit     c775bf09e0c3540f76de3f15d3fd35112a4912c1
tree       7a778d6821ffc5555779598fa9dae0c812229f5e
parent     217db56ba11fcdf9e3a81946667d1d99ad7344ee
[SPARK-13792][SQL] Limit logging of bad records in CSV data source
## What changes were proposed in this pull request?

This pull request adds a new option (`maxMalformedLogPerPartition`) to the CSV reader to limit the maximum number of log messages Spark generates per partition for malformed records. The error log looks something like this:

```
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
```

Closes #12173

## How was this patch tested?

Manually tested.

Author: Reynold Xin <rxin@databricks.com>

Closes #13795 from rxin/SPARK-13792.
Diffstat (limited to 'python')
-rw-r--r--  python/pyspark/sql/readwriter.py | 4
1 file changed, 4 insertions, 0 deletions
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 72fd184d58..89506ca02f 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -392,6 +392,10 @@ class DataFrameReader(ReaderUtils):
:param maxCharsPerColumn: defines the maximum number of characters allowed for any given
value being read. If None is set, it uses the default value,
``1000000``.
+ :param maxMalformedLogPerPartition: sets the maximum number of malformed rows Spark will
+ log for each partition. Malformed records beyond this
+ number will be ignored. If None is set, it
+ uses the default value, ``10``.
:param mode: allows a mode for dealing with corrupt records during parsing. If None is
set, it uses the default value, ``PERMISSIVE``.
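
For context, here is a minimal PySpark sketch of how the new option can be used; the SparkSession setup and file path are illustrative and not part of this patch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-malformed-log-demo").getOrCreate()

# Drop rows that fail to parse, and log at most 10 of them per partition
# before the "will not be logged" cutoff message shown above.
df = (spark.read
      .option("mode", "DROPMALFORMED")
      .option("maxMalformedLogPerPartition", 10)
      .csv("/path/to/data.csv"))  # placeholder path

df.show()
```

Equivalently, `maxMalformedLogPerPartition=10` can be passed as a parameter to `DataFrameReader.csv()`, which is what the docstring change above documents.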