From e474088144cdd2632cf2fef6b2cf10b3cd191c23 Mon Sep 17 00:00:00 2001
From: hyukjinkwon
Date: Mon, 21 Mar 2016 15:42:35 +0800
Subject: [SPARK-13764][SQL] Parse modes in JSON data source

## What changes were proposed in this pull request?

Currently, there is no way to control the behaviour when the JSON data source fails to parse corrupt records.

This PR adds support for parse modes, just like the CSV data source. There are three modes:

- `PERMISSIVE`: When it fails to parse, this sets the field to `null`. This is the default mode, since it is the behaviour the JSON data source has had so far.
- `DROPMALFORMED`: When it fails to parse, this drops the whole record.
- `FAILFAST`: When it fails to parse, it just throws an exception.

This PR also makes the JSON data source share the `ParseModes` of the CSV data source.

## How was this patch tested?

Unit tests were used, and `./dev/run_tests` was run for code style tests.

Author: hyukjinkwon

Closes #11756 from HyukjinKwon/SPARK-13764.
---
 python/pyspark/sql/readwriter.py | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'python/pyspark/sql')

diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 438662bb15..bae9e69df8 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -162,6 +162,14 @@ class DataFrameReader(object):
                 (e.g. 00012)
             * ``allowBackslashEscapingAnyCharacter`` (default ``false``): allows accepting quoting \
                 of all character using backslash quoting mechanism
+            * ``mode`` (default ``PERMISSIVE``): allows a mode for dealing with corrupt records \
+                during parsing.
+                * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
+                  record and puts the malformed string into a new field configured by \
+                  ``spark.sql.columnNameOfCorruptRecord``. When a schema is set by user, it sets \
+                  ``null`` for extra fields.
+                * ``DROPMALFORMED`` : ignores the whole corrupted records.
+                * ``FAILFAST`` : throws an exception when it meets corrupted records.
 
         >>> df1 = sqlContext.read.json('python/test_support/sql/people.json')
         >>> df1.dtypes
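
To make the three modes concrete, here is a plain-Python sketch of the semantics described in the docstring above. It does not use Spark and is not the implementation in this patch: `parse_records` is a hypothetical helper, and `_corrupt_record` is assumed to be the column name configured by `spark.sql.columnNameOfCorruptRecord`.

```python
import json


def parse_records(lines, fields, mode="PERMISSIVE", corrupt_col="_corrupt_record"):
    """Hypothetical illustration of the three parse modes (not Spark code).

    lines       -- iterable of strings, one JSON object per line
    fields      -- field names of the user-supplied schema
    corrupt_col -- assumed stand-in for spark.sql.columnNameOfCorruptRecord
    """
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError("unknown mode: %s" % mode)

    rows = []
    for line in lines:
        try:
            obj = json.loads(line)
            if not isinstance(obj, dict):
                raise ValueError("not a JSON object")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            if mode == "FAILFAST":
                # FAILFAST: stop at the first corrupted record.
                raise ValueError("malformed record in FAILFAST mode: %r" % line)
            if mode == "DROPMALFORMED":
                # DROPMALFORMED: skip the whole corrupted record.
                continue
            # PERMISSIVE: null out the schema fields, keep the raw string.
            row = {f: None for f in fields}
            row[corrupt_col] = line
            rows.append(row)
            continue

        # Well-formed record: fields missing from the input become None.
        row = {f: obj.get(f) for f in fields}
        row[corrupt_col] = None
        rows.append(row)
    return rows


lines = ['{"name": "Ann", "age": 30}', '{"name": "Bob", "age": ']
permissive = parse_records(lines, ["name", "age"])
dropped = parse_records(lines, ["name", "age"], mode="DROPMALFORMED")
```

With the option added by this patch, the corresponding PySpark call would presumably pass the mode through the reader options, for example `sqlContext.read.option("mode", "DROPMALFORMED").json(path)`, where `path` is a placeholder.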