From 4e09a0d5ea50d1cfc936bc87cf3372b4a0aa7dc2 Mon Sep 17 00:00:00 2001
From: hyukjinkwon
Date: Tue, 22 Mar 2016 20:30:48 +0800
Subject: [SPARK-13953][SQL] Specifying the field name for corrupted record
 via option at JSON datasource

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13953

Currently, the JSON data source creates a new field in `PERMISSIVE` mode for storing the malformed string. This field can be renamed via the `spark.sql.columnNameOfCorruptRecord` configuration, but that is a global setting. This PR makes the option applicable per read, so that it can be specified via `option()`. When set, it overrides `spark.sql.columnNameOfCorruptRecord`.

## How was this patch tested?

Unit tests, plus `./dev/run-tests` for coding style checks.

Author: hyukjinkwon

Closes #11881 from HyukjinKwon/SPARK-13953.
---
 python/pyspark/sql/readwriter.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

(limited to 'python/pyspark/sql')

diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index bae9e69df8..cca57a385c 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -166,10 +166,13 @@ class DataFrameReader(object):
                 during parsing.
                 *  ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \
                   record and puts the malformed string into a new field configured by \
-                  ``spark.sql.columnNameOfCorruptRecord``. When a schema is set by user, it sets \
+                  ``columnNameOfCorruptRecord``. When a schema is set by user, it sets \
                   ``null`` for extra fields.
                 *  ``DROPMALFORMED`` : ignores the whole corrupted records.
                 *  ``FAILFAST`` : throws an exception when it meets corrupted records.
+        * ``columnNameOfCorruptRecord`` (default ``_corrupt_record``): allows renaming the \
+          new field having malformed string created by ``PERMISSIVE`` mode. \
+          This overrides ``spark.sql.columnNameOfCorruptRecord``.
         >>> df1 = sqlContext.read.json('python/test_support/sql/people.json')
         >>> df1.dtypes
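To illustrate the semantics this option controls, here is a minimal pure-Python sketch of `PERMISSIVE` mode (not Spark code): well-formed JSON lines are parsed into the requested fields, while a malformed line produces a row of nulls with the raw string stored under a configurable corrupt-record column. The function and parameter names here are hypothetical, chosen only to mirror the option described in this PR.

```python
import json

def parse_json_lines(lines, fields, column_name_of_corrupt_record="_corrupt_record"):
    """Sketch of PERMISSIVE mode: each input line becomes one row dict.

    A line that parses as a JSON object fills `fields` (missing keys
    become None); a malformed line sets every field to None and keeps
    the raw text under `column_name_of_corrupt_record`.
    """
    rows = []
    for line in lines:
        try:
            obj = json.loads(line)
            row = {f: obj.get(f) for f in fields}
            row[column_name_of_corrupt_record] = None
        except ValueError:
            # json.JSONDecodeError subclasses ValueError
            row = {f: None for f in fields}
            row[column_name_of_corrupt_record] = line
        rows.append(row)
    return rows

rows = parse_json_lines(
    ['{"name": "Alice", "age": 30}', '{"name": broken'],
    fields=["name", "age"],
    column_name_of_corrupt_record="_malformed",
)
# The first line parses cleanly; the second lands in the "_malformed" column.
```

With the patch above, the analogous per-read setting in PySpark would presumably be passed as `option("columnNameOfCorruptRecord", "_malformed")` before calling `.json(...)`, rather than through the global `spark.sql.columnNameOfCorruptRecord` configuration.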