author    Reynold Xin <rxin@databricks.com>  2015-11-16 00:06:14 -0800
committer Reynold Xin <rxin@databricks.com>  2015-11-16 00:06:14 -0800
commit    42de5253f327bd7ee258b0efb5024f3847fa3b51 (patch)
tree      67437c76160ebee36d3b378599b1d519e6c72f8a /python/pyspark/sql/readwriter.py
parent    fd50fa4c3eff42e8adeeabe399ddba0edac930c8 (diff)
[SPARK-11745][SQL] Enable more JSON parsing options
This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:

* `allowComments` (default `false`): ignores Java/C++ style comments in JSON records
* `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names
* `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)

To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options. Also updated documentation to explain these options.

Scala
![screen shot 2015-11-15 at 6 12 12 pm](https://cloud.githubusercontent.com/assets/323388/11172965/e3ace6ec-8bc4-11e5-805e-2d78f80d0ed6.png)

Python
![screen shot 2015-11-15 at 6 11 28 pm](https://cloud.githubusercontent.com/assets/323388/11172964/e23ed6ee-8bc4-11e5-8216-312f5983acd5.png)

Author: Reynold Xin <rxin@databricks.com>

Closes #9724 from rxin/SPARK-11745.
Diffstat (limited to 'python/pyspark/sql/readwriter.py')
-rw-r--r--  python/pyspark/sql/readwriter.py  10
1 file changed, 10 insertions(+), 0 deletions(-)
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index 927f407742..7b8ddb9feb 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -153,6 +153,16 @@ class DataFrameReader(object):
or RDD of Strings storing JSON objects.
:param schema: an optional :class:`StructType` for the input schema.
+ You can set the following JSON-specific options to deal with non-standard JSON files:
+ * ``primitivesAsString`` (default ``false``): infers all primitive values as a string \
+ type
+ * ``allowComments`` (default ``false``): ignores Java/C++ style comments in JSON records
+ * ``allowUnquotedFieldNames`` (default ``false``): allows unquoted JSON field names
+ * ``allowSingleQuotes`` (default ``true``): allows single quotes in addition to double \
+ quotes
+ * ``allowNumericLeadingZeros`` (default ``false``): allows leading zeros in numbers \
+ (e.g. 00012)
+
>>> df1 = sqlContext.read.json('python/test_support/sql/people.json')
>>> df1.dtypes
[('age', 'bigint'), ('name', 'string')]
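Each option added by this patch relaxes a rule that strict JSON parsers enforce. A minimal sketch using Python's standard-library `json` module shows one sample record per option and why a strict parser rejects it; the commented `sqlContext.read` call at the end mirrors the new reader API from this patch (the file path is hypothetical):

```python
import json

# One non-standard record per new option; all are invalid strict JSON.
samples = {
    "allowComments": '{"name": "alice"} // trailing comment',
    "allowUnquotedFieldNames": '{name: "alice"}',
    "allowSingleQuotes": "{'name': 'alice'}",
    "allowNumericLeadingZeros": '{"age": 00012}',
}

for option, record in samples.items():
    try:
        json.loads(record)
        print(option, "-> parsed")
    except ValueError:
        # json.loads raises JSONDecodeError (a ValueError) on each record,
        # which is why the corresponding Spark option exists.
        print(option, "-> rejected by strict parser")

# With this patch, enabling the matching option lets Spark read such
# records (path is hypothetical):
#   df = sqlContext.read.option("allowComments", "true") \
#                       .json("path/to/nonstandard.json")
```

Note that `allowSingleQuotes` defaults to `true`, so single-quoted records work out of the box; the other three options must be enabled explicitly.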