author:    hyukjinkwon <gurwls223@gmail.com>  2016-04-02 23:12:04 -0700
committer: Davies Liu <davies.liu@gmail.com>  2016-04-02 23:12:04 -0700
commit:    2262a93358c2f6d4cfb73645c4ebc963c5640ec8
tree:      6c18cee7dfc269e1b6cf7e213aa7043afbc98469 /python/pyspark
parent:    7be46205083fc688249ee619ac7758904f7aa55d
[SPARK-14231] [SQL] JSON data source infers floating-point values as a double when they do not fit in a decimal
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14231
Currently, the JSON data source can infer `DecimalType` for big numbers, and it supports a `floatAsBigDecimal` option that reads floating-point values as `DecimalType`.
However, Spark's `DecimalType` has a few restrictions:
1. The precision cannot be greater than 38.
2. The scale cannot be greater than the precision.
Currently, neither restriction is handled during inference.
This PR handles these cases by inferring such values as `DoubleType` instead. Also, the option was renamed from `floatAsBigDecimal` to `prefersDecimal`, as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
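The two restrictions, and the fallback decision, can be sketched outside Spark. This is a minimal illustration only; `fits_in_spark_decimal` is a hypothetical helper, not Spark's actual inference code. It mirrors `java.math.BigDecimal` semantics, where precision counts significant digits of the unscaled value:

```python
from decimal import Decimal

def fits_in_spark_decimal(value: str) -> bool:
    """Illustrative check of the two DecimalType restrictions:
    precision <= 38 and scale <= precision (not Spark's real code)."""
    d = Decimal(value)
    _, digits, exponent = d.as_tuple()
    scale = -exponent if exponent < 0 else 0  # digits after the decimal point
    precision = len(digits)                   # significant digits, as in java.math.BigDecimal
    return precision <= 38 and scale <= precision

print(fits_in_spark_decimal("0.01"))          # False: scale 2 > precision 1
print(fits_in_spark_decimal("3.14"))          # True: precision 3, scale 2
print(fits_in_spark_decimal("1" + "0" * 38))  # False: precision 39 > 38
```

Note that `"0.01"` (precision 1, scale 2) is exactly the value that triggers the `Decimal scale (2) cannot be greater than precision (1)` error shown under "Before"; after this PR, values failing either check are inferred as `DoubleType`.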
So, the code below:
```scala
def doubleRecords: RDD[String] =
  sqlContext.sparkContext.parallelize(
    s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
    s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil)

val jsonDF = sqlContext.read
  .option("prefersDecimal", "true")
  .json(doubleRecords)
jsonDF.printSchema()
```
produces the following:
- **Before**
```scala
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
at
...
```
- **After**
```scala
root
|-- a: double (nullable = true)
|-- b: double (nullable = true)
```
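As the "After" schema shows, a 39-digit integer like the one in the example records still round-trips as a double. A quick illustration in plain Python, outside Spark (with the usual caveat that a double may lose precision for such values):

```python
# The example records use a = 1 followed by 38 zeros (39 digits), which
# exceeds DecimalType's maximum precision of 38 but fits in a
# double-precision float (possibly losing precision).
big = int("1" + "0" * 38)
as_double = float(big)
print(as_double)  # 1e+38
```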
## How was this patch tested?
Unit tests were added, and `./dev/run_tests` was run for coding style checks.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #12030 from HyukjinKwon/SPARK-14231.
Diffstat (limited to 'python/pyspark')
 python/pyspark/sql/readwriter.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
```diff
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index cca57a385c..0cef37e57c 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -152,8 +152,8 @@ class DataFrameReader(object):
         You can set the following JSON-specific options to deal with non-standard JSON files:
             * ``primitivesAsString`` (default ``false``): infers all primitive values as a string \
               type
-            * `floatAsBigDecimal` (default `false`): infers all floating-point values as a decimal \
-              type
+            * `prefersDecimal` (default `false`): infers all floating-point values as a decimal \
+              type. If the values do not fit in decimal, then it infers them as doubles.
             * ``allowComments`` (default ``false``): ignores Java/C++ style comment in JSON records
             * ``allowUnquotedFieldNames`` (default ``false``): allows unquoted JSON field names
             * ``allowSingleQuotes`` (default ``true``): allows single quotes in addition to double \
```