author: hyukjinkwon <gurwls223@gmail.com> 2017-03-19 22:33:01 -0700
committer: Felix Cheung <felixcheung@apache.org> 2017-03-19 22:33:01 -0700
commit: 0cdcf9114527a2c359c25e46fd6556b3855bfb28 (patch)
tree: b315a01420500d41669e9436658626f8890b7143 /python/pyspark
parent: 990af630d0d569880edd9c7ce9932e10037a28ab (diff)
[SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array
## What changes were proposed in this pull request?
This PR proposes to support an array of struct type in `to_json` as below:
```scala
import org.apache.spark.sql.functions._
val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
df.select(to_json($"a").as("json")).show()
```
```
+----------+
| json|
+----------+
|[{"_1":1}]|
+----------+
```
Currently, it throws an exception as below (a newline manually inserted for readability):
```
org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
mismatch: structtojson requires that the expression is a struct expression.;;
```
This also enables a round trip with `from_json`, as below:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
df.show()
// Read back.
df.select(to_json($"array").as("json")).show()
```
```
+----------+
| array|
+----------+
|[[1], [2]]|
+----------+
+-----------------+
| json|
+-----------------+
|[{"a":1},{"a":2}]|
+-----------------+
```
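Independent of Spark, the JSON shape that `to_json` produces for an array of structs can be sketched with the standard `json` module (the dicts below are hypothetical stand-ins for the rows; compact separators mimic Spark's whitespace-free output):

```python
import json

# Plain dicts standing in for the struct values of the array column.
rows = [{"age": 2, "name": "Alice"}, {"age": 3, "name": "Bob"}]

# to_json on an ArrayType of StructType emits a single JSON array string;
# compact separators and sorted keys reproduce that shape here.
array_json = json.dumps(rows, separators=(",", ":"), sort_keys=True)
print(array_json)  # [{"age":2,"name":"Alice"},{"age":3,"name":"Bob"}]

# from_json is the inverse: parsing the string recovers the array of structs,
# which json.loads models as a list of dicts.
assert json.loads(array_json) == rows
```

This is only an analogy for the string format; the actual parsing and generation in Spark happen per-row inside the `JsonToStructs` and `StructsToJson` expressions.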
Also, this PR proposes to rename `StructToJson` to `StructsToJson` and `JsonToStruct` to `JsonToStructs`.
## How was this patch tested?
Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, a doctest for Python, and a test in `test_sparkSQL.R` for R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #17192 from HyukjinKwon/SPARK-19849.
Diffstat (limited to 'python/pyspark')
-rw-r--r-- | python/pyspark/sql/functions.py | 15 |
1 file changed, 10 insertions, 5 deletions
```diff
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 376b86ea69..f9121e60f3 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -1774,10 +1774,11 @@ def json_tuple(col, *fields):
 def from_json(col, schema, options={}):
     """
     Parses a column containing a JSON string into a [[StructType]] or [[ArrayType]]
-    with the specified schema. Returns `null`, in the case of an unparseable string.
+    of [[StructType]]s with the specified schema. Returns `null`, in the case of an unparseable
+    string.
 
     :param col: string column in json format
-    :param schema: a StructType or ArrayType to use when parsing the json column
+    :param schema: a StructType or ArrayType of StructType to use when parsing the json column
     :param options: options to control parsing. accepts the same options as the json datasource
 
     >>> from pyspark.sql.types import *
@@ -1802,10 +1803,10 @@ def from_json(col, schema, options={}):
 @since(2.1)
 def to_json(col, options={}):
     """
-    Converts a column containing a [[StructType]] into a JSON string. Throws an exception,
-    in the case of an unsupported type.
+    Converts a column containing a [[StructType]] or [[ArrayType]] of [[StructType]]s into a
+    JSON string. Throws an exception, in the case of an unsupported type.
 
-    :param col: name of column containing the struct
+    :param col: name of column containing the struct or array of the structs
     :param options: options to control converting. accepts the same options as the json datasource
 
     >>> from pyspark.sql import Row
@@ -1814,6 +1815,10 @@ def to_json(col, options={}):
     >>> df = spark.createDataFrame(data, ("key", "value"))
     >>> df.select(to_json(df.value).alias("json")).collect()
     [Row(json=u'{"age":2,"name":"Alice"}')]
+    >>> data = [(1, [Row(name='Alice', age=2), Row(name='Bob', age=3)])]
+    >>> df = spark.createDataFrame(data, ("key", "value"))
+    >>> df.select(to_json(df.value).alias("json")).collect()
+    [Row(json=u'[{"age":2,"name":"Alice"},{"age":3,"name":"Bob"}]')]
     """
     sc = SparkContext._active_spark_context
```