author    Michael Armbrust <michael@databricks.com>  2016-09-29 13:01:10 -0700
committer Yin Huai <yhuai@databricks.com>  2016-09-29 13:01:10 -0700
commit    fe33121a53384811a8e094ab6c05dc85b7c7ca87
tree      d0575a3d0eefe46ea4b8e200e70d0834b566b477 /python/pyspark/sql/functions.py
parent    027dea8f294504bc5cd8bfedde546d171cb78657
[SPARK-17699] Support for parsing JSON string columns
Spark SQL has great support for reading text files that contain JSON data. However, in many cases the JSON data is just one column among others. This is particularly true when reading from sources such as Kafka. This PR adds a new function `from_json` that converts a string column into a nested `StructType` with a user-specified schema.
Example usage:
```scala
val df = Seq("""{"a": 1}""").toDS()
val schema = new StructType().add("a", IntegerType)
df.select(from_json($"value", schema) as 'json) // => [json: <a: int>]
```
This PR adds support for Java, Scala, and Python. I leveraged our existing JSON parsing support by moving it into Catalyst (so that we could define expressions using it). I left SQL out for now, because I'm not sure how users would specify a schema.
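The semantics described above — parse each row's JSON string against a user-specified schema, yielding `null` for unparseable input — can be sketched in plain Python with only the standard library. This is an illustrative model of the behavior, not Spark's actual Catalyst implementation; `from_json_sketch` and its `field_names` parameter are hypothetical names for this sketch:

```python
import json


def from_json_sketch(value, field_names):
    """Mimic from_json semantics: parse a JSON string into a record
    projected onto the given field names, or return None if the string
    cannot be parsed (as from_json returns null for unparseable rows)."""
    try:
        parsed = json.loads(value)
    except (ValueError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    # Project onto the user-specified schema, like the StructType argument.
    return {name: parsed.get(name) for name in field_names}


rows = ['{"a": 1}', 'not json', '{"a": 2, "b": 3}']
print([from_json_sketch(v, ["a"]) for v in rows])
# → [{'a': 1}, None, {'a': 2}]
```

Note how the second row parses to `None` rather than raising, matching the documented null-on-unparseable behavior, and how extra fields not in the schema are dropped.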
Author: Michael Armbrust <michael@databricks.com>
Closes #15274 from marmbrus/jsonParser.
Diffstat (limited to 'python/pyspark/sql/functions.py')

-rw-r--r--  python/pyspark/sql/functions.py | 23
1 file changed, 23 insertions(+), 0 deletions(-)
```diff
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 89b3c07c07..45d6bf944b 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -1706,6 +1706,29 @@ def json_tuple(col, *fields):
     return Column(jc)
 
 
+@since(2.1)
+def from_json(col, schema, options={}):
+    """
+    Parses a column containing a JSON string into a [[StructType]] with the
+    specified schema. Returns `null`, in the case of an unparseable string.
+
+    :param col: string column in json format
+    :param schema: a StructType to use when parsing the json column
+    :param options: options to control parsing. accepts the same options as the json datasource
+
+    >>> from pyspark.sql.types import *
+    >>> data = [(1, '''{"a": 1}''')]
+    >>> schema = StructType([StructField("a", IntegerType())])
+    >>> df = spark.createDataFrame(data, ("key", "value"))
+    >>> df.select(from_json(df.value, schema).alias("json")).collect()
+    [Row(json=Row(a=1))]
+    """
+    sc = SparkContext._active_spark_context
+    jc = sc._jvm.functions.from_json(_to_java_column(col), schema.json(), options)
+    return Column(jc)
+
+
 @since(1.5)
 def size(col):
     """
```