author     Michael Armbrust <michael@databricks.com>    2016-09-29 13:01:10 -0700
committer  Yin Huai <yhuai@databricks.com>              2016-09-29 13:01:10 -0700
commit     fe33121a53384811a8e094ab6c05dc85b7c7ca87 (patch)
tree       d0575a3d0eefe46ea4b8e200e70d0834b566b477 /python
parent     027dea8f294504bc5cd8bfedde546d171cb78657 (diff)
[SPARK-17699] Support for parsing JSON string columns
Spark SQL has great support for reading text files that contain JSON data. In many cases, however, the JSON data is just one column among others; this is particularly true when reading from sources such as Kafka. This PR adds a new function, `from_json`, that converts a string column into a nested `StructType` with a user-specified schema. Example usage:

```scala
val df = Seq("""{"a": 1}""").toDS()
val schema = new StructType().add("a", IntegerType)

df.select(from_json($"value", schema) as 'json) // => [json: <a: int>]
```

This PR adds support for Java, Scala, and Python. I leveraged our existing JSON parsing support by moving it into Catalyst (so that we could define expressions using it). I left SQL out for now because I'm not sure how users would specify a schema.

Author: Michael Armbrust <michael@databricks.com>

Closes #15274 from marmbrus/jsonParser.
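Since the Python diff below only exercises the basic call, here is a minimal PySpark sketch that also passes the `options` argument. `allowUnquotedFieldNames` is one of the standard JSON datasource options; the session, DataFrame, and column names are illustrative, not part of this PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.appName("from_json_sketch").getOrCreate()

# A string column holding JSON, as it might arrive from a source like Kafka.
# The field name is unquoted, which the JSON parser rejects by default.
df = spark.createDataFrame([(1, "{a: 1}")], ("key", "value"))
schema = StructType([StructField("a", IntegerType())])

# Pass a parser option through to the underlying JSON datasource.
parsed = df.select(
    from_json(df.value, schema, {"allowUnquotedFieldNames": "true"}).alias("json"))
parsed.show()
```

The options dict mirrors what `spark.read.json` accepts, so any flag documented for the JSON datasource should apply here as well.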
Diffstat (limited to 'python')
-rw-r--r--  python/pyspark/sql/functions.py  23
1 file changed, 23 insertions(+), 0 deletions(-)
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 89b3c07c07..45d6bf944b 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -1706,6 +1706,29 @@ def json_tuple(col, *fields):
     return Column(jc)
 
 
+@since(2.1)
+def from_json(col, schema, options={}):
+    """
+    Parses a column containing a JSON string into a :class:`StructType` with the
+    specified schema. Returns ``null`` in the case of an unparseable string.
+
+    :param col: string column in JSON format
+    :param schema: a :class:`StructType` to use when parsing the JSON column
+    :param options: options to control parsing; accepts the same options as the JSON datasource
+
+    >>> from pyspark.sql.types import *
+    >>> data = [(1, '''{"a": 1}''')]
+    >>> schema = StructType([StructField("a", IntegerType())])
+    >>> df = spark.createDataFrame(data, ("key", "value"))
+    >>> df.select(from_json(df.value, schema).alias("json")).collect()
+    [Row(json=Row(a=1))]
+    """
+
+    sc = SparkContext._active_spark_context
+    jc = sc._jvm.functions.from_json(_to_java_column(col), schema.json(), options)
+    return Column(jc)
+
+
 @since(1.5)
 def size(col):
     """