author     Michael Armbrust <michael@databricks.com>    2016-09-29 13:01:10 -0700
committer  Yin Huai <yhuai@databricks.com>              2016-09-29 13:01:10 -0700
commit     fe33121a53384811a8e094ab6c05dc85b7c7ca87 (patch)
tree       d0575a3d0eefe46ea4b8e200e70d0834b566b477 /python
parent     027dea8f294504bc5cd8bfedde546d171cb78657 (diff)
[SPARK-17699] Support for parsing JSON string columns
Spark SQL has great support for reading text files that contain JSON data. In many cases, however, the JSON data is just one column among others; this is particularly true when reading from sources such as Kafka. This PR adds a new function, `from_json`, that converts a string column into a nested `StructType` with a user-specified schema. Example usage:

```scala
val df = Seq("""{"a": 1}""").toDS()
val schema = new StructType().add("a", IntegerType)

df.select(from_json($"value", schema) as 'json) // => [json: <a: int>]
```

This PR adds support for Java, Scala, and Python. I leveraged our existing JSON parsing support by moving it into Catalyst (so that we could define expressions using it). I left SQL out for now because I'm not sure how users would specify a schema.

Author: Michael Armbrust <michael@databricks.com>

Closes #15274 from marmbrus/jsonParser.
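Since the Python diff below only exercises the basic call, here is a minimal PySpark sketch that also passes the `options` argument. `allowUnquotedFieldNames` is one of the standard JSON datasource options; the session, DataFrame, and column names are illustrative, not part of this PR:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.appName("from_json_sketch").getOrCreate()

# A string column holding JSON, as it might arrive from a source like Kafka.
# The field name is unquoted, which the JSON parser rejects by default.
df = spark.createDataFrame([(1, "{a: 1}")], ("key", "value"))
schema = StructType([StructField("a", IntegerType())])

# Pass a parser option through to the underlying JSON datasource.
parsed = df.select(
    from_json(df.value, schema, {"allowUnquotedFieldNames": "true"}).alias("json"))
parsed.show()
```

The options dict mirrors what `spark.read.json` accepts, so any flag documented for the JSON datasource should apply here as well.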
Diffstat (limited to 'python')
-rw-r--r--  python/pyspark/sql/functions.py  23
1 file changed, 23 insertions(+), 0 deletions(-)
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 89b3c07c07..45d6bf944b 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -1706,6 +1706,29 @@ def json_tuple(col, *fields):
     return Column(jc)
 
 
+@since(2.1)
+def from_json(col, schema, options={}):
+    """
+    Parses a column containing a JSON string into a :class:`StructType` with the
+    specified schema. Returns ``null`` in the case of an unparseable string.
+
+    :param col: string column in JSON format
+    :param schema: a :class:`StructType` to use when parsing the JSON column
+    :param options: options to control parsing; accepts the same options as the JSON datasource
+
+    >>> from pyspark.sql.types import *
+    >>> data = [(1, '''{"a": 1}''')]
+    >>> schema = StructType([StructField("a", IntegerType())])
+    >>> df = spark.createDataFrame(data, ("key", "value"))
+    >>> df.select(from_json(df.value, schema).alias("json")).collect()
+    [Row(json=Row(a=1))]
+    """
+
+    sc = SparkContext._active_spark_context
+    jc = sc._jvm.functions.from_json(_to_java_column(col), schema.json(), options)
+    return Column(jc)
+
+
 @since(1.5)
 def size(col):
     """