[SPARK-2010] Support for nested data in PySpark SQL - spark

author	Kan Zhang <kzhang@apache.org>	2014-06-16 11:11:29 -0700
committer	Reynold Xin <rxin@apache.org>	2014-06-16 11:11:29 -0700
commit	4fdb491775bb9c4afa40477dc0069ff6fcadfe25 (patch)
tree	c996de6ecbf6f913b3e7bc8a45f7801aa58266ac /yarn/common/src/main
parent	716c88aa147762f7f617adf34a17edd681d9a4ff (diff)
download	spark-4fdb491775bb9c4afa40477dc0069ff6fcadfe25.tar.gz spark-4fdb491775bb9c4afa40477dc0069ff6fcadfe25.tar.bz2 spark-4fdb491775bb9c4afa40477dc0069ff6fcadfe25.zip

[SPARK-2010] Support for nested data in PySpark SQL

JIRA issue https://issues.apache.org/jira/browse/SPARK-2010

This PR adds support for nested collection types in PySpark SQL, including array, dict, list, set, and tuple. Example:

```
>>> from array import array
>>> from pyspark.sql import SQLContext
>>> sqlCtx = SQLContext(sc)
>>> rdd = sc.parallelize([
...     {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...     {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...                    {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]
True
>>> rdd = sc.parallelize([
...     {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...     {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == \
... [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...  {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]
True
```

Author: Kan Zhang <kzhang@apache.org>

Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits:

1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO
504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL
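For readers who want a feel for what handling nested collection types involves, here is a minimal pure-Python sketch that walks a row recursively and describes each field's nested type. It is an illustration only and not the code in this PR: `describe_type` is a hypothetical helper, the `array<...>` and `map<..., ...>` labels are made up for readability, and it assumes every collection is homogeneous, so the first element is treated as representative.

```python
from array import array


def describe_type(value):
    """Describe the (possibly nested) type of a Python value as a string.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    if isinstance(value, dict):
        if not value:
            return "map<?, ?>"  # element types unknown for an empty dict
        # Assumption: all keys share one type and all values share one type.
        key, val = next(iter(value.items()))
        return "map<%s, %s>" % (describe_type(key), describe_type(val))
    if isinstance(value, (list, tuple, set, array)):
        if len(value) == 0:
            return "array<?>"  # element type unknown for an empty sequence
        # Assumption: the collection is homogeneous, so inspect one element.
        return "array<%s>" % describe_type(next(iter(value)))
    # Scalars: fall back to the Python type name.
    return type(value).__name__


# One row shaped like the second example in the commit message.
row = {"f1": [[1, 2], [2, 3]], "f2": set([1, 2]), "f3": (1, 2)}
schema = {name: describe_type(field) for name, field in row.items()}
```

In this sketch list, set, tuple, and `array` all collapse to the same sequence label while dict becomes a map; that grouping is a simplification chosen here and may differ from what the PR actually does.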

Diffstat (limited to 'yarn/common/src/main')

0 files changed, 0 insertions, 0 deletions

