path: root/docs/running-on-yarn.md
author	Kan Zhang <kzhang@apache.org>	2014-06-16 11:11:29 -0700
committer	Reynold Xin <rxin@apache.org>	2014-06-16 11:11:29 -0700
commit	4fdb491775bb9c4afa40477dc0069ff6fcadfe25 (patch)
tree	c996de6ecbf6f913b3e7bc8a45f7801aa58266ac /docs/running-on-yarn.md
parent	716c88aa147762f7f617adf34a17edd681d9a4ff (diff)
[SPARK-2010] Support for nested data in PySpark SQL
JIRA issue: https://issues.apache.org/jira/browse/SPARK-2010

This PR adds support for nested collection types in PySpark SQL, including array, dict, list, set, and tuple. Example:

```
>>> from array import array
>>> from pyspark.sql import SQLContext
>>> sqlCtx = SQLContext(sc)
>>> rdd = sc.parallelize([
...     {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...     {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...     {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]
True
>>> rdd = sc.parallelize([
...     {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...     {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == \
...     [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...     {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]
True
```

Author: Kan Zhang <kzhang@apache.org>

Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits:

1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO
504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL
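For context, a SchemaRDD with an inferred nested schema can also be queried through SQL rather than only collected. The following is a minimal sketch, not part of this commit, assuming the Spark 1.0-era PySpark SQL API (SQLContext.inferSchema, SchemaRDD.registerAsTable, SQLContext.sql); the table name people and the field names are hypothetical:

```
>>> from pyspark.sql import SQLContext
>>> sqlCtx = SQLContext(sc)
>>> rdd = sc.parallelize([
...     {"name": "a", "scores": [1, 2], "attrs": {"age": 1.0}},
...     {"name": "b", "scores": [3, 4], "attrs": {"age": 2.0}}])
>>> srdd = sqlCtx.inferSchema(rdd)   # nested list and dict values are inferred as collection types
>>> srdd.registerAsTable("people")   # hypothetical table name
>>> rows = sqlCtx.sql("SELECT name, scores FROM people").collect()
>>> sorted(r["name"] for r in rows)
['a', 'b']
>>> [r["scores"][0] for r in sorted(rows, key=lambda r: r["name"])]
[1, 3]
```

Per this change, Python list, tuple, and array values are inferred as Spark SQL array types and dict values as map types, so nested fields round-trip through collect() as ordinary Python collections.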
Diffstat (limited to 'docs/running-on-yarn.md')
0 files changed, 0 insertions, 0 deletions