author | Kan Zhang <kzhang@apache.org> | 2014-06-16 11:11:29 -0700 |
---|---|---|
committer | Reynold Xin <rxin@apache.org> | 2014-06-16 11:11:29 -0700 |
commit | 4fdb491775bb9c4afa40477dc0069ff6fcadfe25 (patch) | |
tree | c996de6ecbf6f913b3e7bc8a45f7801aa58266ac /yarn/common/src/main | |
parent | 716c88aa147762f7f617adf34a17edd681d9a4ff (diff) | |
[SPARK-2010] Support for nested data in PySpark SQL
JIRA issue: https://issues.apache.org/jira/browse/SPARK-2010
This PR adds support for nested collection types in PySpark SQL, including
array, dict, list, set, and tuple. For example:
```
>>> from array import array
>>> from pyspark.sql import SQLContext
>>> sqlCtx = SQLContext(sc)
>>> rdd = sc.parallelize([
... {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
... {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
... {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]
True
>>> rdd = sc.parallelize([
... {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
... {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == \
... [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
... {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]
True
```
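To illustrate what handling nested collections involves, here is a minimal standalone sketch that walks a row and describes the type of each field recursively. It is not the code in this PR: `describe_value` and `describe_row` are hypothetical names, the type labels are made up for illustration, and treating list, set, tuple, and array alike as a generic "array" is a simplifying assumption. It also assumes every collection is non-empty and homogeneous, so one element is enough to describe the rest.

```python
from array import array


def describe_value(value):
    """Return a nested description of a Python value's type (hypothetical helper)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, (dict, list, tuple, set, array)):
        if len(value) == 0:
            # Assumption of this sketch: an empty collection cannot be described.
            raise ValueError("cannot describe an empty collection")
        if isinstance(value, dict):
            # Assumption: keys and values are homogeneous, so peek at one pair.
            key, val = next(iter(value.items()))
            return ("map", describe_value(key), describe_value(val))
        # Assumption: elements are homogeneous, so peek at one element.
        return ("array", describe_value(next(iter(value))))
    raise TypeError("unsupported type: %s" % type(value).__name__)


def describe_row(row):
    """Describe every field of a dict-shaped row (hypothetical helper)."""
    return {name: describe_value(field) for name, field in row.items()}
```

Calling `describe_row` on the first row of the example above, `{"f1": array('i', [1, 2]), "f2": {"row1": 1.0}}`, describes `f1` as an array of integers and `f2` as a map from string to double. The recursion is what lets a list of lists such as `[[1, 2], [2, 3]]` come out as an array of arrays.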
Author: Kan Zhang <kzhang@apache.org>
Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits:
1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO
504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL
Diffstat (limited to 'yarn/common/src/main')
0 files changed, 0 insertions, 0 deletions