author    Jason White <jason.white@shopify.com>    2015-11-02 10:49:06 -0800
committer Davies Liu <davies.liu@gmail.com>        2015-11-02 10:49:06 -0800
commit    f92f334ca47c03b980b06cf300aa652d0ffa1880 (patch)
tree      f692cc7d9ddae0329ab0b7b8201e0e4a3a9b17c9 /python
parent    71d1c907dec446db566b19f912159fd8f46deb7d (diff)
[SPARK-11437] [PYSPARK] Don't .take when converting RDD to DataFrame with provided schema
When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.
Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.
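The trade-off described above can be sketched without PySpark. The following is a minimal, self-contained illustration of the two strategies; `verify_row`, the tuple-based schema, and the sample rows are hypothetical stand-ins for PySpark's `_verify_type` and an RDD, not the actual Spark implementation:

```python
def verify_row(row, schema):
    """Fail fast if a row does not match the (name, type) schema."""
    if len(row) != len(schema):
        raise TypeError("wrong arity: %r" % (row,))
    for value, (name, typ) in zip(row, schema):
        if not isinstance(value, typ):
            raise TypeError("field %s: expected %s, got %r"
                            % (name, typ.__name__, value))

def eager_verify(rows, schema, n=10):
    # Old behaviour: materialize the first n rows up front (analogous to
    # rdd.take(10)), forcing evaluation even if the caller never runs an action.
    head = list(rows)[:n]
    for row in head:
        verify_row(row, schema)
    return head + list(rows)[n:] if not isinstance(rows, list) else rows

def lazy_verify(rows, schema):
    # Alternative: verify every row, but only as the data is actually
    # consumed, preserving laziness.
    for row in rows:
        verify_row(row, schema)
        yield row

schema = [("name", str), ("age", int)]
good = [("alice", 1), ("bob", 2)]
bad = [("alice", "not-an-int")]

# Merely constructing the lazy pipeline does no work and raises nothing:
pipeline = lazy_verify(iter(bad), schema)
# The mismatch only surfaces when the data is consumed, e.g. next(pipeline).
```

With the eager strategy, `eager_verify(iter(bad), schema)` raises immediately at construction time; with the lazy one, the same mismatch is reported only when the rows are pulled, which matches the "verify lazily on all rows" option mentioned above.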
marmbrus: We chatted about this at Spark Summit EU. davies: you made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606
Author: Jason White <jason.white@shopify.com>
Closes #9392 from JasonMWhite/createDataFrame_without_take.
Diffstat (limited to 'python')
-rw-r--r--  python/pyspark/sql/context.py  8
1 file changed, 1 insertion, 7 deletions
```diff
diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py
index 79453658a1..924bb6433d 100644
--- a/python/pyspark/sql/context.py
+++ b/python/pyspark/sql/context.py
@@ -318,13 +318,7 @@ class SQLContext(object):
                 struct.names[i] = name
             schema = struct
-        elif isinstance(schema, StructType):
-            # take the first few rows to verify schema
-            rows = rdd.take(10)
-            for row in rows:
-                _verify_type(row, schema)
-
-        else:
+        elif not isinstance(schema, StructType):
             raise TypeError("schema should be StructType or list or None, but got: %s"
                             % schema)
 
         # convert python objects to sql data
```