[SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling

This patch infers the schema for an RDD whose first row contains empty values (None, [], {}). It examines the first 100 rows and merges their types into a single schema, merging the fields of StructType as well. If the schema still contains NullType, it shows a warning telling the user to retry with sampling. If a sampling ratio is given, the schema is inferred from all rows that remain after sampling.

Also adds samplingRatio for jsonFile() and jsonRDD().

Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #2716 from davies/infer and squashes the following commits:

e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
567dc60 [Davies Liu] update docs
9767b27 [Davies Liu] Merge branch 'master' into infer
e48d7fb [Davies Liu] fix tests
29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
540d1d5 [Davies Liu] merge fields for StructType
f93fd84 [Davies Liu] add more tests
3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
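
The inference strategy described in the commit message can be illustrated with a small, Spark-free sketch. This is not the code from the patch: the names `infer_type`, `merge_type`, `has_null`, and `infer_schema`, and the string/tuple/dict type encoding, are all hypothetical and chosen only for illustration. It assumes each row is a dict, treats dicts as structs, and represents an unknown type as `"null"`.

```python
import warnings
from itertools import islice

NULL = "null"  # stand-in for NullType (hypothetical encoding, not Spark's)


def infer_type(value):
    """Describe a Python value as a simplified type."""
    if value is None:
        return NULL
    if isinstance(value, dict):
        # Treat a dict as a struct: field name -> inferred field type.
        return {k: infer_type(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # An empty list has no element to inspect, so the element type is unknown.
        return ("array", infer_type(value[0]) if value else NULL)
    return type(value).__name__


def merge_type(a, b):
    """Merge two inferred types, letting a known type replace an unknown one."""
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        # Merge struct fields: a field missing on one side counts as unknown.
        return {k: merge_type(a.get(k, NULL), b.get(k, NULL))
                for k in sorted(a.keys() | b.keys())}
    if isinstance(a, tuple) and isinstance(b, tuple):
        return ("array", merge_type(a[1], b[1]))
    if a != b:
        raise TypeError("cannot merge %r with %r" % (a, b))
    return a


def has_null(t):
    """Return True if any part of the type is still unknown."""
    if t == NULL:
        return True
    if isinstance(t, dict):
        return any(has_null(v) for v in t.values())
    if isinstance(t, tuple):
        return has_null(t[1])
    return False


def infer_schema(rows, limit=100):
    """Merge the types of the first `limit` rows into one schema."""
    schema = NULL
    for row in islice(rows, limit):
        schema = merge_type(schema, infer_type(row))
    if has_null(schema):
        warnings.warn("some types could not be determined from the first "
                      "%d rows; try again with sampling" % limit)
    return schema


rows = [
    {"a": None, "b": [], "c": {}},                # first row has only empty values
    {"a": 1, "b": ["x"], "c": {"d": 2.0}},        # a later row fills in the types
]
print(infer_schema(rows))
```

Looking only at the first row would leave every field unknown; merging in the second row resolves all of them, including the nested struct field. Sampling, as the message describes it, would replace the `islice` over the first rows with a pass over every row kept by the sample.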
author: Davies Liu <davies.liu@gmail.com> 2014-11-03 13:17:09 -0800
committer: Michael Armbrust <michael@databricks.com> 2014-11-03 13:17:09 -0800
commit: 24544fbce05665ab4999a1fe5aac434d29cd912c (patch)
tree: f52fa3f5edc6e7bf544cbeaecee8740ea2449783 /sql
parent: 2b6e1ce6ee7b1ba8160bcbee97f5bbff5c46ca09 (diff)
download: spark-24544fbce05665ab4999a1fe5aac434d29cd912c.tar.gz
spark-24544fbce05665ab4999a1fe5aac434d29cd912c.tar.bz2
spark-24544fbce05665ab4999a1fe5aac434d29cd912c.zip
1 files changed, 1 insertions, 1 deletions
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala
index cc5015ad3c..e1b5992a36 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala
@@ -213,7 +213,7 @@ trait PrimitiveType extends DataType {
 }
 
 object PrimitiveType {
-  private val nonDecimals = Seq(DateType, TimestampType, BinaryType) ++ NativeType.all
+  private val nonDecimals = Seq(NullType, DateType, TimestampType, BinaryType) ++ NativeType.all
   private val nonDecimalNameToType = nonDecimals.map(t => t.typeName -> t).toMap
 
   /** Given the string representation of a type, return its DataType */