[SPARK-17765][SQL] Support for writing out user-defined type in ORC datasource - spark

diff options

author	hyukjinkwon <gurwls223@gmail.com>	2016-11-21 13:23:32 -0800
committer	Reynold Xin <rxin@databricks.com>	2016-11-21 13:23:32 -0800
commit	a2d464770cd183daa7d727bf377bde9c21e29e6a (patch)
tree	9268ffe990937be39519112f3fc21c2c70fef6cf /python/pyspark/sql/dataframe.py
parent	ddd02f50bb7458410d65427321efc75da5e65224 (diff)
download	spark-a2d464770cd183daa7d727bf377bde9c21e29e6a.tar.gz spark-a2d464770cd183daa7d727bf377bde9c21e29e6a.tar.bz2 spark-a2d464770cd183daa7d727bf377bde9c21e29e6a.zip

[SPARK-17765][SQL] Support for writing out user-defined type in ORC datasource

## What changes were proposed in this pull request? This PR adds the support for `UserDefinedType` when writing out instead of throwing `ClassCastException` in ORC data source. In more details, `OrcStruct` is being created based on string from`DataType.catalogString`. For user-defined type, it seems it returns `sqlType.simpleString` for `catalogString` by default[1]. However, during type-dispatching to match the output with the schema, it tries to cast to, for example, `StructType`[2]. So, running the codes below (`MyDenseVector` was borrowed[3]) : ``` scala val data = Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25)))) val udtDF = data.toDF("id", "vectors") udtDF.write.orc("/tmp/test.orc") ``` ends up throwing an exception as below: ``` java.lang.ClassCastException: org.apache.spark.sql.UDT$MyDenseVectorUDT cannot be cast to org.apache.spark.sql.types.ArrayType at org.apache.spark.sql.hive.HiveInspectors$class.wrapperFor(HiveInspectors.scala:381) at org.apache.spark.sql.hive.orc.OrcSerializer.wrapperFor(OrcFileFormat.scala:164) ... ``` So, this PR uses `UserDefinedType.sqlType` during finding the correct converter when writing out in ORC data source. [1]https://github.com/apache/spark/blob/dfdcab00c7b6200c22883baa3ebc5818be09556f/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L95 [2]https://github.com/apache/spark/blob/d2dc8c4a162834818190ffd82894522c524ca3e5/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L326 [3]https://github.com/apache/spark/blob/2bfed1a0c5be7d0718fd574a4dad90f4f6b44be7/sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala#L38-L70 ## How was this patch tested? Unit tests in `OrcQuerySuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15361 from HyukjinKwon/SPARK-17765.

Diffstat (limited to 'python/pyspark/sql/dataframe.py')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: