| author | Cheng Lian <lian@databricks.com> | 2016-07-21 17:15:07 +0800 |
|---|---|---|
| committer | Cheng Lian <lian@databricks.com> | 2016-07-21 17:15:07 +0800 |
| commit | 8674054d3402b400a4766fe1c9214001cebf2106 (patch) | |
| tree | 86b683ae274455314ffd4386c348310537aa1956 /sql/core/src/main/java | |
| parent | 864b764eafa57a1418b683ccf6899b01bab28fba (diff) | |
[SPARK-16632][SQL] Use Spark requested schema to guide vectorized Parquet reader initialization
## What changes were proposed in this pull request?
In `SpecificParquetRecordReaderBase`, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may carry inaccurate type information due to bugs in other systems (e.g. HIVE-14294).
On the other hand, we already set the real Spark requested schema in the Hadoop configuration in [`ParquetFileFormat`][1]. This PR simply reads that schema back and uses it in place of the converted one.
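The fix relies on a Spark `StructType` round-tripping through its string (JSON) representation, which is how the requested schema travels through the Hadoop configuration. Below is a minimal sketch of that round trip; the schema fields and the standalone class are made up for illustration, and it assumes the schema is serialized with `StructType.json()`. `StructType$.MODULE$` is simply how Java code reaches the Scala companion object used in the patch.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StructType$;

public class SchemaRoundTripSketch {
  public static void main(String[] args) {
    // A hypothetical requested schema; the real one comes from the query plan.
    StructType requested = new StructType()
        .add("id", DataTypes.IntegerType, false)
        .add("name", DataTypes.StringType, true);

    // Presumably the serialized form that ParquetFileFormat puts into the
    // Hadoop Configuration: the schema rendered as a JSON string.
    String schemaJson = requested.json();

    // The record reader parses the string back into a StructType, as the patch
    // does with StructType$.MODULE$.fromString(...).
    StructType restored = StructType$.MODULE$.fromString(schemaJson);

    // The round trip preserves the schema exactly.
    assert restored.equals(requested);
  }
}
```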
## How was this patch tested?
New test case added in `ParquetQuerySuite`.
[1]: https://github.com/apache/spark/blob/v2.0.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L292-L294
Author: Cheng Lian <lian@databricks.com>
Closes #14278 from liancheng/spark-16632-simpler-fix.
Diffstat (limited to 'sql/core/src/main/java')
-rw-r--r-- | sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java | 5 |
1 file changed, 4 insertions(+), 1 deletion(-)
```diff
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
index d823275d85..04752ec5fe 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
@@ -60,6 +60,7 @@ import org.apache.parquet.hadoop.util.ConfigurationUtil;
 import org.apache.parquet.schema.MessageType;
 import org.apache.parquet.schema.Types;
 import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.types.StructType$;

 /**
  * Base class for custom RecordReaders for Parquet that directly materialize to `T`.
@@ -136,7 +137,9 @@ public abstract class SpecificParquetRecordReaderBase<T> extends RecordReader<Vo
     ReadSupport.ReadContext readContext = readSupport.init(new InitContext(
         taskAttemptContext.getConfiguration(), toSetMultiMap(fileMetadata), fileSchema));
     this.requestedSchema = readContext.getRequestedSchema();
-    this.sparkSchema = new ParquetSchemaConverter(configuration).convert(requestedSchema);
+    String sparkRequestedSchemaString =
+        configuration.get(ParquetReadSupport$.MODULE$.SPARK_ROW_REQUESTED_SCHEMA());
+    this.sparkSchema = StructType$.MODULE$.fromString(sparkRequestedSchemaString);
     this.reader = new ParquetFileReader(configuration, file, blocks, requestedSchema.getColumns());
     for (BlockMetaData block : blocks) {
       this.totalRowCount += block.getRowCount();
```