author    Cheng Lian <lian@databricks.com>        2017-03-10 15:19:32 -0800
committer Wenchen Fan <wenchen@databricks.com>    2017-03-10 15:19:32 -0800
commit    ffee4f1cefb0dfd8d9145ee3be82c6f7b799870b
tree      5f264abe401434c141e99463115c7ad44d88e50d
parent    bc30351404d8bc610cbae65fdc12ca613e7735c6
[SPARK-19905][SQL] Bring back Dataset.inputFiles for Hive SerDe tables
## What changes were proposed in this pull request?
`Dataset.inputFiles` works by matching `FileRelation`s in the query plan. In Spark 2.1, Hive SerDe tables are represented by `MetastoreRelation`, which inherits from `FileRelation`. However, in Spark 2.2, Hive SerDe tables are now represented by `CatalogRelation`, which doesn't inherit from `FileRelation` anymore, due to the unification of Hive SerDe tables and data source tables. This change breaks `Dataset.inputFiles` for Hive SerDe tables.
This PR fixes the issue by explicitly matching `CatalogRelation`s that are Hive SerDe tables in `Dataset.inputFiles`. Note that `CatalogRelation` can't simply be made to inherit from `FileRelation`, since not all `CatalogRelation`s are file-based (e.g., JDBC data source tables).
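Conceptually, the fix adds one more arm to the pattern match that `Dataset.inputFiles` runs over the plan's leaf relations. The following is a simplified, self-contained sketch of that collect-and-match approach; the `Relation` hierarchy below is a hypothetical stand-in, not Spark's actual classes (`provider == "hive"` plays the role of `DDLUtils.isHiveTable`, and `location` mirrors `tableMeta.storage.locationUri`):

```scala
// Simplified stand-ins for Spark's relation types -- not the real classes.
sealed trait Relation
case class FileBased(inputFiles: Seq[String]) extends Relation
case class Catalog(provider: String, location: Option[String]) extends Relation

// Collect input paths: file-based relations report their files directly;
// Hive catalog relations contribute their storage location, if present;
// other catalog relations (e.g. JDBC) contribute nothing.
def collectInputFiles(leaves: Seq[Relation]): Array[String] =
  leaves.flatMap {
    case f: FileBased                        => f.inputFiles
    case c: Catalog if c.provider == "hive"  => c.location.toSeq
    case _                                   => Seq.empty
  }.distinct.toArray
```

With the extra Hive case, a plan containing a Hive SerDe table yields its storage location instead of an empty array, which is the behavior this patch restores.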
## How was this patch tested?
New test case added in `HiveDDLSuite`.
Author: Cheng Lian <lian@databricks.com>
Closes #17247 from liancheng/spark-19905-hive-table-input-files.
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala                     |  3 +
 sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala | 11 +
 2 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 0a4d3a93a0..520663f624 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -36,6 +36,7 @@ import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst._
 import org.apache.spark.sql.catalyst.analysis._
+import org.apache.spark.sql.catalyst.catalog.CatalogRelation
 import org.apache.spark.sql.catalyst.encoders._
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
@@ -2734,6 +2735,8 @@ class Dataset[T] private[sql](
         fsBasedRelation.inputFiles
       case fr: FileRelation =>
         fr.inputFiles
+      case r: CatalogRelation if DDLUtils.isHiveTable(r.tableMeta) =>
+        r.tableMeta.storage.locationUri.map(_.toString).toArray
     }.flatten
     files.toSet.toArray
   }
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
index 23aea24697..79ad156c55 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
@@ -1865,4 +1865,15 @@ class HiveDDLSuite
       }
     }
   }
+
+  test("SPARK-19905: Hive SerDe table input paths") {
+    withTable("spark_19905") {
+      withTempView("spark_19905_view") {
+        spark.range(10).createOrReplaceTempView("spark_19905_view")
+        sql("CREATE TABLE spark_19905 STORED AS RCFILE AS SELECT * FROM spark_19905_view")
+        assert(spark.table("spark_19905").inputFiles.nonEmpty)
+        assert(sql("SELECT input_file_name() FROM spark_19905").count() > 0)
+      }
+    }
+  }
 }