author    hyukjinkwon <gurwls223@gmail.com>  2016-12-22 10:00:20 -0800
committer Reynold Xin <rxin@databricks.com>  2016-12-22 10:00:20 -0800
commit    76622c661fcae81eb0352c61f54a2e9e21a4fb98 (patch)
tree      4d8d04af3668e155786ba51d201c5e3e701feda4 /sql/hive
parent    4186aba632eaee2cc2c1ba6906449375c98b6c5c (diff)
[SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in data sources implementing FileFormat
## What changes were proposed in this pull request?
This PR removes the duplicated file-path checks in the data sources implementing `FileFormat` and prevents the ORC data source from listing files twice.
https://github.com/apache/spark/pull/14585 fixed a problem with partition column names containing `_`, and that issue itself is resolved correctly. However, the data sources implementing `FileFormat` validate the paths redundantly. Judging from the comment `// TODO: Move filtering.` in `CSVFileFormat`, this per-source check does not need to be repeated.
Path filtering is already performed by `PartitioningAwareFileIndex.shouldFilterOut` and `PartitioningAwareFileIndex.isDataPath`, so `FileFormat.inferSchema` always receives leaf data files. For example, running the code below:
``` scala
spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
spark.read.parquet("/tmp/parquet")
```
passes the paths below, which contain no directories but only valid data files:
``` bash
/tmp/parquet/_locality_code=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_locality_code=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
/tmp/parquet/_locality_code=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet
...
```
to `FileFormat.inferSchema`.
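The filtering described above can be sketched in plain Scala. This is a simplified illustration, not Spark's exact predicate: `PathFilterSketch` is a hypothetical stand-in for `PartitioningAwareFileIndex.shouldFilterOut`, which additionally special-cases files such as `_metadata` and `_common_metadata`.

``` scala
// Simplified sketch of the kind of filtering PartitioningAwareFileIndex applies
// when listing files (hedged: the real Spark predicate differs in details).
// Names starting with "_" or "." are treated as hidden/metadata entries, except
// partition directories such as "_locality_code=0", which contain "=" and must
// survive so that partition discovery still works for column names with "_".
object PathFilterSketch {
  def shouldFilterOut(pathName: String): Boolean =
    (pathName.startsWith("_") && !pathName.contains("=")) || pathName.startsWith(".")

  def main(args: Array[String]): Unit = {
    assert(shouldFilterOut("_SUCCESS"))                      // job marker: filtered
    assert(shouldFilterOut(".part-tmp"))                     // hidden temp file: filtered
    assert(!shouldFilterOut("_locality_code=0"))             // partition directory: kept
    assert(!shouldFilterOut("part-r-00000.snappy.parquet"))  // data file: kept
  }
}
```

Because only names surviving this filter reach `FileFormat.inferSchema`, repeating the same validation inside each data source is redundant.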
## How was this patch tested?
A unit test was added in `HadoopFsRelationTest`, and related existing tests were run.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #14627 from HyukjinKwon/SPARK-16975.
Diffstat (limited to 'sql/hive')
 sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala | 17 +++++++++++++++++
 1 file changed, 17 insertions(+), 0 deletions(-)
``` diff
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala b/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala
index 224b2c6c6f..06566a9550 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala
@@ -877,6 +877,23 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
       }
     }
   }
+
+  test("SPARK-16975: Partitioned table with the column having '_' should be read correctly") {
+    withTempDir { dir =>
+      val childDir = new File(dir, dataSourceName).getCanonicalPath
+      val dataDf = spark.range(10).toDF()
+      val df = dataDf.withColumn("_col", $"id")
+      df.write.format(dataSourceName).partitionBy("_col").save(childDir)
+      val reader = spark.read.format(dataSourceName)
+
+      // This is needed for SimpleTextHadoopFsRelationSuite as SimpleTextSource needs schema.
+      if (dataSourceName == classOf[SimpleTextSource].getCanonicalName) {
+        reader.option("dataSchema", dataDf.schema.json)
+      }
+      val readBack = reader.load(childDir)
+      checkAnswer(df, readBack)
+    }
+  }
 }

 // This class is used to test SPARK-8578. We should not use any custom output committer when
```