author    Sunitha Kambhampati <skambha@us.ibm.com>    2016-03-22 20:47:57 +0800
committer Cheng Lian <lian@databricks.com>    2016-03-22 20:47:57 +0800
commit    0ce01635cc66ca5f9d8962235054335b16f7507e (patch)
tree      8789537071873053028d5b65aca0532cda754953 /sql/hive
parent    4e09a0d5ea50d1cfc936bc87cf3372b4a0aa7dc2 (diff)
[SPARK-13774][SQL] - Improve error message for non-existent paths and add tests
SPARK-13774: IllegalArgumentException: Can not create a Path from an empty string for an incorrect file path

**Overview:**
- If a non-existent path is given in this call
  ```
  scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
  ```
  it throws the following error:
  `java.lang.IllegalArgumentException: Can not create a Path from an empty string`
  ...
  It gets thrown from the inferSchema call in `org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`.
- The purpose of this JIRA is to throw a better error message.
- With the fix, you will now get a _Path does not exist_ error message.
  ```
  scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
  org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv;
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204)
    ...
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141)
    ... 49 elided
  ```

**Details**

_Changes include:_
- Check whether the path exists in `resolveRelation` in `DataSource`, and throw an `AnalysisException` with a message like "Path does not exist: $path".
- `AnalysisException` is thrown, consistent with the other exceptions thrown in `resolveRelation`.
- The glob path and the non-glob path are checked with minimal existence calls. If the glob path is empty, the glob pattern matched nothing and an error is thrown. In the non-glob case, it is only necessary to check whether the first element in the Seq is valid.

_Test modifications:_
- Three existing tests were updated to account for this error checking.
- SQLQuerySuite: test("run sql directly on files") needed its expected error message updated.
- Two tests in MetastoreDataSourcesSuite failed because they used a dummy path; they now use a temp dir so they can get past the new check and continue to exercise the code path they were meant to test.

_New tests:_
- Two tests are added to DataFrameSuite to validate that both glob and non-glob paths produce the new error message.

_Testing:_
- Unit tests were run with the fix.

**Notes/Questions to reviewers:**
- There is some code duplication between `resolveRelation` and `createSource` in DataSource.scala with respect to getting the paths. I have not made any changes to the `createSource` code path. Should we make the change there as well?
- From other JIRAs, I know there is restructuring going on in this area. I am not sure how that will affect these changes, but since this seemed like a starter issue, I looked into it. If we prefer not to add the overhead of the checks, or if there is a better place to do so, let me know.

I would appreciate your review. Thanks for your time and comments.

Author: Sunitha Kambhampati <skambha@us.ibm.com>

Closes #11775 from skambha/improve_errmsg.
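The diffstat below covers only the test-side change under sql/hive; the `DataSource.resolveRelation` change itself lives elsewhere in the commit. As a rough illustration of the check described above, here is a minimal Scala sketch. The object and helper name (`PathExistenceCheck.checkPathsExist`) and its parameters are hypothetical, and it uses plain Hadoop `FileSystem` calls rather than the exact code of the patch.

```scala
// A minimal sketch of the described check, not the actual DataSource.scala patch.
// AnalysisException's constructor is visible only inside the org.apache.spark.sql
// package tree, which is where the real change lives, so the sketch is placed there too.
package org.apache.spark.sql.execution.datasources

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

import org.apache.spark.sql.AnalysisException

object PathExistenceCheck {
  // Throws AnalysisException("Path does not exist: ...") for any user-supplied
  // path (glob or plain) that matches nothing on the file system.
  def checkPathsExist(paths: Seq[String], hadoopConf: Configuration): Unit = {
    paths.foreach { pathString =>
      val hdfsPath = new Path(pathString)
      val fs = hdfsPath.getFileSystem(hadoopConf)
      val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
      // globStatus expands glob patterns and returns null (or an empty array)
      // when nothing matches, so a single call covers both the glob case and
      // the plain-path case with one filesystem round trip per path.
      val matches = Option(fs.globStatus(qualified)).getOrElse(Array.empty[FileStatus])
      if (matches.isEmpty) {
        throw new AnalysisException(s"Path does not exist: $qualified")
      }
    }
  }
}
```

With a check along these lines in place, both a plain `load("missing-file.csv")` and a glob pattern that matches nothing fail fast with the `Path does not exist` AnalysisException shown in the example output above, which is what the two new DataFrameSuite tests described in the message are meant to validate.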
Diffstat (limited to 'sql/hive')
-rw-r--r--  sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala  97
1 file changed, 51 insertions, 46 deletions
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
index a80c35cd81..3f3d0692b7 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
@@ -693,23 +693,25 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
   test("SPARK-6024 wide schema support") {
     withSQLConf(SQLConf.SCHEMA_STRING_LENGTH_THRESHOLD.key -> "4000") {
       withTable("wide_schema") {
-        // We will need 80 splits for this schema if the threshold is 4000.
-        val schema = StructType((1 to 5000).map(i => StructField(s"c_$i", StringType, true)))
-
-        // Manually create a metastore data source table.
-        sessionState.catalog.createDataSourceTable(
-          tableIdent = TableIdentifier("wide_schema"),
-          userSpecifiedSchema = Some(schema),
-          partitionColumns = Array.empty[String],
-          bucketSpec = None,
-          provider = "json",
-          options = Map("path" -> "just a dummy path"),
-          isExternal = false)
-
-        invalidateTable("wide_schema")
-
-        val actualSchema = table("wide_schema").schema
-        assert(schema === actualSchema)
+        withTempDir( tempDir => {
+          // We will need 80 splits for this schema if the threshold is 4000.
+          val schema = StructType((1 to 5000).map(i => StructField(s"c_$i", StringType, true)))
+
+          // Manually create a metastore data source table.
+          sessionState.catalog.createDataSourceTable(
+            tableIdent = TableIdentifier("wide_schema"),
+            userSpecifiedSchema = Some(schema),
+            partitionColumns = Array.empty[String],
+            bucketSpec = None,
+            provider = "json",
+            options = Map("path" -> tempDir.getCanonicalPath),
+            isExternal = false)
+
+          invalidateTable("wide_schema")
+
+          val actualSchema = table("wide_schema").schema
+          assert(schema === actualSchema)
+        })
       }
     }
   }
@@ -899,35 +901,38 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
     sqlContext.sql("""drop database if exists testdb8156 CASCADE""")
   }

+
   test("skip hive metadata on table creation") {
-    val schema = StructType((1 to 5).map(i => StructField(s"c_$i", StringType)))
-
-    sessionState.catalog.createDataSourceTable(
-      tableIdent = TableIdentifier("not_skip_hive_metadata"),
-      userSpecifiedSchema = Some(schema),
-      partitionColumns = Array.empty[String],
-      bucketSpec = None,
-      provider = "parquet",
-      options = Map("path" -> "just a dummy path", "skipHiveMetadata" -> "false"),
-      isExternal = false)
-
-    // As a proxy for verifying that the table was stored in Hive compatible format, we verify that
-    // each column of the table is of native type StringType.
-    assert(sessionState.catalog.client.getTable("default", "not_skip_hive_metadata").schema
-      .forall(column => HiveMetastoreTypes.toDataType(column.dataType) == StringType))
-
-    sessionState.catalog.createDataSourceTable(
-      tableIdent = TableIdentifier("skip_hive_metadata"),
-      userSpecifiedSchema = Some(schema),
-      partitionColumns = Array.empty[String],
-      bucketSpec = None,
-      provider = "parquet",
-      options = Map("path" -> "just a dummy path", "skipHiveMetadata" -> "true"),
-      isExternal = false)
-
-    // As a proxy for verifying that the table was stored in SparkSQL format, we verify that
-    // the table has a column type as array of StringType.
-    assert(sessionState.catalog.client.getTable("default", "skip_hive_metadata").schema
-      .forall(column => HiveMetastoreTypes.toDataType(column.dataType) == ArrayType(StringType)))
+    withTempDir(tempPath => {
+      val schema = StructType((1 to 5).map(i => StructField(s"c_$i", StringType)))
+
+      sessionState.catalog.createDataSourceTable(
+        tableIdent = TableIdentifier("not_skip_hive_metadata"),
+        userSpecifiedSchema = Some(schema),
+        partitionColumns = Array.empty[String],
+        bucketSpec = None,
+        provider = "parquet",
+        options = Map("path" -> tempPath.getCanonicalPath, "skipHiveMetadata" -> "false"),
+        isExternal = false)
+
+      // As a proxy for verifying that the table was stored in Hive compatible format,
+      // we verify that each column of the table is of native type StringType.
+      assert(sessionState.catalog.client.getTable("default", "not_skip_hive_metadata").schema
+        .forall(column => HiveMetastoreTypes.toDataType(column.dataType) == StringType))
+
+      sessionState.catalog.createDataSourceTable(
+        tableIdent = TableIdentifier("skip_hive_metadata"),
+        userSpecifiedSchema = Some(schema),
+        partitionColumns = Array.empty[String],
+        bucketSpec = None,
+        provider = "parquet",
+        options = Map("path" -> tempPath.getCanonicalPath, "skipHiveMetadata" -> "true"),
+        isExternal = false)
+
+      // As a proxy for verifying that the table was stored in SparkSQL format, we verify that
+      // the table has a column type as array of StringType.
+      assert(sessionState.catalog.client.getTable("default", "skip_hive_metadata").schema
+        .forall(column => HiveMetastoreTypes.toDataType(column.dataType) == ArrayType(StringType)))
+    })
   }
 }