path: root/sql/hive
Commit message | Author | Age | Files | Lines
* [MINOR][MLLIB][STREAMING][SQL] Fix typos | lfzCarlosC | 2016-05-25 | 1 | -1/+1
| | | | | | | | | | Fixed typos in the source code of the [mllib], [streaming] and [SQL] components. None and obvious. Author: lfzCarlosC <lfz.carlos@gmail.com> Closes #13298 from lfzCarlosC/master.
* [SPARK-15436][SQL] Remove DescribeFunction and ShowFunctions | Reynold Xin | 2016-05-25 | 1 | -3/+3
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly. ## How was this patch tested? Created a new SparkSqlParserSuite. Author: Reynold Xin <rxin@databricks.com> Closes #13292 from rxin/SPARK-15436.
* [SPARK-15365][SQL] When table size statistics are not available from ↵ | Parth Brahmbhatt | 2016-05-24 | 2 | -9/+74
| | | | | | | | | | | | | | metastore, we should fallback to HDFS ## What changes were proposed in this pull request? Currently, if a table is used in a join operation, we rely on the size returned by the metastore to decide whether the operation can be converted to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS if the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of broadcast joins. ## How was this patch tested? I have executed queries locally to test. Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com> Closes #13150 from Parth-Brahmbhatt/SPARK-15365.
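For illustration, a minimal sketch of the fallback idea (the helper `estimateTableSizeInBytes` and the paths are hypothetical, not Spark's internal API): prefer the metastore statistic, otherwise sum the table's file sizes on HDFS and compare against the broadcast threshold.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// estimateTableSizeInBytes is a hypothetical helper, not Spark's internal API.
def estimateTableSizeInBytes(
    metastoreSizeInBytes: Option[Long],
    tableLocation: String,
    hadoopConf: Configuration): Long = {
  metastoreSizeInBytes.getOrElse {
    val path = new Path(tableLocation)
    val fs = path.getFileSystem(hadoopConf)
    fs.getContentSummary(path).getLength  // total size of all files under the table location
  }
}

// A table below spark.sql.autoBroadcastJoinThreshold (10 MB by default) is a
// candidate for a broadcast join.
val threshold = 10L * 1024 * 1024
val canBroadcast =
  estimateTableSizeInBytes(None, "/user/hive/warehouse/t", new Configuration()) <= threshold
```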
* [SPARK-15388][SQL] Fix spark sql CREATE FUNCTION with hive 1.2.1 | wangyang | 2016-05-24 | 2 | -2/+13
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? spark.sql("CREATE FUNCTION myfunc AS 'com.haizhi.bdp.udf.UDFGetGeoCode'") throws "org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:NoSuchObjectException(message:Function default.myfunc does not exist))" with hive 1.2.1. I think it is introduced by pr #12853. Fixing it by catching Exception (not NoSuchObjectException) and string matching. ## How was this patch tested? added a unit test and also tested it manually Author: wangyang <wangyang@haizhi.com> Closes #13177 from wangyang1992/fixCreateFunc2.
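A rough sketch of the workaround described above (the helper is hypothetical and the matched message text is an assumption): catch a generic `Exception` from the Hive client and fall back to inspecting the message, since Hive 1.2.1 does not surface a bare `NoSuchObjectException` here.

```scala
// functionExistsSketch is hypothetical; the matched message text is an assumption.
def functionExistsSketch(lookupFunction: => Unit): Boolean =
  try {
    lookupFunction  // e.g. a metastore getFunction call through the Hive client
    true
  } catch {
    case e: Exception if e.getMessage != null &&
      e.getMessage.contains("does not exist") => false
  }
```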
* [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively | Xin Wu | 2016-05-23 | 1 | -0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently command `ADD FILE|JAR <filepath | jarpath>` is supported natively in SparkSQL. However, when this command is run, the file/jar is added to the resources that can not be looked up by `LIST FILE(s)|JAR(s)` command because the `LIST` command is passed to Hive command processor in Spark-SQL or simply not supported in Spark-shell. There is no way users can find out what files/jars are added to the spark context. Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli) This PR is to support following commands: `LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])` ### For example: ##### LIST FILE(s) ``` scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt") res2: org.apache.spark.sql.DataFrame = [] scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false) +----------------------------------------------+ |result | +----------------------------------------------+ |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt| +----------------------------------------------+ scala> spark.sql("list files").show(false) +----------------------------------------------+ |result | +----------------------------------------------+ |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt| |hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt | +----------------------------------------------+ ``` ##### LIST JAR(s) ``` scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar") res9: org.apache.spark.sql.DataFrame = [result: int] scala> spark.sql("list jar TestUDTF.jar").show(false) +---------------------------------------------+ |result | +---------------------------------------------+ |spark://192.168.1.234:50131/jars/TestUDTF.jar| +---------------------------------------------+ scala> spark.sql("list jars").show(false) +---------------------------------------------+ |result | +---------------------------------------------+ |spark://192.168.1.234:50131/jars/TestUDTF.jar| +---------------------------------------------+ ``` ## How was this patch tested? New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path. Author: Xin Wu <xinwu@us.ibm.com> Author: xin Wu <xinwu@us.ibm.com> Closes #13212 from xwu0226/list_command.
* [SPARK-15279][SQL] Catch conflicting SerDe when creating table | Andrew Or | 2016-05-23 | 1 | -3/+1
| | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The user may do something like: ``` CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde' CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde' ``` None of these should be allowed because the SerDe's conflict. As of this patch: - `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE` - `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE` ## How was this patch tested? New tests in `DDLCommandSuite`. Author: Andrew Or <andrew@databricks.com> Closes #13068 from andrewor14/row-format-conflict.
* [SPARK-15425][SQL] Disallow cross joins by default | Sameer Agarwal | 2016-05-22 | 2 | -0/+10
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (set to `false` by default) that if not set, results in a `SparkException` if the query contains one or more cartesian products. ## How was this patch tested? Added a test to verify the new behavior in `JoinSuite`. Additionally, `SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable cartesian products. Author: Sameer Agarwal <sameer@databricks.com> Closes #13209 from sameeragarwal/disallow-cartesian.
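A minimal example of the new behavior, assuming a local SparkSession (view names and data are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cross-join-demo").getOrCreate()
import spark.implicits._

Seq(1, 2, 3).toDF("a").createOrReplaceTempView("l")
Seq("x", "y").toDF("b").createOrReplaceTempView("r")

// With the default spark.sql.crossJoin.enabled=false, a query that produces a
// cartesian product now fails instead of silently running:
// spark.sql("SELECT * FROM l JOIN r").show()

// Explicitly opting in allows the cartesian product:
spark.conf.set("spark.sql.crossJoin.enabled", "true")
spark.sql("SELECT * FROM l JOIN r").show()
```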
* [SPARK-15206][SQL] add testcases for distinct aggregate in having clause | xin Wu | 2016-05-21 | 1 | -0/+31
| | | | | | | | | | ## What changes were proposed in this pull request? Add new test cases for including distinct aggregate in having clause in 2.0 branch. This is a followup PR for [#12974](https://github.com/apache/spark/pull/12974), which is for 1.6 branch. Author: xin Wu <xinwu@us.ibm.com> Closes #12984 from xwu0226/SPARK-15206.
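A minimal example of the query shape these test cases cover, assuming a spark-shell style session with `spark` in scope (data is illustrative):

```scala
import spark.implicits._

Seq(("a", 1), ("a", 2), ("b", 1)).toDF("key", "value").createOrReplaceTempView("t")

// A DISTINCT aggregate appears in the HAVING clause; only groups with more than
// one distinct value (here, key 'a') are returned.
spark.sql(
  """SELECT key, COUNT(value)
    |FROM t
    |GROUP BY key
    |HAVING COUNT(DISTINCT value) > 1
  """.stripMargin).show()
```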
* [SPARK-15280][Input/Output] Refactored OrcOutputWriter and moved ↵ | Ergin Seyfe | 2016-05-21 | 1 | -39/+45
| | | | | | | | | | | | | | serialization to a new class. ## What changes were proposed in this pull request? Refactoring: Separated ORC serialization logic from OrcOutputWriter and moved to a new class called OrcSerializer. ## How was this patch tested? Manual tests & existing tests. Author: Ergin Seyfe <eseyfe@fb.com> Closes #13066 from seyfe/orc_serializer.
* [SPARK-15424][SPARK-15437][SPARK-14807][SQL] Revert Create a ↵ | Reynold Xin | 2016-05-20 | 2 | -0/+174
| | | | | | | | | | | | | | hivecontext-compatibility module ## What changes were proposed in this pull request? I initially asked to create a hivecontext-compatibility module to put the HiveContext there. But we are so close to Spark 2.0 release and there is only a single class in it. It seems overkill to have an entire package, which makes it more inconvenient, for a single class. ## How was this patch tested? Tests were moved. Author: Reynold Xin <rxin@databricks.com> Closes #13207 from rxin/SPARK-15424.
* [SPARK-10216][SQL] Revert "[] Avoid creating empty files during overwrit… | Michael Armbrust | 2016-05-20 | 3 | -69/+18
| | | | | | | | This reverts commit 8d05a7a from #12855, which seems to have caused regressions when working with empty DataFrames. Author: Michael Armbrust <michael@databricks.com> Closes #13181 from marmbrus/revert12855.
* [SPARK-15435][SQL] Append Command to all commands | Reynold Xin | 2016-05-20 | 5 | -21/+21
| | | | | | | | | | | | ## What changes were proposed in this pull request? We started this convention to append Command suffix to all SQL commands. However, not all commands follow that convention. This patch adds Command suffix to all RunnableCommands. ## How was this patch tested? Updated test cases to reflect the renames. Author: Reynold Xin <rxin@databricks.com> Closes #13215 from rxin/SPARK-15435.
* [SPARK-15367][SQL] Add refreshTable back | gatorsmile | 2016-05-20 | 2 | -4/+4
| | | | | | | | | | | | | | #### What changes were proposed in this pull request? `refreshTable` was a method in `HiveContext`. It was deleted accidentally while we were migrating the APIs. This PR is to add it back to `HiveContext`. In addition, in `SparkSession`, we put it under the catalog namespace (`SparkSession.catalog.refreshTable`). #### How was this patch tested? Changed the existing test cases to use the function `refreshTable`. Also added a test case for refreshTable in `hivecontext-compatibility` Author: gatorsmile <gatorsmile@gmail.com> Closes #13156 from gatorsmile/refreshTable.
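Usage sketch, assuming a spark-shell style session (`spark` in scope) and an existing table named `my_table`:

```scala
// Invalidate and refresh the cached metadata (and data) of a table whose files
// were changed outside of Spark:
spark.catalog.refreshTable("my_table")

// The method is also restored on HiveContext for backward compatibility:
// hiveContext.refreshTable("my_table")
```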
* [SPARK-15335][SQL] Implement TRUNCATE TABLE Command | Lianhui Wang | 2016-05-19 | 1 | -0/+79
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? TRUNCATE TABLE is a command supported by Hive, and Spark SQL should support it as well. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-446 This PR implements such a command for truncating tables, excluding column truncation (HIVE-4005). ## How was this patch tested? Added a test case. Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #13170 from lianhuiwang/truncate.
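Usage sketch, assuming a spark-shell style session with Hive support enabled (table and partition names are illustrative):

```scala
spark.sql("CREATE TABLE IF NOT EXISTS logs (msg STRING) PARTITIONED BY (ds STRING)")

// Remove the data of a single partition:
spark.sql("TRUNCATE TABLE logs PARTITION (ds='2016-05-19')")

// Remove all data but keep the table definition:
spark.sql("TRUNCATE TABLE logs")
```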
* [SPARK-14261][SQL] Memory leak in Spark Thrift Server | Oleg Danilov | 2016-05-19 | 1 | -0/+2
| | | | | | | | Fixed memory leak (HiveConf in the CommandProcessorFactory) Author: Oleg Danilov <oleg.danilov@wandisco.com> Closes #12932 from dosoft/SPARK-14261.
* [SPARK-15075][SPARK-15345][SQL] Clean up SparkSession builder and propagate ↵ | Reynold Xin | 2016-05-19 | 4 | -24/+24
| | | | | | | | | | | | | | | | config options to existing sessions if specified ## What changes were proposed in this pull request? Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that. This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession. ## How was this patch tested? Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches. Author: Reynold Xin <rxin@databricks.com> Closes #13200 from rxin/SPARK-15075.
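A small sketch of the propagation behavior described above (the config keys are just examples):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("builder-demo")
  .config("spark.sql.shuffle.partitions", "4")
  .getOrCreate()

// A later getOrCreate() returns the same session, and its config options are
// propagated to that existing session instead of being silently dropped.
val sameSession = SparkSession.builder()
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()

assert(spark eq sameSession)
assert(spark.conf.get("spark.sql.crossJoin.enabled") == "true")
```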
* [SPARK-14346][SQL] Lists unsupported Hive features in SHOW CREATE TABLE output | Cheng Lian | 2016-05-19 | 2 | -8/+19
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message. ## How was this patch tested? Updated existing test case to check exception message. Author: Cheng Lian <lian@databricks.com> Closes #13173 from liancheng/spark-14346-follow-up.
* [SPARK-15334][SQL][HOTFIX] Fixes compilation error for Scala 2.10 | Cheng Lian | 2016-05-18 | 1 | -1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes a Scala 2.10 compilation failure introduced in PR #13127. ## How was this patch tested? Jenkins build. Author: Cheng Lian <lian@databricks.com> Closes #13166 from liancheng/hotfix-for-scala-2.10.
* [SPARK-15334][SQL] HiveClient facade not compatible with Hive 0.12 | Sean Zhong | 2016-05-18 | 3 | -123/+545
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? HiveClient facade is not compatible with Hive 0.12. This PR Fixes the following compatibility issues: 1. `org.apache.spark.sql.hive.client.HiveClientImpl` use `AddPartitionDesc(db, table, ignoreIfExists)` to create partitions, however, Hive 0.12 doesn't have this constructor for `AddPartitionDesc`. 2. `HiveClientImpl` uses `PartitionDropOptions` when dropping partition, however, class `PartitionDropOptions` doesn't exist in Hive 0.12. 3. Hive 0.12 doesn't support adding permanent functions. It is not valid to call `org.apache.hadoop.hive.ql.metadata.Hive.createFunction`, `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction`, and `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction` 4. `org.apache.spark.sql.hive.client.VersionsSuite` doesn't have enough test coverage for different hive versions 0.12, 0.13, 0.14, 1.0.0, 1.1.0, 1.2.0. ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13127 from clockfly/versionSuite.
* [SPARK-15171][SQL] Remove the references to deprecated method ↵ | Sean Zhong | 2016-05-18 | 26 | -119/+120
| | | | | | | | | | | | | | | | | dataset.registerTempTable ## What changes were proposed in this pull request? Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`. ## How was this patch tested? This PR only changes the unit test code, examples, and comments. It should be safe. This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged. Author: Sean Zhong <seanzhong@databricks.com> Closes #13098 from clockfly/spark-15171-remove-deprecation.
* [SPARK-14346][SQL] Native SHOW CREATE TABLE for Hive tables/views | Cheng Lian | 2016-05-17 | 2 | -9/+186
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g., skewed columns and storage handler, etc.. ## How was this patch tested? New test cases are added in `ShowCreateTableSuite` to do round-trip tests. Author: Cheng Lian <lian@databricks.com> Closes #13079 from liancheng/spark-14346-show-create-table-for-hive-tables.
* [SPARK-10216][SQL] Avoid creating empty files during overwriting with group ↵ | hyukjinkwon | 2016-05-17 | 3 | -18/+69
| | | | | | | | | | | | | | | | | | | | | | | by query ## What changes were proposed in this pull request? Currently, `INSERT INTO` with `GROUP BY` query tries to make at least 200 files (default value of `spark.sql.shuffle.partition`), which results in lots of empty files. This PR makes it avoid creating empty files during overwriting into Hive table and in internal data sources with group by query. This checks whether the given partition has data in it or not and creates/writes file only when it actually has data. ## How was this patch tested? Unittests in `InsertIntoHiveTableSuite` and `HadoopFsRelationTest`. Closes #8411 Author: hyukjinkwon <gurwls223@gmail.com> Author: Keuntae Park <sirpkt@apache.org> Closes #12855 from HyukjinKwon/pr/8411.
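A simplified sketch of the core idea, not Spark's actual writer code: the output file is only created when the task's partition iterator actually has rows.

```scala
// RowWriter and writePartition are illustrative stand-ins for the real writer classes.
trait RowWriter {
  def write(row: String): Unit
  def close(): Unit
}

def writePartition(rows: Iterator[String], newWriter: () => RowWriter): Unit = {
  if (rows.hasNext) {          // an empty partition never creates a file
    val writer = newWriter()   // the output file is only created here, lazily
    try rows.foreach(writer.write)
    finally writer.close()
  }
}
```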
* [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into ↵ | Sean Owen | 2016-05-17 | 1 | -1/+1
| | | | | | | | | | | | | | | | | | spark-tags ## What changes were proposed in this pull request? (See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.) Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags` ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #13074 from srowen/SPARK-15290.
* [SPARK-15253][SQL] Support old table schema config key ↵ | Sean Zhong | 2016-05-16 | 1 | -0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | "spark.sql.sources.schema" for DESCRIBE TABLE ## What changes were proposed in this pull request? "DESCRIBE table" is broken when table schema is stored at key "spark.sql.sources.schema". Originally, we used spark.sql.sources.schema to store the schema of a data source table. After SPARK-6024, we removed this flag. Although we are not using spark.sql.sources.schema any more, we need to still support it. ## How was this patch tested? Unit test. When using spark2.0 to load a table generated by spark 1.2. Before change: `DESCRIBE table` => Schema of this table is inferred at runtime,, After change: `DESCRIBE table` => correct output. Author: Sean Zhong <seanzhong@databricks.com> Closes #13073 from clockfly/spark-15253.
* [SPARK-15267][SQL] Refactor options for JDBC and ORC data sources and change ↵ | hyukjinkwon | 2016-05-13 | 4 | -43/+81
| | | | | | | | | | | | | | | | | | | | | | default compression for ORC ## What changes were proposed in this pull request? Currently, the Parquet, JSON and CSV data sources have a class for their options (`ParquetOptions`, `JSONOptions` and `CSVOptions`). Gathering options into a class makes them easier to manage. Currently, the `JDBC`, `Text`, `libsvm` and `ORC` data sources do not have such a class, and it would be nicer if their options were handled in the same unified format. This PR refactors the options in Spark's internal data sources by adding new classes: `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`. Also, this PR changes the default compression codec for ORC from `NONE` to `SNAPPY`. ## How was this patch tested? Existing tests should cover the refactoring, and unit tests in `OrcHadoopFsRelationSuite` cover the change of the default compression codec for ORC. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13048 from HyukjinKwon/SPARK-15267.
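A rough sketch of what such an options class looks like (the class name and map contents are illustrative, not the exact Spark internals): it wraps the user-supplied options and resolves defaults, such as the new SNAPPY default, in one place.

```scala
// Name and contents are illustrative; "compression" is the user-facing option key.
class OrcOptionsSketch(parameters: Map[String, String]) {
  private val shortCompressionNames =
    Map("none" -> "NONE", "snappy" -> "SNAPPY", "zlib" -> "ZLIB", "lzo" -> "LZO")

  // The default codec changes from NONE to SNAPPY with this patch.
  val compressionCodec: String = {
    val name = parameters.getOrElse("compression", "snappy").toLowerCase
    shortCompressionNames.getOrElse(
      name, throw new IllegalArgumentException(s"Codec [$name] is not supported."))
  }
}

val codec = new OrcOptionsSketch(Map.empty).compressionCodec  // "SNAPPY"
```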
* [SPARK-14541][SQL] Support IFNULL, NULLIF, NVL and NVL2 | Reynold Xin | 2016-05-12 | 1 | -1/+0
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch adds support for a few SQL functions to improve compatibility with other databases: IFNULL, NULLIF, NVL and NVL2. In order to do this, this patch introduced a RuntimeReplaceable expression trait that allows replacing an unevaluable expression in the optimizer before evaluation. Note that the semantics are not completely identical to other databases in esoteric cases. ## How was this patch tested? Added a new test suite SQLCompatibilityFunctionSuite. Closes #12373. Author: Reynold Xin <rxin@databricks.com> Closes #13084 from rxin/SPARK-14541.
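Quick reference for the new functions, assuming a spark-shell style session; the semantics follow the common database definitions:

```scala
spark.sql("SELECT IFNULL(NULL, 'fallback'), NVL(NULL, 'fallback')").show()
// -> fallback, fallback: both return the second argument when the first is NULL

spark.sql("SELECT NULLIF('a', 'a'), NULLIF('a', 'b')").show()
// -> NULL, a: NULLIF returns NULL when both arguments are equal

spark.sql("SELECT NVL2(NULL, 'x', 'y'), NVL2(1, 'x', 'y')").show()
// -> y, x: NVL2 returns the second argument when the first is not NULL, else the third
```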
* [SPARK-10605][SQL] Create native collect_list/collect_set aggregates | Herman van Hovell | 2016-05-12 | 2 | -27/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? We currently use the Hive implementations for the collect_list/collect_set aggregate functions. This has a few major drawbacks: the use of HiveUDAF (which has quite a bit of overhead) and the lack of support for struct datatypes. This PR adds native implementation of these functions to Spark. The size of the collected list/set may vary, this means we cannot use the fast, Tungsten, aggregation path to perform the aggregation, and that we fallback to the slower sort based path. Another big issue with these operators is that when the size of the collected list/set grows too large, we can start experiencing large GC pauzes and OOMEs. This `collect*` aggregates implemented in this PR rely on the sort based aggregate path for correctness. They maintain their own internal buffer which holds the rows for one group at a time. The sortbased aggregation path is triggered by disabling `partialAggregation` for these aggregates (which is kinda funny); this technique is also employed in `org.apache.spark.sql.hiveHiveUDAFFunction`. I have done some performance testing: ```scala import org.apache.spark.sql.{Dataset, Row} sql("create function collect_list2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList'") val df = range(0, 10000000).select($"id", (rand(213123L) * 100000).cast("int").as("grp")) df.select(countDistinct($"grp")).show def benchmark(name: String, plan: Dataset[Row], maxItr: Int = 5): Unit = { // Do not measure planning. plan1.queryExecution.executedPlan // Execute the plan a number of times and average the result. val start = System.nanoTime var i = 0 while (i < maxItr) { plan.rdd.foreach(row => Unit) i += 1 } val time = (System.nanoTime - start) / (maxItr * 1000000L) println(s"[$name] $maxItr iterations completed in an average time of $time ms.") } val plan1 = df.groupBy($"grp").agg(collect_list($"id")) val plan2 = df.groupBy($"grp").agg(callUDF("collect_list2", $"id")) benchmark("Spark collect_list", plan1) ... > [Spark collect_list] 5 iterations completed in an average time of 3371 ms. benchmark("Hive collect_list", plan2) ... > [Hive collect_list] 5 iterations completed in an average time of 9109 ms. ``` Performance is improved by a factor 2-3. ## How was this patch tested? Added tests to `DataFrameAggregateSuite`. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12874 from hvanhovell/implode.
* [SPARK-14684][SPARK-15277][SQL] Partition Spec Validation in SessionCatalog ↵ | gatorsmile | 2016-05-12 | 2 | -14/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | and Checking Partition Spec Existence Before Dropping #### What changes were proposed in this pull request? ~~Currently, multiple partitions are allowed to drop by using a single DDL command: Alter Table Drop Partition. However, the internal implementation could break atomicity. That means, we could just drop a subset of qualified partitions, if hitting an exception when dropping one of qualified partitions~~ ~~This PR contains the following behavior changes:~~ ~~- disallow dropping multiple partitions by a single command ~~ ~~- allow users to input predicates in partition specification and issue a nicer error message if the predicate's comparison operator is not `=`.~~ ~~- verify the partition spec in SessionCatalog. This can ensure each partition spec in `Drop Partition` does not correspond to multiple partitions.~~ This PR has two major parts: - Verify the partition spec in SessionCatalog for fixing the following issue: ```scala sql(s"ALTER TABLE $externalTab DROP PARTITION (ds='2008-04-09', unknownCol='12')") ``` Above example uses an invalid partition spec. Without this PR, we will drop all the partitions. The reason is Hive megastores getPartitions API returns all the partitions if we provide an invalid spec. - Re-implemented the `dropPartitions` in `HiveClientImpl`. Now, we always check if all the user-specified partition specs exist before attempting to drop the partitions. Previously, we start drop the partition before completing checking the existence of all the partition specs. If any failure happened after we start to drop the partitions, we will log an error message to indicate which partitions have been dropped and which partitions have not been dropped. #### How was this patch tested? Modified the existing test cases and added new test cases. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12801 from gatorsmile/banDropMultiPart.
* [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView | Sean Zhong | 2016-05-12 | 7 | -9/+9
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzhong@databricks.com> Closes #12945 from clockfly/spark-15171.
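Migration sketch, assuming a spark-shell style session (`spark` in scope):

```scala
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Deprecated by this patch:
// df.registerTempTable("people")

// Preferred replacements:
df.createTempView("people")            // fails if a view named "people" already exists
df.createOrReplaceTempView("people")   // overwrites an existing view of the same name
spark.sql("SELECT * FROM people").show()
```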
* [SPARK-15160][SQL] support data source table in InMemoryCatalog | Wenchen Fan | 2016-05-11 | 1 | -0/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds a new rule to convert `SimpleCatalogRelation` to data source table if its table property contains data source information. ## How was this patch tested? new test in SQLQuerySuite Author: Wenchen Fan <wenchen@databricks.com> Closes #12935 from cloud-fan/ds-table.
* [SPARK-14346] SHOW CREATE TABLE for data source tables | Cheng Lian | 2016-05-11 | 4 | -7/+172
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds native `SHOW CREATE TABLE` DDL command for data source tables. Support for Hive tables will be added in follow-up PR(s). To show table creation DDL for data source tables created by CTAS statements, this PR also added partitioning and bucketing support for normal `CREATE TABLE ... USING ...` syntax. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) A new test suite `ShowCreateTableSuite` is added in sql/hive package to test the new feature. Author: Cheng Lian <lian@databricks.com> Closes #12781 from liancheng/spark-14346-show-create-table.
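An illustrative example of the new DDL support, assuming a spark-shell style session (the table name and CTAS query are made up):

```scala
spark.sql(
  """CREATE TABLE users
    |USING parquet
    |PARTITIONED BY (country)
    |CLUSTERED BY (id) INTO 4 BUCKETS
    |AS SELECT 1L AS id, 'alice' AS name, 'US' AS country
  """.stripMargin)

// Reproduces the CREATE TABLE ... USING ... DDL, including the partitioning
// and bucketing clauses, for the data source table created above.
spark.sql("SHOW CREATE TABLE users").show(truncate = false)
```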
* [SPARK-15276][SQL] CREATE TABLE with LOCATION should imply EXTERNAL | Andrew Or | 2016-05-11 | 2 | -9/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Before: ```sql -- uses that location but issues a warning CREATE TABLE my_tab LOCATION /some/path -- deletes any existing data in the specified location DROP TABLE my_tab ``` After: ```sql -- uses that location but creates an EXTERNAL table instead CREATE TABLE my_tab LOCATION /some/path -- does not delete the data at /some/path DROP TABLE my_tab ``` This patch essentially makes the `EXTERNAL` field optional. This is related to #13032. ## How was this patch tested? New test in `DDLCommandSuite`. Author: Andrew Or <andrew@databricks.com> Closes #13060 from andrewor14/location-implies-external.
* [SPARK-15257][SQL] Require CREATE EXTERNAL TABLE to specify LOCATION | Andrew Or | 2016-05-11 | 5 | -12/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Before: ```sql -- uses warehouse dir anyway CREATE EXTERNAL TABLE my_tab -- doesn't actually delete the data DROP TABLE my_tab ``` After: ```sql -- no location is provided, throws exception CREATE EXTERNAL TABLE my_tab -- creates an external table using that location CREATE EXTERNAL TABLE my_tab LOCATION '/path/to/something' -- doesn't delete the data, which is expected DROP TABLE my_tab ``` ## How was this patch tested? New test in `DDLCommandSuite` Author: Andrew Or <andrew@databricks.com> Closes #13032 from andrewor14/create-external-table-location.
* [SPARK-15248][SQL] Make MetastoreFileCatalog consider directories from ↵ | Tathagata Das | 2016-05-11 | 2 | -8/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | partition specs of a partitioned metastore table Table partitions can be added with locations different from default warehouse location of a hive table. `CREATE TABLE parquetTable (a int) PARTITIONED BY (b int) STORED AS parquet ` `ALTER TABLE parquetTable ADD PARTITION (b=1) LOCATION '/partition'` Querying such a table throws error as the MetastoreFileCatalog does not list the added partition directory, it only lists the default base location. ``` [info] - SPARK-15248: explicitly added partitions should be readable *** FAILED *** (1 second, 8 milliseconds) [info] java.util.NoSuchElementException: key not found: file:/Users/tdas/Projects/Spark/spark2/target/tmp/spark-b39ad224-c5d1-4966-8981-fb45a2066d61/partition [info] at scala.collection.MapLike$class.default(MapLike.scala:228) [info] at scala.collection.AbstractMap.default(Map.scala:59) [info] at scala.collection.MapLike$class.apply(MapLike.scala:141) [info] at scala.collection.AbstractMap.apply(Map.scala:59) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:59) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:55) [info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) [info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) [info] at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) [info] at scala.collection.AbstractTraversable.map(Traversable.scala:104) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.listFiles(PartitioningAwareFileCatalog.scala:55) [info] at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:93) [info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59) [info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59) [info] at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) [info] at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) [info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60) [info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:55) [info] at org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:55) [info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59) [info] at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59) [info] at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) [info] at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) [info] at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60) [info] at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:77) [info] at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) [info] at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:82) [info] at 
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:82) [info] at org.apache.spark.sql.QueryTest.assertEmptyMissingInput(QueryTest.scala:330) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:146) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:159) [info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:554) [info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:535) [info] at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:125) [info] at org.apache.spark.sql.hive.ParquetPartitioningTest.withTempDir(parquetSuites.scala:726) [info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7.apply$mcV$sp(parquetSuites.scala:535) [info] at org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:166) [info] at org.apache.spark.sql.hive.ParquetPartitioningTest.withTable(parquetSuites.scala:726) [info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply$mcV$sp(parquetSuites.scala:534) [info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534) [info] at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534) ``` The solution in this PR to get the paths to list from the partition spec and not rely on the default table path alone. unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13022 from tdas/SPARK-15248.
* [SPARK-14933][HOTFIX] Replace `sqlContext` with `spark`. | Dongjoon Hyun | 2016-05-11 | 1 | -4/+4
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This fixes compile errors. ## How was this patch tested? Pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13053 from dongjoon-hyun/hotfix_sqlquerysuite.
* [SPARK-14933][SQL] Failed to create view out of a parquet or orc table | xin Wu | 2016-05-11 | 3 | -2/+62
| | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? #### Symptom If a table is created as a Parquet or ORC table with Hive-syntax DDL, such as ```SQL create table t1 (c1 int, c2 string) stored as parquet ``` the following command will fail: ```SQL create view v1 as select * from t1 ``` #### Root Cause Currently, `HiveMetaStoreCatalog` converts Parquet/ORC tables to `LogicalRelation` without giving any `tableIdentifier`. `SQLBuilder` expects the `LogicalRelation` to have an associated `tableIdentifier`. However, the `LogicalRelation` created earlier does not have such a `tableIdentifier`. Thus, `SQLBuilder.toSQL` cannot recognize this logical plan and issues an exception. This PR assigns a `TableIdentifier` to the `LogicalRelation` when resolving Parquet or ORC tables in `HiveMetaStoreCatalog`. ## How was this patch tested? Test cases created and dev/run-tests run. Author: xin Wu <xinwu@us.ibm.com> Closes #12716 from xwu0226/SPARK_14933.
* [SPARK-15231][SQL] Document the semantic of saveAsTable and insertInto and ↵ | Shixiong Zhu | 2016-05-10 | 1 | -0/+43
| | | | | | | | | | | | | | | | don't drop columns silently ## What changes were proposed in this pull request? This PR adds documents about the different behaviors between `insertInto` and `saveAsTable`, and throws an exception when the user try to add too man columns using `saveAsTable with append`. ## How was this patch tested? Unit tests added in this PR. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13013 from zsxwing/SPARK-15231.
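A small illustration of the documented difference, assuming a spark-shell style session: `insertInto` resolves columns by position against the existing table, while `saveAsTable` with append resolves them by name.

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._

Seq((1, "a")).toDF("id", "name").write.saveAsTable("target")

// insertInto: columns are matched by position against the table's column order.
Seq((2, "b")).toDF("x", "y").write.insertInto("target")

// saveAsTable with Append: columns are matched by name, and a mismatched number
// of columns now raises an error instead of silently dropping data.
Seq(("c", 3)).toDF("name", "id").write.mode(SaveMode.Append).saveAsTable("target")
```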
* [SPARK-15249][SQL] Use FunctionResource instead of (String, String) in ↵ | Sandeep Singh | 2016-05-10 | 2 | -5/+6
| | | | | | | | | | | | | | | CreateFunction and CatalogFunction for resource Use FunctionResource instead of (String, String) in CreateFunction and CatalogFunction for resource see: TODO's here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L36 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala#L42 Existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13024 from techaddict/SPARK-15249.
* [SPARK-15037][HOTFIX] Replace `sqlContext` and `sparkSession` with `spark`. | Dongjoon Hyun | 2016-05-10 | 1 | -2/+2
| | | | | | | | | | This replaces `sparkSession` with `spark` in CatalogSuite.scala. Pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13030 from dongjoon-hyun/hotfix_sparkSession.
* [SPARK-14603][SQL] Verification of Metadata Operations by Session Catalog | gatorsmile | 2016-05-10 | 2 | -17/+19
| | | | | | | | | | | | | | | | | | Since we cannot really trust if the underlying external catalog can throw exceptions when there is an invalid metadata operation, let's do it in SessionCatalog. - [X] The first step is to unify the error messages issued in Hive-specific Session Catalog and general Session Catalog. - [X] The second step is to verify the inputs of metadata operations for partitioning-related operations. This is moved to a separate PR: https://github.com/apache/spark/pull/12801 - [X] The third step is to add database existence verification in `SessionCatalog` - [X] The fourth step is to add table existence verification in `SessionCatalog` - [X] The fifth step is to add function existence verification in `SessionCatalog` Add test cases and verify the error messages we issued Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12385 from gatorsmile/verifySessionAPIs.
* [SPARK-15037][SQL][MLLIB] Use SparkSession instead of SQLContext in ↵ | Sandeep Singh | 2016-05-10 | 25 | -249/+249
| | | | | | | | | | | | | | | Scala/Java TestSuites ## What changes were proposed in this pull request? Use SparkSession instead of SQLContext in Scala/Java TestSuites as this PR already very big working Python TestSuites in a diff PR. ## How was this patch tested? Existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #12907 from techaddict/SPARK-15037.
* [SPARK-14773] [SPARK-15179] [SQL] Fix SQL building and enable Hive tests | Herman van Hovell | 2016-05-10 | 60 | -0/+2171
| | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes SQL building for predicate subqueries and correlated scalar subqueries. It also enables most Hive subquery tests. ## How was this patch tested? Enabled new tests in HiveComparisionSuite. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12988 from hvanhovell/SPARK-14773.
* [SPARK-14127][SQL] "DESC <table>": Extracts schema information from table ↵Cheng Lian2016-05-101-1/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | properties for data source tables ## What changes were proposed in this pull request? This is a follow-up of #12934 and #12844. This PR adds a set of utility methods in `DDLUtils` to help extract schema information (user-defined schema, partition columns, and bucketing information) from data source table properties. These utility methods are then used in `DescribeTableCommand` to refine output for data source tables. Before this PR, the aforementioned schema information are only shown as table properties, which are hard to read. Sample output: ``` +----------------------------+---------------------------------------------------------+-------+ |col_name |data_type |comment| +----------------------------+---------------------------------------------------------+-------+ |a |bigint | | |b |bigint | | |c |bigint | | |d |bigint | | |# Partition Information | | | |# col_name | | | |d | | | | | | | |# Detailed Table Information| | | |Database: |default | | |Owner: |lian | | |Create Time: |Tue May 10 03:20:34 PDT 2016 | | |Last Access Time: |Wed Dec 31 16:00:00 PST 1969 | | |Location: |file:/Users/lian/local/src/spark/workspace-a/target/... | | |Table Type: |MANAGED | | |Table Parameters: | | | | rawDataSize |-1 | | | numFiles |1 | | | transient_lastDdlTime |1462875634 | | | totalSize |684 | | | spark.sql.sources.provider|parquet | | | EXTERNAL |FALSE | | | COLUMN_STATS_ACCURATE |false | | | numRows |-1 | | | | | | |# Storage Information | | | |SerDe Library: |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | | |InputFormat: |org.apache.hadoop.mapred.SequenceFileInputFormat | | |OutputFormat: |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat| | |Compressed: |No | | |Num Buckets: |2 | | |Bucket Columns: |[b] | | |Sort Columns: |[c] | | |Storage Desc Parameters: | | | | path |file:/Users/lian/local/src/spark/workspace-a/target/... | | | serialization.format |1 | | +----------------------------+---------------------------------------------------------+-------+ ``` ## How was this patch tested? Test cases are added in `HiveDDLSuite` to check command output. Author: Cheng Lian <lian@databricks.com> Closes #13025 from liancheng/spark-14127-extract-schema-info.
* [SPARK-15215][SQL] Fix Explain Parsing and Output | gatorsmile | 2016-05-10 | 1 | -15/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? This PR is to address a few existing issues in `EXPLAIN`: - The `EXPLAIN` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not be 0 or more match. It should 0 or one match. Parser does not allow users to use more than one option in a single command. - The option `LOGICAL` is not supported. Issue an exception when users specify this option in the command. - The output of `EXPLAIN ` contains a weird empty line when the output of analyzed plan is empty. We should remove it. For example: ``` == Parsed Logical Plan == CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false == Analyzed Logical Plan == CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false == Optimized Logical Plan == CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io. HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false ... ``` #### How was this patch tested? Added and modified a few test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #12991 from gatorsmile/explainCreateTable.
* [SPARK-15187][SQL] Disallow Dropping Default Database | gatorsmile | 2016-05-10 | 2 | -10/+24
| | | | | | | | | | | | | | #### What changes were proposed in this pull request? In Hive Metastore, dropping default database is not allowed. However, in `InMemoryCatalog`, this is allowed. This PR is to disallow users to drop default database. #### How was this patch tested? Previously, we already have a test case in HiveDDLSuite. Now, we also add the same one in DDLSuite Author: gatorsmile <gatorsmile@gmail.com> Closes #12962 from gatorsmile/dropDefaultDB.
* [SPARK-15025][SQL] fix duplicate of PATH key in datasource table options | xin Wu | 2016-05-09 | 1 | -3/+26
| | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The issue is that when the user provides the path option with uppercase "PATH" key, `options` contains `PATH` key and will get into the non-external case in the following code in `createDataSourceTables.scala`, where a new key "path" is created with a default path. ``` val optionsWithPath = if (!options.contains("path")) { isExternal = false options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent)) } else { options } ``` So before creating hive table, serdeInfo.parameters will contain both "PATH" and "path" keys and different directories. and Hive table's dataLocation contains the value of "path". The fix in this PR is to convert `options` in the code above to `CaseInsensitiveMap` before checking for containing "path" key. ## How was this patch tested? A testcase is added Author: xin Wu <xinwu@us.ibm.com> Closes #12804 from xwu0226/SPARK-15025.
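A simplified sketch of the fix (the class here is a hypothetical stand-in for a case-insensitive map): normalize option keys before checking for `path`, so `PATH` and `path` are treated as the same option and no second default path is appended.

```scala
// CaseInsensitiveOptions is an illustrative stand-in for a case-insensitive map.
class CaseInsensitiveOptions(m: Map[String, String]) {
  private val normalized = m.map { case (k, v) => k.toLowerCase -> v }
  def contains(key: String): Boolean = normalized.contains(key.toLowerCase)
  def get(key: String): Option[String] = normalized.get(key.toLowerCase)
}

val options = new CaseInsensitiveOptions(Map("PATH" -> "/data/my_table"))
assert(options.contains("path"))  // true, so no extra default "path" entry is added
```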
* [SPARK-15173][SQL] DataFrameWriter.insertInto should work with datasource ↵ | Wenchen Fan | 2016-05-09 | 1 | -0/+8
| | | | | | | | | | | | table stored in hive When we parse `CREATE TABLE USING`, we should build a `CreateTableUsing` plan with the `managedIfNoPath` set to true. Then we will add default table path to options when write it to hive. new test in `SQLQuerySuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12949 from cloud-fan/bug.
* [SPARK-15166][SQL] Move some hive-specific code from SparkSession | Andrew Or | 2016-05-09 | 2 | -6/+13
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This also simplifies the code being moved. ## How was this patch tested? Existing tests. Author: Andrew Or <andrew@databricks.com> Closes #12941 from andrewor14/move-code.
* [SPARK-14127][SQL] Makes 'DESC [EXTENDED|FORMATTED] <table>' support data ↵ | Cheng Lian | 2016-05-09 | 1 | -1/+18
| | | | | | | | | | | | | | | | source tables ## What changes were proposed in this pull request? This is a follow-up of PR #12844. It makes the newly updated `DescribeTableCommand` to support data sources tables. ## How was this patch tested? A test case is added to check `DESC [EXTENDED | FORMATTED] <table>` output. Author: Cheng Lian <lian@databricks.com> Closes #12934 from liancheng/spark-14127-desc-table-follow-up.
* [SPARK-14459][SQL] Detect relation partitioning and adjust the logical plan | Ryan Blue | 2016-05-09 | 3 | -6/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This detects a relation's partitioning and adds checks to the analyzer. If an InsertIntoTable node has no partitioning, it is replaced by the relation's partition scheme and input columns are correctly adjusted, placing the partition columns at the end in partition order. If an InsertIntoTable node has partitioning, it is checked against the table's reported partitions. These changes required adding a PartitionedRelation trait to the catalog interface because Hive's MetastoreRelation doesn't extend CatalogRelation. This commit also includes a fix to InsertIntoTable's resolved logic, which now detects that all expected columns are present, including dynamic partition columns. Previously, the number of expected columns was not checked and resolved was true if there were missing columns. ## How was this patch tested? This adds new tests to the InsertIntoTableSuite that are fixed by this PR. Author: Ryan Blue <blue@apache.org> Closes #12239 from rdblue/SPARK-14459-detect-hive-partitioning.