path: root/sql
Commit message | Author | Age | Files | Lines
* [SPARK-19497][SS] Implement streaming deduplication (Shixiong Zhu, 2017-02-23, 14 files, -58/+572)
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds a special streaming deduplication operator to support `dropDuplicates` with `aggregation` and watermark. It reuses the `dropDuplicates` API but creates new logical plan `Deduplication` and new physical plan `DeduplicationExec`. The following cases are supported: - one or multiple `dropDuplicates()` without aggregation (with or without watermark) - `dropDuplicates` before aggregation Not supported cases: - `dropDuplicates` after aggregation Breaking changes: - `dropDuplicates` without aggregation doesn't work with `complete` or `update` mode. ## How was this patch tested? The new unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #16970 from zsxwing/dedup.
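A brief usage sketch of the feature described above, assuming the built-in `rate` source and a made-up `userId` column; `append` output mode is used since, per the notes above, deduplication without aggregation does not work with `complete` or `update` mode:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("dedup-sketch").master("local[*]").getOrCreate()

// Hypothetical stream: the rate source provides (timestamp, value) columns.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()
  .withColumn("userId", col("value") % 100)

// Deduplicate on (userId, timestamp); the watermark lets the engine evict
// deduplication state for events older than 10 minutes.
val deduped = events
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("userId", "timestamp")

val query = deduped.writeStream
  .format("console")
  .outputMode("append")
  .start()
```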
* [SPARK-19459] Support for nested char/varchar fields in ORC (Herman van Hovell, 2017-02-23, 3 files, -19/+100)
| | | | | | | | | | | | ## What changes were proposed in this pull request? This PR is a small follow-up on https://github.com/apache/spark/pull/16804. This PR also adds support for nested char/varchar fields in orc. ## How was this patch tested? I have added a regression test to the OrcSourceSuite. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #17030 from hvanhovell/SPARK-19459-follow-up.
* [SPARK-19691][SQL] Fix ClassCastException when calculating percentile of decimal column (Takeshi Yamamuro, 2017-02-23, 3 files, -37/+45)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | decimal column ## What changes were proposed in this pull request? This pr fixed a class-cast exception below; ``` scala> spark.range(10).selectExpr("cast (id as decimal) as x").selectExpr("percentile(x, 0.5)").collect() java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to java.lang.Number at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141) at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58) at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) at ``` This fix simply converts catalyst values (i.e., `Decimal`) into scala ones by using `CatalystTypeConverters`. ## How was this patch tested? Added a test in `DataFrameSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17028 from maropu/SPARK-19691.
* [SPARK-19695][SQL] Throw an exception if a `columnNameOfCorruptRecord` field violates requirements in json formats (Takeshi Yamamuro, 2017-02-22, 4 files, -4/+56)
| | | | | | | | | | | | | | violates requirements in json formats ## What changes were proposed in this pull request? This pr comes from #16928 and fixed a json behaviour along with the CSV one. ## How was this patch tested? Added tests in `JsonSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17023 from maropu/SPARK-19695.
* [SPARK-15615][SQL] Add an API to load DataFrame from Dataset[String] storing JSON (pj.fanning, 2017-02-22, 1 file, -2/+18)
| | | | | | | | | | | | | | | | | | | | | JSON ## What changes were proposed in this pull request? SPARK-15615 proposes replacing the sqlContext.read.json(rdd) with a dataset equivalent. SPARK-15463 adds a CSV API for reading from Dataset[String] so this keeps the API consistent. I am deprecating the existing RDD based APIs. ## How was this patch tested? There are existing tests. I left most tests to use the existing APIs as they delegate to the new json API. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: pj.fanning <pj.fanning@workday.com> Author: PJ Fanning <pjfanning@users.noreply.github.com> Closes #16895 from pjfanning/SPARK-15615.
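A short sketch of the new `Dataset[String]` entry point, assuming an active SparkSession named `spark`; the column names and sample payload are made up:

```scala
import org.apache.spark.sql.Dataset
import spark.implicits._

// JSON documents held in a Dataset[String] rather than an RDD.
val jsonStrings: Dataset[String] = Seq(
  """{"name": "alice", "age": 30}""",
  """{"name": "bob", "age": 25}"""
).toDS()

// New entry point: infer the schema and build a DataFrame directly.
val df = spark.read.json(jsonStrings)
df.show()
```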
* [SPARK-19658][SQL] Set NumPartitions of RepartitionByExpression In Parser (Xiao Li, 2017-02-22, 10 files, -32/+74)
| | | | | | | | | | | | | | ### What changes were proposed in this pull request? Currently, if `NumPartitions` is not set in RepartitionByExpression, we will set it using `spark.sql.shuffle.partitions` during Planner. However, this is not following the general resolution process. This PR is to set it in `Parser` and then `Optimizer` can use the value for plan optimization. ### How was this patch tested? Added a test case. Author: Xiao Li <gatorsmile@gmail.com> Closes #16988 from gatorsmile/resolveRepartition.
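For reference, a sketch of the two user-facing paths that produce a `RepartitionByExpression` node, assuming an active SparkSession `spark` and a DataFrame `df`; the table and column names are illustrative:

```scala
import org.apache.spark.sql.functions.col

// Dataset API with an explicit partition count.
val byKey = df.repartition(8, col("key"))

// SQL DISTRIBUTE BY carries no explicit count; per this change the parser now
// fills in spark.sql.shuffle.partitions instead of deferring to the planner.
val distributed = spark.sql("SELECT * FROM events DISTRIBUTE BY key")
```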
* [SPARK-19666][SQL] Skip a property without getter in Java schema inference and allow empty bean in encoder creation (hyukjinkwon, 2017-02-22, 5 files, -32/+54)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | and allow empty bean in encoder creation ## What changes were proposed in this pull request? This PR proposes to fix two. **Skip a property without a getter in beans** Currently, if we use a JavaBean without the getter as below: ```java public static class BeanWithoutGetter implements Serializable { private String a; public void setA(String a) { this.a = a; } } BeanWithoutGetter bean = new BeanWithoutGetter(); List<BeanWithoutGetter> data = Arrays.asList(bean); spark.createDataFrame(data, BeanWithoutGetter.class).show(); ``` - Before It throws an exception as below: ``` java.lang.NullPointerException at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) ``` - After ``` ++ || ++ || ++ ``` **Supports empty bean in encoder creation** ```java public static class EmptyBean implements Serializable {} EmptyBean bean = new EmptyBean(); List<EmptyBean> data = Arrays.asList(bean); spark.createDataset(data, Encoders.bean(EmptyBean.class)).show(); ``` - Before throws an exception as below: ``` java.lang.UnsupportedOperationException: Cannot infer type for class EmptyBean because it is not bean-compliant at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:436) at org.apache.spark.sql.catalyst.JavaTypeInference$.serializerFor(JavaTypeInference.scala:341) ``` - After ``` ++ || ++ || ++ ``` ## How was this patch tested? Unit test in `JavaDataFrameSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #17013 from HyukjinKwon/SPARK-19666.
* [SPARK-13721][SQL] Make GeneratorOuter unresolved. (Bogdan Raducanu, 2017-02-22, 2 files, -2/+8)
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is a small change to make GeneratorOuter always unresolved. It is mostly no-op change but makes it more clear since GeneratorOuter shouldn't survive analysis phase. This requires also handling in ResolveAliases rule. ## How was this patch tested? Existing generator tests. Author: Bogdan Raducanu <bogdan@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #17026 from bogdanrdc/PR16958.
* [SPARK-19670][SQL][TEST] Enable Bucketed Table Reading and Writing Testing Without Hive Support (Xiao Li, 2017-02-21, 4 files, -21/+101)
| | | | | | | | | | | | | | Without Hive Support ### What changes were proposed in this pull request? Bucketed table reading and writing does not need Hive support. We can move the test cases from `sql/hive` to `sql/core`. After this PR, we can improve the test case coverage. Bucket table reading and writing can be tested with and without Hive support. ### How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #17004 from gatorsmile/mvTestCaseForBuckets.
* [SPARK-18922][TESTS] Fix new test failures on Windows due to path and resource not closed (hyukjinkwon, 2017-02-20, 2 files, -17/+18)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | resource not closed ## What changes were proposed in this pull request? This PR proposes to fix new test failures on WIndows as below: **Before** ``` KafkaRelationSuite: - test late binding start offsets *** FAILED *** (7 seconds, 679 milliseconds) Cause: java.nio.file.FileSystemException: C:\projects\spark\target\tmp\spark-4c4b0cd1-4cb7-4908-949d-1b0cc8addb50\topic-4-0\00000000000000000000.log -> C:\projects\spark\target\tmp\spark-4c4b0cd1-4cb7-4908-949d-1b0cc8addb50\topic-4-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process. KafkaSourceSuite: - deserialization of initial offset with Spark 2.1.0 *** FAILED *** (3 seconds, 542 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-97ef64fc-ae61-4ce3-ac59-287fd38bd824 - deserialization of initial offset written by Spark 2.1.0 *** FAILED *** (60 milliseconds) java.nio.file.InvalidPathException: Illegal char <:> at index 2: /C:/projects/spark/external/kafka-0-10-sql/target/scala-2.11/test-classes/kafka-source-initial-offset-version-2.1.0.b HiveDDLSuite: - partitioned table should always put partition columns at the end of table schema *** FAILED *** (657 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-f1b83d09-850a-4bba-8e43-a2a28dfaa757; DDLSuite: - create a data source table without schema *** FAILED *** (94 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-a3f3c161-afae-4d6f-9182-e8642f77062b; - SET LOCATION for managed table *** FAILED *** (219 milliseconds) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange SinglePartit +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#99367L]) +- *FileScan parquet default.tbl[] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:projectsspark arget mpspark-15be2f2f-4ea9-4c47-bfee-1b7b49363033], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> - insert data to a data source table which has a not existed location should succeed *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-34987671-e8d1-4624-ba5b-db1012e1246b; - insert into a data source table with no existed partition location should succeed *** FAILED *** (16 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-4c6ccfbf-4091-4032-9fbc-3d40c58267d5; - read data from a data source table which has a not existed location should succeed *** FAILED *** (0 milliseconds) - read data from a data source table with no existed partition location should succeed *** FAILED *** (0 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-6af39e37-abd1-44e8-ac68-e2dfcf67a2f3; InputOutputMetricsSuite: - output metrics on records written *** FAILED *** (0 milliseconds) java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-cd69ee77-88f2-4202-bed6-19c0ee05ef55\InputOutputMetricsSuite, expected: file:/// - output metrics on records written - new Hadoop API *** FAILED *** (16 milliseconds) java.lang.IllegalArgumentException: Wrong FS: 
file://C:\projects\spark\target\tmp\spark-b69e8fcb-047b-4de8-9cdf-5f026efb6762\InputOutputMetricsSuite, expected: file:/// ``` **After** ``` KafkaRelationSuite: - test late binding start offsets !!! CANCELED !!! (62 milliseconds) KafkaSourceSuite: - deserialization of initial offset with Spark 2.1.0 (5 seconds, 341 milliseconds) - deserialization of initial offset written by Spark 2.1.0 (910 milliseconds) HiveDDLSuite: - partitioned table should always put partition columns at the end of table schema (2 seconds) DDLSuite: - create a data source table without schema (828 milliseconds) - SET LOCATION for managed table (406 milliseconds) - insert data to a data source table which has a not existed location should succeed (406 milliseconds) - insert into a data source table with no existed partition location should succeed (453 milliseconds) - read data from a data source table which has a not existed location should succeed (94 milliseconds) - read data from a data source table with no existed partition location should succeed (265 milliseconds) InputOutputMetricsSuite: - output metrics on records written (172 milliseconds) - output metrics on records written - new Hadoop API (297 milliseconds) ``` ## How was this patch tested? Fixed tests in `InputOutputMetricsSuite`, `KafkaRelationSuite`, `KafkaSourceSuite`, `DDLSuite.scala` and `HiveDDLSuite`. Manually tested via AppVeyor as below: `InputOutputMetricsSuite`: https://ci.appveyor.com/project/spark-test/spark/build/633-20170219-windows-test/job/ex8nvwa6tsh7rmto `KafkaRelationSuite`: https://ci.appveyor.com/project/spark-test/spark/build/633-20170219-windows-test/job/h8dlcowew52y8ncw `KafkaSourceSuite`: https://ci.appveyor.com/project/spark-test/spark/build/634-20170219-windows-test/job/9ybgjl7yeubxcre4 `DDLSuite`: https://ci.appveyor.com/project/spark-test/spark/build/635-20170219-windows-test `HiveDDLSuite`: https://ci.appveyor.com/project/spark-test/spark/build/633-20170219-windows-test/job/up6o9n47er087ltb Author: hyukjinkwon <gurwls223@gmail.com> Closes #16999 from HyukjinKwon/windows-fix.
* [SPARK-19669][HOTFIX][SQL] sessionState access privileges compiled failed in TestSQLContext (windpiger, 2017-02-20, 1 file, -1/+1)
## What changes were proposed in this pull request?
[SPARK-19669](https://github.com/apache/spark/commit/0733a54a4517b82291efed9ac7f7407d9044593c) changed the sessionState access privileges from private to public, which broke compilation in TestSQLContext. This PR is a hotfix for that.
## How was this patch tested?
N/A
Author: windpiger <songjun@outlook.com>
Closes #17008 from windpiger/hotfixcompile.
* [SPARK-19669][SQL] Open up visibility for sharedState, sessionState, and a few other functions (Reynold Xin, 2017-02-20, 2 files, -7/+26)
| | | | | | | | | | | | | | | | few other functions ## What changes were proposed in this pull request? To ease debugging, most of Spark SQL internals have public level visibility. Two of the most important internal states, sharedState and sessionState, however, are package private. It would make more sense to open these up as well with clear documentation that they are internal. In addition, users currently have way to set active/default SparkSession, but no way to actually get them back. We should open those up as well. ## How was this patch tested? N/A - only visibility change. Author: Reynold Xin <rxin@databricks.com> Closes #17002 from rxin/SPARK-19669.
* [SPARK-15453][SQL][FOLLOW-UP] FileSourceScanExec to extract `outputOrdering` information (Xiao Li, 2017-02-20, 1 file, -92/+137)
| | | | | | | | | | | | | | information ### What changes were proposed in this pull request? `outputOrdering` is also dependent on whether the bucket has more than one files. The test cases fail when we try to move them to sql/core. This PR is to fix the test cases introduced in https://github.com/apache/spark/pull/14864 and add a test case to verify [the related logics](https://github.com/tejasapatil/spark/blob/070c24994747c0479fb2520774ede27ff1cf8cac/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L197-L206). ### How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #16994 from gatorsmile/bucketingTS.
* [SPARK-19563][SQL] avoid unnecessary sort in FileFormatWriter (Wenchen Fan, 2017-02-19, 1 file, -99/+90)
## What changes were proposed in this pull request?
In `FileFormatWriter`, we sort the input rows by partition columns, bucket id and sort columns when writing data out partitioned or bucketed. However, if the data is already sorted, we sort it again, which is unnecessary. This PR removes the sorting logic in `FileFormatWriter` and uses `SortExec` instead; `SortExec` is not added if the data is already sorted.
## How was this patch tested?
I did a micro benchmark manually:
```
val df = spark.range(10000000).select($"id", $"id" % 10 as "part").sort("part")
spark.time(df.write.partitionBy("part").parquet("/tmp/test"))
```
The result was about 6.4 seconds before this PR, and 5.7 seconds afterwards. close https://github.com/apache/spark/pull/16724
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16898 from cloud-fan/writer.
* [SPARK-19598][SQL] Remove the alias parameter in UnresolvedRelation (windpiger, 2017-02-19, 12 files, -75/+51)
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Remove the alias parameter in `UnresolvedRelation`, and use `SubqueryAlias` to replace it. This can simplify some `match case` situations. For example, the broadcast hint pull request can have one fewer case https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala#L57-L61 ## How was this patch tested? add some unit tests Author: windpiger <songjun@outlook.com> Closes #16956 from windpiger/removeUnresolveTableAlias.
* [SPARK-19534][TESTS] Convert Java tests to use lambdas, Java 8 features (Sean Owen, 2017-02-19, 8 files, -226/+108)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Convert tests to use Java 8 lambdas, and modest related fixes to surrounding code. ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #16964 from srowen/SPARK-19534.
* [SPARK-19450] Replace askWithRetry with askSync. (jinxing, 2017-02-19, 1 file, -4/+4)
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? `askSync` is already added in `RpcEndpointRef` (see SPARK-19347 and https://github.com/apache/spark/pull/16690#issuecomment-276850068) and `askWithRetry` is marked as deprecated. As mentioned SPARK-18113(https://github.com/apache/spark/pull/16503#event-927953218): >askWithRetry is basically an unneeded API, and a leftover from the akka days that doesn't make sense anymore. It's prone to cause deadlocks (exactly because it's blocking), it imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it. Since `askWithRetry` is just used inside spark and not in user logic. It might make sense to replace all of them with `askSync`. ## How was this patch tested? This PR doesn't change code logic, existing unit test can cover. Author: jinxing <jinxing@meituan.com> Closes #16790 from jinxing64/SPARK-19450.
* [SPARK-19447] Make Range operator generate "recordsRead" metric (Ala Luszczak, 2017-02-18, 6 files, -155/+125)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? The Range was modified to produce "recordsRead" metric instead of "generated rows". The tests were updated and partially moved to SQLMetricsSuite. ## How was this patch tested? Unit tests. Author: Ala Luszczak <ala@databricks.com> Closes #16960 from ala/range-records-read.
* [SPARK-19617][SS] Fix the race condition when starting and stopping a query quickly (Shixiong Zhu, 2017-02-17, 7 files, -106/+63)
| | | | | | | | | | | | | | | | | | | | | | | | | | quickly ## What changes were proposed in this pull request? The streaming thread in StreamExecution uses the following ways to check if it should exit: - Catch an InterruptException. - `StreamExecution.state` is TERMINATED. When starting and stopping a query quickly, the above two checks may both fail: - Hit [HADOOP-14084](https://issues.apache.org/jira/browse/HADOOP-14084) and swallow InterruptException - StreamExecution.stop is called before `state` becomes `ACTIVE`. Then [runBatches](https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252) changes the state from `TERMINATED` to `ACTIVE`. If the above cases both happen, the query will hang forever. This PR changes `state` to `AtomicReference` and uses`compareAndSet` to make sure we only change the state from `INITIALIZING` to `ACTIVE`. It also removes the `runUninterruptibly` hack from ``HDFSMetadata`, because HADOOP-14084 won't cause any problem after we fix the race condition. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16947 from zsxwing/SPARK-19617.
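A minimal sketch of the compare-and-set idea described above; the state names mirror those in the description, while the surrounding object and method names are hypothetical:

```scala
import java.util.concurrent.atomic.AtomicReference

object State extends Enumeration {
  val INITIALIZING, ACTIVE, TERMINATED = Value
}

val state = new AtomicReference(State.INITIALIZING)

// The streaming thread only moves INITIALIZING -> ACTIVE; if stop() already
// set TERMINATED, compareAndSet fails and the thread exits instead of
// silently flipping the query back to ACTIVE.
def startBatches(): Unit = {
  if (state.compareAndSet(State.INITIALIZING, State.ACTIVE)) {
    // run batches ...
  }
}

def stop(): Unit = {
  state.set(State.TERMINATED)
}
```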
* [SPARK-19500] [SQL] Fix off-by-one bug in BytesToBytesMap (Davies Liu, 2017-02-17, 1 file, -0/+40)
## What changes were proposed in this pull request?
Radix sort requires that half of the array be free (as temporary space), so we use 0.5 as the scale factor to make sure that BytesToBytesMap never holds more items than 1/2 of its capacity. It turned out this was not true: the current implementation of append() could leave one more item than the threshold (1/2 of capacity) in the array, which breaks the requirement of radix sort (failing the assert in 2.2, or failing to insert into InMemorySorter in 2.1). This PR fixes the off-by-one bug in BytesToBytesMap. It also fixes a bug, introduced by #15722, where the array would never grow if it failed to grow once (staying at the initial capacity).
## How was this patch tested?
Added regression test.
Author: Davies Liu <davies@databricks.com>
Closes #16844 from davies/off_by_one.
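A hedged sketch of the invariant being restored; the names are illustrative and this is not the actual BytesToBytesMap code. The point is that the check must happen before accepting a new record, so the array never exceeds half of its capacity, which radix sort relies on for scratch space:

```scala
// Simplified: capacity is the size of the pointer array, numKeys the records stored.
def canAppend(numKeys: Int, capacity: Int): Boolean = {
  // Checking "numKeys + 1" up front avoids the off-by-one where the map ends up
  // with one record more than capacity / 2 and radix sort has no room left to use.
  numKeys + 1 <= capacity / 2
}
```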
* [SPARK-18120][SPARK-19557][SQL] Call QueryExecutionListener callback methods for DataFrameWriter methods (Wenchen Fan, 2017-02-16, 3 files, -16/+142)
| | | | | | | | | | | | | | | | | | for DataFrameWriter methods ## What changes were proposed in this pull request? We only notify `QueryExecutionListener` for several `Dataset` operations, e.g. collect, take, etc. We should also do the notification for `DataFrameWriter` operations. ## How was this patch tested? new regression test close https://github.com/apache/spark/pull/16664 Author: Wenchen Fan <wenchen@databricks.com> Closes #16962 from cloud-fan/insert.
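A short example of registering a QueryExecutionListener, which after this change also fires for DataFrameWriter actions; it assumes an active SparkSession `spark` and a DataFrame `df`, and the listener body and output path are illustrative:

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    println(s"$funcName finished in ${durationNs / 1e6} ms")
  }
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {
    println(s"$funcName failed: ${exception.getMessage}")
  }
})

// With this change, writer actions are reported to the listener as well.
df.write.mode("overwrite").parquet("/tmp/out")
```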
* [SPARK-18352][SQL] Support parsing multiline json files (Nathan Howell, 2017-02-16, 11 files, -218/+679)
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? If a new option `wholeFile` is set to `true` the JSON reader will parse each file (instead of a single line) as a value. This is done with Jackson streaming and it should be capable of parsing very large documents, assuming the row will fit in memory. Because the file is not buffered in memory the corrupt record handling is also slightly different when `wholeFile` is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column and byte offsets) to the output if desired. These changes have allowed types other than `String` to be parsed. Support for `UTF8String` and `Text` have been added (alongside `String` and `InputFormat`) and no longer require a conversion to `String` just for parsing. I've also included a few other changes that generate slightly better bytecode and (imo) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits, let me know if they should be flattened into this PR or moved to a new one. ## How was this patch tested? New and existing unit tests. No performance or load tests have been run. Author: Nathan Howell <nhowell@godaddy.com> Closes #16386 from NathanHowell/SPARK-18352.
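Usage sketch of the new option, assuming an active SparkSession `spark`; the path is illustrative:

```scala
// Parse each file as a single JSON value instead of one JSON document per line.
val df = spark.read
  .option("wholeFile", "true")
  .json("/data/multiline/*.json")
```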
* [SPARK-19436][SQL] Add missing tests for approxQuantile (Zheng RuiFeng, 2017-02-16, 3 files, -23/+88)
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? 1, check the behavior with illegal `quantiles` and `relativeError` 2, add tests for `relativeError` > 1 3, update tests for `null` data 4, update some docs for javadoc8 ## How was this patch tested? local test in spark-shell Author: Zheng RuiFeng <ruifengz@foxmail.com> Author: Ruifeng Zheng <ruifengz@foxmail.com> Closes #16776 from zhengruifeng/fix_approxQuantile.
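For reference, a minimal call to the API under test, assuming an active SparkSession `spark`; the column, probabilities and error bound are illustrative:

```scala
val df = spark.range(1000).selectExpr("cast(id as double) as x")

// Approximate 25th/50th/75th percentiles of column "x" with 1% relative error.
val quantiles: Array[Double] = df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01)
```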
* [MINOR][BUILD] Fix javadoc8 break (hyukjinkwon, 2017-02-16, 1 file, -5/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? These error below seems caused by unidoc that does not understand double commented block. ``` [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected [error] * MapGroupsWithStateFunction&lt;String, Integer, Integer, String&gt; mappingFunction = [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected [error] * MapGroupsWithStateFunction&lt;String, Integer, Integer, String&gt; mappingFunction = [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected [error] * new MapGroupsWithStateFunction&lt;String, Integer, Integer, String&gt;() { [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected [error] * new MapGroupsWithStateFunction&lt;String, Integer, Integer, String&gt;() { [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: illegal character: '#' [error] * &#64;Override [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: class, interface, or enum expected [error] * &#64;Override [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected [error] * public String call(String key, Iterator&lt;Integer&gt; value, KeyedState&lt;Integer&gt; state) { [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected [error] * public String call(String key, Iterator&lt;Integer&gt; value, KeyedState&lt;Integer&gt; state) { [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected [error] * public String call(String key, Iterator&lt;Integer&gt; value, KeyedState&lt;Integer&gt; state) { [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected [error] * public String call(String key, Iterator&lt;Integer&gt; value, KeyedState&lt;Integer&gt; state) { [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected [error] * public String call(String key, Iterator&lt;Integer&gt; value, KeyedState&lt;Integer&gt; state) { [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:76: error: class, interface, or enum expected [error] * boolean shouldRemove = ...; // Decide whether to remove the state [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:77: error: class, interface, or enum expected [error] * if (shouldRemove) { [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:79: error: class, interface, or enum expected [error] * } else { [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:81: error: class, interface, or enum expected [error] * state.update(newState); // Set the new state [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:82: error: class, interface, or 
enum expected [error] * } [error] ^ [error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:85: error: class, interface, or enum expected [error] * state.update(initialState); [error] ^ [error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:86: error: class, interface, or enum expected [error] * } [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:90: error: class, interface, or enum expected [error] * </code></pre> [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:92: error: class, interface, or enum expected [error] * tparam S User-defined type of the state to be stored for each key. Must be encodable into [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:93: error: class, interface, or enum expected [error] * Spark SQL types (see {link Encoder} for more details). [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:94: error: class, interface, or enum expected [error] * since 2.1.1 [error] ^ ``` And another link seems unrecognisable. ``` .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:16: error: reference not found [error] * That is, in every batch of the {link streaming.StreamingQuery StreamingQuery}, [error] ``` Note that this PR does not fix the two breaks as below: ``` [error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content [error] * see {link DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile} for [error] ^ [error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:52: error: bad use of '>' [error] * param relativeError The relative target precision to achieve (>= 0). [error] ^ [error] ``` because these seem probably fixed soon in https://github.com/apache/spark/pull/16776 and I intended to avoid potential conflicts. ## How was this patch tested? Manually via `jekyll build` Author: hyukjinkwon <gurwls223@gmail.com> Closes #16926 from HyukjinKwon/javadoc-break.
* [SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support (Sean Owen, 2017-02-16, 11 files, -45/+73)
| | | | | | | | | | | | | | | | | | | | | | | | - Move external/java8-tests tests into core, streaming, sql and remove - Remove MaxPermGen and related options - Fix some reflection / TODOs around Java 8+ methods - Update doc references to 1.7/1.8 differences - Remove Java 7/8 related build profiles - Update some plugins for better Java 8 compatibility - Fix a few Java-related warnings For the future: - Update Java 8 examples to fully use Java 8 - Update Java tests to use lambdas for simplicity - Update Java internal implementations to use lambdas ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #16871 from srowen/SPARK-19493.
* [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN subquery 3rd batch (Kevin Yu, 2017-02-16, 6 files, -0/+1297)
| | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is 3ird batch of test case for IN/NOT IN subquery. In this PR, it has these test files: `in-having.sql` `in-joins.sql` `in-multiple-columns.sql` These are the queries and results from running on DB2. [in-having DB2 version](https://github.com/apache/spark/files/772668/in-having.sql.db2.txt) [output of in-having](https://github.com/apache/spark/files/772670/in-having.sql.db2.out.txt) [in-joins DB2 version](https://github.com/apache/spark/files/772672/in-joins.sql.db2.txt) [output of in-joins](https://github.com/apache/spark/files/772673/in-joins.sql.db2.out.txt) [in-multiple-columns DB2 version](https://github.com/apache/spark/files/772678/in-multiple-columns.sql.db2.txt) [output of in-multiple-columns](https://github.com/apache/spark/files/772680/in-multiple-columns.sql.db2.out.txt) ## How was this patch tested? This pr is adding new test cases. We compare the result from spark with the result from another RDBMS(We used DB2 LUW). If the results are the same, we assume the result is correct. Author: Kevin Yu <qyu@us.ibm.com> Closes #16841 from kevinyu98/spark-18871-33.
* [SPARK-19618][SQL] Inconsistency wrt max. buckets allowed from Dataframe API vs SQL (Tejas Patil, 2017-02-15, 4 files, -19/+25)
| | | | | | | | | | | | | | | | | | vs SQL ## What changes were proposed in this pull request? Jira: https://issues.apache.org/jira/browse/SPARK-19618 Moved the check for validating number of buckets from `DataFrameWriter` to `BucketSpec` creation ## How was this patch tested? - Added more unit tests Author: Tejas Patil <tejasp@fb.com> Closes #16948 from tejasapatil/SPARK-19618_max_buckets.
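Both paths now hit the same validation in `BucketSpec`; a sketch of the two equivalent ways to create a bucketed table, assuming an active SparkSession `spark` and a DataFrame `df`. The table, column names and bucket counts are illustrative, and the exact upper bound that is enforced is not shown here:

```scala
// DataFrame API
df.write
  .bucketBy(8, "user_id")
  .sortBy("user_id")
  .saveAsTable("bucketed_users")

// SQL DDL
spark.sql(
  """CREATE TABLE bucketed_users_sql (user_id INT, name STRING)
    |USING parquet
    |CLUSTERED BY (user_id) SORTED BY (user_id) INTO 8 BUCKETS
  """.stripMargin)
```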
* [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN subquery 4th batch (Kevin Yu, 2017-02-15, 6 files, -0/+2114)
| | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is 4th batch of test case for IN/NOT IN subquery. In this PR, it has these test files: `in-set-operations.sql` `in-with-cte.sql` `not-in-joins.sql` Here are the queries and results from running on DB2. [in-set-operations DB2 version](https://github.com/apache/spark/files/772846/in-set-operations.sql.db2.txt) [Output of in-set-operations](https://github.com/apache/spark/files/772848/in-set-operations.sql.db2.out.txt) [in-with-cte DB2 version](https://github.com/apache/spark/files/772849/in-with-cte.sql.db2.txt) [Output of in-with-cte](https://github.com/apache/spark/files/772856/in-with-cte.sql.db2.out.txt) [not-in-joins DB2 version](https://github.com/apache/spark/files/772851/not-in-joins.sql.db2.txt) [Output of not-in-joins](https://github.com/apache/spark/files/772852/not-in-joins.sql.db2.out.txt) ## How was this patch tested? This pr is adding new test cases. We compare the result from spark with the result from another RDBMS(We used DB2 LUW). If the results are the same, we assume the result is correct. Author: Kevin Yu <qyu@us.ibm.com> Closes #16915 from kevinyu98/spark-18871-44.
* [SPARK-19603][SS] Fix StreamingQuery explain command (Shixiong Zhu, 2017-02-15, 3 files, -11/+52)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? `StreamingQuery.explain` doesn't show the correct streaming physical plan right now because `ExplainCommand` receives a runtime batch plan and its `logicalPlan.isStreaming` is always false. This PR adds `streaming` parameter to `ExplainCommand` to allow `StreamExecution` to specify that it's a streaming plan. Examples of the explain outputs: - streaming DataFrame.explain() ``` == Physical Plan == *HashAggregate(keys=[value#518], functions=[count(1)]) +- StateStoreSave [value#518], OperatorStateId(<unknown>,0,0), Append, 0 +- *HashAggregate(keys=[value#518], functions=[merge_count(1)]) +- StateStoreRestore [value#518], OperatorStateId(<unknown>,0,0) +- *HashAggregate(keys=[value#518], functions=[merge_count(1)]) +- Exchange hashpartitioning(value#518, 5) +- *HashAggregate(keys=[value#518], functions=[partial_count(1)]) +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518] +- *MapElements <function1>, obj#517: java.lang.String +- *DeserializeToObject value#513.toString, obj#516: java.lang.String +- StreamingRelation MemoryStream[value#513], [value#513] ``` - StreamingQuery.explain(extended = false) ``` == Physical Plan == *HashAggregate(keys=[value#518], functions=[count(1)]) +- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0 +- *HashAggregate(keys=[value#518], functions=[merge_count(1)]) +- StateStoreRestore [value#518], OperatorStateId(...,0,0) +- *HashAggregate(keys=[value#518], functions=[merge_count(1)]) +- Exchange hashpartitioning(value#518, 5) +- *HashAggregate(keys=[value#518], functions=[partial_count(1)]) +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518] +- *MapElements <function1>, obj#517: java.lang.String +- *DeserializeToObject value#543.toString, obj#516: java.lang.String +- LocalTableScan [value#543] ``` - StreamingQuery.explain(extended = true) ``` == Parsed Logical Plan == Aggregate [value#518], [value#518, count(1) AS count(1)#524L] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518] +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String +- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String +- LocalRelation [value#543] == Analyzed Logical Plan == value: string, count(1): bigint Aggregate [value#518], [value#518, count(1) AS count(1)#524L] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518] +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String +- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String +- LocalRelation [value#543] == Optimized Logical Plan == Aggregate [value#518], [value#518, count(1) AS count(1)#524L] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518] +- MapElements <function1>, class 
java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String +- DeserializeToObject value#543.toString, obj#516: java.lang.String +- LocalRelation [value#543] == Physical Plan == *HashAggregate(keys=[value#518], functions=[count(1)], output=[value#518, count(1)#524L]) +- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0 +- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L]) +- StateStoreRestore [value#518], OperatorStateId(...,0,0) +- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L]) +- Exchange hashpartitioning(value#518, 5) +- *HashAggregate(keys=[value#518], functions=[partial_count(1)], output=[value#518, count#530L]) +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518] +- *MapElements <function1>, obj#517: java.lang.String +- *DeserializeToObject value#543.toString, obj#516: java.lang.String +- LocalTableScan [value#543] ``` ## How was this patch tested? The updated unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #16934 from zsxwing/SPARK-19603.
* [SPARK-19599][SS] Clean up HDFSMetadataLog (Shixiong Zhu, 2017-02-15, 2 files, -24/+19)
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog. This PR includes the following changes: - ~~Remove the workaround codes for HADOOP-10622.~~ Unfortunately, there is another issue [HADOOP-14084](https://issues.apache.org/jira/browse/HADOOP-14084) that prevents us from removing the workaround codes. - Remove unnecessary `writer: (T, OutputStream) => Unit` and just call `serialize` directly. - Remove catching FileNotFoundException. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16932 from zsxwing/metadata-cleanup.
* [SPARK-18937][SQL] Timezone support in CSV/JSON parsing (Takuya UESHIN, 2017-02-15, 18 files, -99/+312)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is a follow-up pr of #16308. This pr enables timezone support in CSV/JSON parsing. We should introduce `timeZone` option for CSV/JSON datasources (the default value of the option is session local timezone). The datasources should use the `timeZone` option to format/parse to write/read timestamp values. Notice that while reading, if the timestampFormat has the timezone info, the timezone will not be used because we should respect the timezone in the values. For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT`, the values written with the default timezone option, which is `"GMT"` because session local timezone is `"GMT"` here, are: ```scala scala> spark.conf.set("spark.sql.session.timeZone", "GMT") scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts") df: org.apache.spark.sql.DataFrame = [ts: timestamp] scala> df.show() +-------------------+ |ts | +-------------------+ |2016-01-01 00:00:00| +-------------------+ scala> df.write.json("/path/to/gmtjson") ``` ```sh $ cat /path/to/gmtjson/part-* {"ts":"2016-01-01T00:00:00.000Z"} ``` whereas setting the option to `"PST"`, they are: ```scala scala> df.write.option("timeZone", "PST").json("/path/to/pstjson") ``` ```sh $ cat /path/to/pstjson/part-* {"ts":"2015-12-31T16:00:00.000-08:00"} ``` We can properly read these files even if the timezone option is wrong because the timestamp values have timezone info: ```scala scala> val schema = new StructType().add("ts", TimestampType) schema: org.apache.spark.sql.types.StructType = StructType(StructField(ts,TimestampType,true)) scala> spark.read.schema(schema).json("/path/to/gmtjson").show() +-------------------+ |ts | +-------------------+ |2016-01-01 00:00:00| +-------------------+ scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show() +-------------------+ |ts | +-------------------+ |2016-01-01 00:00:00| +-------------------+ ``` And even if `timezoneFormat` doesn't contain timezone info, we can properly read the values with setting correct timezone option: ```scala scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson") ``` ```sh $ cat /path/to/jstjson/part-* {"ts":"2016-01-01T09:00:00"} ``` ```scala // wrong result scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show() +-------------------+ |ts | +-------------------+ |2016-01-01 09:00:00| +-------------------+ // correct result scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show() +-------------------+ |ts | +-------------------+ |2016-01-01 00:00:00| +-------------------+ ``` This pr also makes `JsonToStruct` and `StructToJson` `TimeZoneAwareExpression` to be able to evaluate values with timezone option. ## How was this patch tested? Existing tests and added some tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #16750 from ueshin/issues/SPARK-18937.
* [SPARK-19329][SQL] Reading from or writing to a datasource table with a non pre-existing location should succeed (windpiger, 2017-02-15, 2 files, -1/+121)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pre-existing location should succeed ## What changes were proposed in this pull request? when we insert data into a datasource table use `sqlText`, and the table has an not exists location, this will throw an Exception. example: ``` spark.sql("create table t(a string, b int) using parquet") spark.sql("alter table t set location '/xx'") spark.sql("insert into table t select 'c', 1") ``` Exception: ``` com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: /xx; at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814) at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122) at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453) ``` As discussed following comments, we should unify the action when we reading from or writing to a datasource table with a non pre-existing locaiton: 1. reading from a datasource table: return 0 rows 2. writing to a datasource table: write data successfully ## How was this patch tested? unit test added Author: windpiger <songjun@outlook.com> Closes #16672 from windpiger/insertNotExistLocation.
* [SPARK-19607][HOTFIX] Finding QueryExecution that matches provided executionId (Dongjoon Hyun, 2017-02-15, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? #16940 adds a test case which does not stop the spark job. It causes many failures of other test cases. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/2403/consoleFull - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/2600/consoleFull ``` [info] org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at: ``` ## How was this patch tested? Pass the Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16943 from dongjoon-hyun/SPARK-19607-2.
* [SPARK-19331][SQL][TESTS] Improve the test coverage of SQLViewSuite (jiangxingbo, 2017-02-15, 5 files, -302/+302)
## What changes were proposed in this pull request?
Move `SQLViewSuite` from `sql/hive` to `sql/core`, so that view support can be tested without a Hive metastore. The Hive-specific test cases are moved to `HiveSQLViewSuite`. This also improves the test coverage of SQLViewSuite, covering the following cases: 1. view resolution (where a referenced table/view may have changed after the view was created); 2. handling a view with user-specified column names; 3. improved test cases for nested views. A test case is also added for cyclic view references, a known issue that is not fixed yet.
## How was this patch tested?
N/A
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes #16674 from jiangxb1987/view-test.
* [SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column (Felix Cheung, 2017-02-15, 2 files, -2/+18)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add coalesce on DataFrame for down partitioning without shuffle and coalesce on Column ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16739 from felixcheung/rcoalesce.
* [SPARK-16475][SQL] broadcast hint for SQL queries - disallow space as the delimiter (Liang-Chi Hsieh, 2017-02-15, 2 files, -3/+8)
| | | | | | | | | | | | | | | | | | delimiter ## What changes were proposed in this pull request? A follow-up to disallow space as the delimiter in broadcast hint. ## How was this patch tested? Jenkins test. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #16941 from viirya/disallow-space-delimiter.
* [SPARK-18872][SQL][TESTS] New test cases for EXISTS subquery (Joins + CTE) (Dilip Biswal, 2017-02-15, 4 files, -0/+933)
| | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds the third and final set of tests for EXISTS subquery. File name | Brief description ------------------------| ----------------- exists-cte.sql |Tests Exist subqueries referencing CTE exists-joins-and-set-ops.sql|Tests Exists subquery used in Joins (Both when joins occurs in outer and suquery blocks) DB2 results are attached here as reference : [exists-cte-db2.txt](https://github.com/apache/spark/files/752091/exists-cte-db2.txt) [exists-joins-and-set-ops-db2.txt](https://github.com/apache/spark/files/753283/exists-joins-and-set-ops-db2.txt) (updated) ## How was this patch tested? The test result is compared with the result run from another SQL engine (in this case is IBM DB2). If the result are equivalent, we assume the result is correct. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #16802 from dilipbiswal/exists-pr3.
* [SPARK-18873][SQL][TEST] New test cases for scalar subquery (part 2 of 2) - scalar subquery in predicate context (Nattavut Sutyanyong, 2017-02-15, 4 files, -66/+701)
| | | | | | | | | | | | | | scalar subquery in predicate context ## What changes were proposed in this pull request? This PR adds new test cases for scalar subquery in predicate context ## How was this patch tested? The test result is compared with the result run from another SQL engine (in this case is IBM DB2). If the result are equivalent, we assume the result is correct. Author: Nattavut Sutyanyong <nsy.can@gmail.com> Closes #16798 from nsyca/18873-2.
* [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN subquery 2nd batch (Kevin Yu, 2017-02-15, 8 files, -114/+1023)
| | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is 2nd batch of test case for IN/NOT IN subquery. In this PR, it has these test cases: `in-limit.sql` `in-order-by.sql` `not-in-group-by.sql` These are the queries and results from running on DB2. [in-limit DB2 version](https://github.com/apache/spark/files/743267/in-limit.sql.db2.out.txt) [in-order-by DB2 version](https://github.com/apache/spark/files/743269/in-order-by.sql.db2.txt) [not-in-group-by DB2 version](https://github.com/apache/spark/files/743271/not-in-group-by.sql.db2.txt) [output of in-limit.sql DB2](https://github.com/apache/spark/files/743276/in-limit.sql.db2.out.txt) [output of in-order-by.sql DB2](https://github.com/apache/spark/files/743278/in-order-by.sql.db2.out.txt) [output of not-in-group-by.sql DB2](https://github.com/apache/spark/files/743279/not-in-group-by.sql.db2.out.txt) ## How was this patch tested? This pr is adding new test cases. Author: Kevin Yu <qyu@us.ibm.com> Closes #16759 from kevinyu98/spark-18871-2.
* [SPARK-17076][SQL] Cardinality estimation for join based on basic column statistics (Zhenhua Wang, 2017-02-15, 7 files, -24/+801)
| | | | | | | | | | | | | | | | | | | | | statistics ## What changes were proposed in this pull request? Support cardinality estimation and stats propagation for all join types. Limitations: - For inner/outer joins without any equal condition, we estimate it like cartesian product. - For left semi/anti joins, since we can't apply the heuristics for inner join to it, for now we just propagate the statistics from left side. We should support them when other advanced stats (e.g. histograms) are available in spark. ## How was this patch tested? Add a new test suite. Author: Zhenhua Wang <wzh_zju@163.com> Author: wangzhenhua <wangzhenhua@huawei.com> Closes #16228 from wzhfy/joinEstimate.
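For background, a hedged sketch of the textbook estimate for an inner equi-join on a single key; this is the standard formula, not necessarily the exact code in this patch, which also handles many more cases. The output row count is bounded by the product of the input cardinalities divided by the larger number of distinct key values:

```scala
// rowsA, rowsB: input cardinalities; ndvA, ndvB: distinct values of the join key.
def innerEquiJoinEstimate(rowsA: BigInt, rowsB: BigInt, ndvA: BigInt, ndvB: BigInt): BigInt = {
  val maxNdv = ndvA.max(ndvB)
  if (maxNdv == 0) BigInt(0) else rowsA * rowsB / maxNdv
}
```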
* [SPARK-19587][SQL] bucket sorting columns should not be picked from partition columns (Wenchen Fan, 2017-02-15, 3 files, -48/+25)
| | | | | | | | | | | | | | | | | | partition columns ## What changes were proposed in this pull request? We will throw an exception if bucket columns are part of partition columns, this should also apply to sort columns. This PR also move the checking logic from `DataFrameWriter` to `PreprocessTableCreation`, which is the central place for checking and normailization. ## How was this patch tested? updated test. Author: Wenchen Fan <wenchen@databricks.com> Closes #16931 from cloud-fan/bucket.
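An illustrative write that the strengthened check now rejects (the DataFrame, table and column names are made up): sorting the buckets by a column that is already a partition column.

```scala
// "date" is a partition column, so it may no longer be used as a sort column,
// mirroring the existing rule that bucket columns may not be partition columns.
df.write
  .partitionBy("date")
  .bucketBy(8, "user_id")
  .sortBy("date")          // now rejected during analysis/preprocessing
  .saveAsTable("events_bucketed")
```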
* [SPARK-16475][SQL] broadcast hint for SQL queries - follow up (Reynold Xin, 2017-02-15, 3 files, -20/+18)
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? A small update to https://github.com/apache/spark/pull/16925 1. Rename SubstituteHints -> ResolveHints to be more consistent with rest of the rules. 2. Added more documentation in the rule and be more defensive / future proof to skip views as well as CTEs. ## How was this patch tested? This pull request contains no real logic change and all behavior should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #16939 from rxin/SPARK-16475.
* [SPARK-19607] Finding QueryExecution that matches provided executionId (Ala Luszczak, 2017-02-15, 2 files, -0/+41)
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Implementing a mapping between executionId and corresponding QueryExecution in SQLExecution. ## How was this patch tested? Adds a unit test. Author: Ala Luszczak <ala@databricks.com> Closes #16940 from ala/execution-id.
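A minimal sketch of the idea (the object and method names are illustrative, not the actual SQLExecution internals): keep a concurrent map from execution id to the running QueryExecution so it can be looked up later.

```scala
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.sql.execution.QueryExecution

object ExecutionRegistry {
  private val executions = new ConcurrentHashMap[Long, QueryExecution]()

  // Register the QueryExecution for the duration of the execution.
  def withExecution[T](id: Long, qe: QueryExecution)(body: => T): T = {
    executions.put(id, qe)
    try body finally executions.remove(id)
  }

  def lookup(id: Long): Option[QueryExecution] = Option(executions.get(id))
}
```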
* [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the user in case-sensitive manner. (sureshthalamati, 2017-02-14, 13 files, -34/+110)
| | | | | | | | | | | | | | | | | | | | user in case-sensitive manner. ## What changes were proposed in this pull request? The reason for test failure is that the property “oracle.jdbc.mapDateToTimestamp” set by the test was getting converted into all lower case. Oracle database expects this property in case-sensitive manner. This test was passing in previous releases because connection properties were sent as user specified for the test case scenario. Fixes to handle all option uniformly in case-insensitive manner, converted the JDBC connection properties also to lower case. This PR enhances CaseInsensitiveMap to keep track of input case-sensitive keys , and uses those when creating connection properties that are passed to the JDBC connection. Alternative approach PR https://github.com/apache/spark/pull/16847 is to pass original input keys to JDBC data source by adding check in the Data source class and handle case-insensitivity in the JDBC source code. ## How was this patch tested? Added new test cases to JdbcSuite , and OracleIntegrationSuite. Ran docker integration tests passed on my laptop, all tests passed successfully. Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #16891 from sureshthalamati/jdbc_case_senstivity_props_fix-SPARK-19318.
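An example of the kind of read this fixes, assuming an active SparkSession `spark`; the URL, table and credentials are placeholders. DataFrameReader options are matched case-insensitively, but after this change the original casing is preserved when unrecognized keys are forwarded to the JDBC driver as connection properties:

```scala
val oracleDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/service")
  .option("dbtable", "SCHEMA.SOME_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  // Forwarded to the driver with its original casing, which Oracle requires.
  .option("oracle.jdbc.mapDateToTimestamp", "false")
  .load()
```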
* [SPARK-16475][SQL] Broadcast hint for SQL Queries (Reynold Xin, 2017-02-14, 10 files, -4/+340)
| | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This pull request introduces a simple hint infrastructure to SQL and implements broadcast join hint using the infrastructure. The hint syntax looks like the following: ``` SELECT /*+ BROADCAST(t) */ * FROM t ``` For broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted on top of any relation (that is not aliased differently), subquery, or common table expression that match the specified name. The hint resolution works by recursively traversing down the query plan to find a relation or subquery that matches one of the specified broadcast aliases. The traversal does not go past beyond any existing broadcast hints, subquery aliases. This rule happens before common table expressions. Note that there was an earlier patch in https://github.com/apache/spark/pull/14426. This is a rewrite of that patch, with different semantics and simpler test cases. ## How was this patch tested? Added a new unit test suite for the broadcast hint rule (SubstituteHintsSuite) and new test cases for parser change (in PlanParserSuite). Also added end-to-end test case in BroadcastSuite. Author: Reynold Xin <rxin@databricks.com> Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16925 from rxin/SPARK-16475-broadcast-hint.
* [SPARK-19589][SQL] Removal of SQLGEN files (Xiao Li, 2017-02-14, 126 files, -654/+0)
| | | | | | | | | | | | ### What changes were proposed in this pull request? SQLGen is removed. Thus, the generated files should be removed too. ### How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #16921 from gatorsmile/removeSQLGenFiles.
* [SPARK-19539][SQL] Block duplicate temp table during creation (Xin Wu, 2017-02-13, 10 files, -137/+160)
## What changes were proposed in this pull request?
`CREATE TEMPORARY TABLE ...` is currently deprecated in favor of `CREATE TEMPORARY VIEW ...`, and it does not support an `IF NOT EXISTS` clause. However, if a temporary view is already defined, it is possible to unintentionally replace it by issuing `CREATE TEMPORARY TABLE ...` with the same table/view name. This PR disallows `CREATE TEMPORARY TABLE ...` when a view with that name already exists. Under the cover, `CREATE TEMPORARY TABLE ...` still creates a temporary view, but now passes the flag `replace=false` instead of `true`, so if a view with the same name exists the operation is blocked.
## How was this patch tested?
A new unit test case is added, and some existing test cases are updated to adapt to the new behavior.
Author: Xin Wu <xinwu@us.ibm.com>
Closes #16878 from xwu0226/block_duplicate_temp_table.
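An illustrative sequence that this change now rejects, assuming an active SparkSession `spark`; the view name and path are made up:

```scala
// A temporary view "t" already exists...
spark.range(10).createOrReplaceTempView("t")

// ...so the deprecated CREATE TEMPORARY TABLE form now fails instead of
// silently replacing the existing view.
spark.sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/tmp/some_table')")
```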
* [SPARK-19115][SQL] Supporting Create Table Like Location (ouyangxiaochen, 2017-02-13, 5 files, -37/+159)
## What changes were proposed in this pull request?
Support the CREATE [EXTERNAL] TABLE ... LIKE ... LOCATION ... syntax for Hive serde and datasource tables. In this PR, we follow the Spark SQL design rules: creating a table LIKE a view, physical table or temporary view with a LOCATION is supported, and a table created with a LOCATION becomes an external table rather than a managed table.
## How was this patch tested?
Added new test cases and updated existing test cases.
Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
Closes #16868 from ouyangxiaochen/spark19115.
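A sketch of the newly supported syntax, assuming an active SparkSession `spark`; the table names and path are illustrative. Per the description above, supplying LOCATION makes the new table external:

```scala
// Copy the metadata of an existing table/view into a new table at an explicit
// location; the new table is registered as an EXTERNAL table.
spark.sql("CREATE TABLE backup_events LIKE events LOCATION '/warehouse/backup/events'")
```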
* [SPARK-19435][SQL] Type coercion between ArrayTypes (hyukjinkwon, 2017-02-13, 2 files, -43/+120)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR proposes to support type coercion between `ArrayType`s where the element types are compatible. **Before** ``` Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))") org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got GREATEST(array<int>, array<double>).; line 1 pos 0; Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))") org.apache.spark.sql.AnalysisException: cannot resolve 'least(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got LEAST(array<int>, array<double>).; line 1 pos 0; sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)") org.apache.spark.sql.AnalysisException: incompatible types found in column a for inline table; line 1 pos 14 Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b")) org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(DoubleType,false) <> ArrayType(IntegerType,false) at the first column of the second table;; sql("SELECT IF(1=1, array(1), array(1D))") org.apache.spark.sql.AnalysisException: cannot resolve '(IF((1 = 1), array(1), array(1.0D)))' due to data type mismatch: differing types in '(IF((1 = 1), array(1), array(1.0D)))' (array<int> and array<double>).; line 1 pos 7; ``` **After** ```scala Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))") res5: org.apache.spark.sql.DataFrame = [greatest(a, array(1.0)): array<double>] Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))") res6: org.apache.spark.sql.DataFrame = [least(a, array(1.0)): array<double>] sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)") res8: org.apache.spark.sql.DataFrame = [a: array<double>] Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b")) res10: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: array<double>] sql("SELECT IF(1=1, array(1), array(1D))") res15: org.apache.spark.sql.DataFrame = [(IF((1 = 1), array(1), array(1.0))): array<double>] ``` ## How was this patch tested? Unit tests in `TypeCoercion` and Jenkins tests and building with scala 2.10 ```scala ./dev/change-scala-version.sh 2.10 ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #16777 from HyukjinKwon/SPARK-19435.
* [SPARK-19542][SS] Delete the temp checkpoint if a query is stopped without errors (Shixiong Zhu, 2017-02-13, 3 files, -3/+53)
| | | | | | | | | | | | | | | | errors ## What changes were proposed in this pull request? When a query uses a temp checkpoint dir, it's better to delete it if it's stopped without errors. ## How was this patch tested? New unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #16880 from zsxwing/delete-temp-checkpoint.
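An illustration of the case this covers, assuming `streamingDF` is a streaming DataFrame; the sink is illustrative. When no checkpointLocation is supplied, Spark creates a temporary checkpoint directory, and after this change it is removed when the query stops cleanly:

```scala
val query = streamingDF.writeStream
  .format("console")
  .outputMode("append")
  // No checkpointLocation option: Spark creates a temporary checkpoint dir.
  .start()

query.processAllAvailable()
// Stopping without errors now also removes the temporary checkpoint directory.
query.stop()
```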