@ExpressionDescription
Use multi-line string literals for ExpressionDescription with ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit``
The policy, as described at https://github.com/apache/spark/pull/10488:
Let's use multi-line string literals. If we have to have a line with more than 100 characters, let's use ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit`` to bypass the line length requirement.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #10524 from kiszk/SPARK-12580.
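A minimal sketch of the adopted style (the annotated class below is a hypothetical stand-in; a real expression extends Catalyst's expression classes). The multi-line literal keeps each source line short, and the scalastyle guards cover any line that must exceed 100 characters:
```
import org.apache.spark.sql.catalyst.expressions.ExpressionDescription

// scalastyle:off line.size.limit
@ExpressionDescription(
  usage = "_FUNC_(expr) - Returns the absolute value of expr.",
  extended = """
    > SELECT _FUNC_(-1);
     1
  """)
// scalastyle:on line.size.limit
case class MyAbs(child: Long) // stand-in target; a real expression extends UnaryExpression
```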
It was introduced in 917d3fc069fb9ea1c1487119c9c12b373f4f9b77
/cc cloud-fan rxin
Author: Jacek Laskowski <jacek@japila.pl>
Closes #10636 from jaceklaskowski/fix-for-build-failure-2.11.
splits
https://issues.apache.org/jira/browse/SPARK-12662
cc yhuai
Author: Sameer Agarwal <sameer@databricks.com>
Closes #10626 from sameeragarwal/randomsplit.
Parse SQL queries with except/intersect in the FROM clause for HiveQL.
Author: Davies Liu <davies@databricks.com>
Closes #10622 from davies/intersect.
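An illustrative query of the shape now parsed, assuming a SQLContext named `sqlContext` and hypothetical tables:
```
// a set-operation subquery in the FROM clause, now handled by the HiveQL parser
sqlContext.sql("SELECT * FROM (SELECT id FROM t1 INTERSECT SELECT id FROM t2) q")
```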
This PR manages the memory used by window functions (buffered rows) and also enables external spilling.
After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1G of memory.
Author: Davies Liu <davies@databricks.com>
Closes #10605 from davies/unsafe_window.
[SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks.
We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do this.
Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>
Closes #10589 from nongli/spark-12640.
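A minimal standalone sketch of the idea (not the utility class added by this patch; the workload and path are hypothetical, and `sqlContext` is assumed to exist):
```
// time a closure over several iterations and report ms per iteration
def benchmark(name: String, iters: Int)(f: => Unit): Unit = {
  f // warm-up run so JIT compilation doesn't dominate the timing
  val start = System.nanoTime()
  (1 to iters).foreach(_ => f)
  val msPerIter = (System.nanoTime() - start) / 1e6 / iters
  println(f"$name%-30s $msPerIter%10.2f ms/iter")
}

benchmark("parquet scan", iters = 5) {
  sqlContext.read.parquet("/tmp/scan_bench").count() // hypothetical scan workload
}
```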
This PR adds bucket write support to Spark SQL. Users can specify bucketing columns, numBuckets, and sorting columns, with or without partition columns. For example:
```
df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales")
```
When bucketing is used, we will calculate the bucket id for each record and group the records by bucket id. For each group, we will create a file with the bucket id in its name and write data into it. For each bucket file, if sorting columns are specified, the data will be sorted before writing.
Note that there may be multiple files for one bucket, as the data is distributed.
Currently we store the bucket metadata in the Hive metastore in a non-Hive-compatible way. We use a different bucketing hash function than Hive, so we can't be compatible anyway.
Limitations:
* Can't write bucketed data without the Hive metastore.
* Can't insert bucketed data into existing Hive tables.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10498 from cloud-fan/bucket-write.
To avoid generating a huge Java source file (over 64K lines of code) that can't be compiled.
cc hvanhovell
Author: Davies Liu <davies@databricks.com>
Closes #10624 from davies/split_ident.
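A rough sketch of the general technique, under the assumption that the generated statements can be regrouped freely (the real change lives in Catalyst's code generator):
```
// emit generated statements as many small methods instead of one giant one,
// since an over-large generated class or method fails to compile
def splitIntoMethods(statements: Seq[String], chunkSize: Int = 100): String = {
  val methods = statements.grouped(chunkSize).zipWithIndex.map { case (chunk, i) =>
    s"private void apply_$i() {\n  ${chunk.mkString("\n  ")}\n}"
  }.toSeq
  val calls = methods.indices.map(i => s"apply_$i();").mkString("\n  ")
  (methods :+ s"public void apply() {\n  $calls\n}").mkString("\n\n")
}
```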
This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made:
The ANTLR parser & supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project; I have added acknowledgements wherever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean up the ```ASTNode``` class and to improve the error handling.
The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project:
- ```CatalystQl```: This implements Query and Expression parsing functionality.
- ```SparkQl```: This is a subclass of ```CatalystQl``` and provides SQL/Core-only functionality such as Explain and Describe.
- ```HiveQl```: This is a subclass of ```SparkQl``` and adds Hive-only functionality to the parser, such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive.
cc rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10583 from hvanhovell/SPARK-12575.
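A rough sketch of the described hierarchy (the method signature is a simplified assumption, not the actual interface):
```
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

class CatalystQl {                  // query and expression parsing
  def createPlan(sql: String): LogicalPlan = ???
}
class SparkQl extends CatalystQl    // + SQL/Core-only commands (EXPLAIN, DESCRIBE)
class HiveQl extends SparkQl        // + Hive-only commands (ANALYZE, DROP, views, CTAS, transforms)
```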
exactly the same grouping expressions
For queries like:
select <> from table group by a distribute by a
we can eliminate the distribute by, since group by will do a hash partitioning anyway.
This is also applicable when the user uses the DataFrame API.
Author: Yash Datta <Yash.Datta@guavus.com>
Closes #9858 from saucam/eliminatedistribute.
and AsyncRDDActions.takeAsync
I have closed pull request https://github.com/apache/spark/pull/10487 and created this pull request to resolve the problem.
Spark JIRA: https://issues.apache.org/jira/browse/SPARK-12340
Author: QiangCai <david.caiq@gmail.com>
Closes #10562 from QiangCai/bugfix.
aggregate function with OVER clause
JIRA: https://issues.apache.org/jira/browse/SPARK-12578
This slightly updates the Hive parser. We should keep the distinct keyword when it is used in an aggregate function with an OVER clause, so CheckAnalysis will detect it and throw an exception later.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #10557 from viirya/keep-distinct-hivesql.
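An illustrative query of the shape discussed, assuming a SQLContext named `sqlContext` and a hypothetical table; with the distinct keyword preserved, analysis can now reject it instead of silently dropping it:
```
sqlContext.sql("SELECT COUNT(DISTINCT amount) OVER (PARTITION BY region) FROM sales")
```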
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #10582 from vanzin/SPARK-3873-tests.
JDBC data sources.
This fix masks JDBC credentials in the explain output. URL patterns for specifying credentials seem to vary between databases. Added a new method to the dialect to mask the credentials according to the database-specific URL pattern.
While adding tests I noticed the explain output includes an array variable for partitions ([Lorg.apache.spark.Partition;3ff74546,). Modified the code to include the first and last partition information instead.
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes #10452 from sureshthalamati/mask_jdbc_credentials_spark-12504.
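A hypothetical sketch of the masking idea; the real patch adds a method to the JDBC dialect so each database's URL pattern can be handled, and the regex here is an assumption rather than the actual implementation:
```
def maskCredentials(url: String): String =
  url.replaceAll("(?i)(user|password)=[^;&]*", "$1=###")

maskCredentials("jdbc:mysql://host/db?user=admin&password=secret")
// => jdbc:mysql://host/db?user=###&password=###
```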
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #10573 from vanzin/SPARK-3873-sql.
files directly.
As noted in the code, this change is to make this component easier to test in isolation.
Author: Nong <nongli@gmail.com>
Closes #10581 from nongli/spark-12636.
JIRA: https://issues.apache.org/jira/browse/SPARK-12439
In toCatalystArray, we should look at the data type returned by dataTypeFor instead of silentSchemaFor to determine if the element is a native type. An obvious problem is when the element is of class Option[Int]: silentSchemaFor will return Int, and we will wrongly recognize the element as a native type.
There is another problem when using Option as the array element type. When we encode data like Seq(Some(1), Some(2), None) with an encoder, we will use MapObjects to construct an array for it later. But in MapObjects, we don't check if the return value of lambdaFunction is null or not. That causes a bug: the decoded data for Seq(Some(1), Some(2), None) would be Seq(1, 2, -1) instead of Seq(1, 2, null).
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #10391 from viirya/fix-catalystarray.
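An illustrative round-trip of the failing case, assuming a SQLContext named `sqlContext`; with the fix, the None element should decode back to None rather than a -1 sentinel:
```
import sqlContext.implicits._

val ds = Seq(Seq(Some(1), Some(2), None)).toDS()
ds.collect() // expected: Array(List(Some(1), Some(2), None))
```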
I looked at each case individually and it looks like they can all be removed. The only one I had to think twice about was toArray (I even thought about un-deprecating it, until I realized it was a problem in Java to have toArray returning java.util.List).
Author: Reynold Xin <rxin@databricks.com>
Closes #10569 from rxin/SPARK-12615.
This addresses the comments in #10435.
This makes the API easier to use when users programmatically generate calls to hash, and they will get an analysis exception if the argument list of hash is empty.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10588 from cloud-fan/hash.
JIRA: https://issues.apache.org/jira/browse/SPARK-12438
ScalaReflection lacks support for SQLUserDefinedType. We should add it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #10390 from viirya/encoder-udt.
Author: Michael Armbrust <michael@databricks.com>
Closes #10516 from marmbrus/datasetCleanup.
This addresses davies' code review feedback in https://github.com/apache/spark/pull/10559
Author: Reynold Xin <rxin@databricks.com>
Closes #10586 from rxin/remove-deprecated-sql-followup.
group of expressions
Just write the arguments into an unsafe row and use murmur3 to calculate the hash code.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10435 from cloud-fan/hash-expr.
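Illustrative usage, assuming the expression is exposed as `functions.hash` and `df` is an existing DataFrame:
```
import org.apache.spark.sql.functions.hash

val hashed = df.select(hash(df("a"), df("b")).as("hash_ab"))
```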
Author: Reynold Xin <rxin@databricks.com>
Closes #10559 from rxin/remove-deprecated-sql.
Currently, when we call corr or cov on a dataframe with invalid input, we see these error messages for both corr and cov:
- "Currently cov supports calculating the covariance between two columns"
- "Covariance calculation for columns with dataType "[DataType Name]" not supported."
I've fixed this issue by passing the function name as an argument. We could also do the input checks separately for each function. I avoided doing that because of code duplication.
Thanks!
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Closes #10458 from NarineK/sparksqlstatsmessages.
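A hedged sketch of the approach: one shared input check, parameterized by the calling function's name (the helper and its signature are hypothetical):
```
import org.apache.spark.sql.types.{DataType, NumericType}

def requireNumericColumns(cols: Seq[(String, DataType)], functionName: String): Unit =
  cols.foreach { case (name, dt) =>
    require(dt.isInstanceOf[NumericType],
      s"$functionName calculation for column $name with dataType $dt not supported.")
  }
```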
length.
The reader was previously not setting the row length, meaning it was wrong if there were variable-length columns. This problem does not usually manifest, since the value in the column is correct and projecting the row fixes the issue.
Author: Nong Li <nong@databricks.com>
Closes #10576 from nongli/spark-12589.
This PR enables cube/rollup as functions, so they can be used like this:
```
select a, b, sum(c) from t group by rollup(a, b)
```
Author: Davies Liu <davies@databricks.com>
Closes #10522 from davies/rollup.
It is currently possible to change the values of the supposedly immutable ```GenericRow``` and ```GenericInternalRow``` classes. This is caused by the fact that Scala's ArrayOps ```toArray``` (called on the result of ```toSeq```) will return the backing array instead of a copy. This PR fixes this problem.
This PR was inspired by https://github.com/apache/spark/pull/10374 by apo1.
cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation).
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10553 from hvanhovell/SPARK-12421.
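A standalone demonstration of the underlying Scala gotcha (behavior as of Scala 2.10/2.11, where `WrappedArray.toArray` may return the backing array when the element types match):
```
val backing = Array("a", "b", "c")
val seq: Seq[String] = backing // implicit wrap, no copy
val leaked = seq.toArray       // same element type => aliases the backing array
leaked(0) = "mutated"
println(backing(0))            // prints "mutated" -- the "immutable" view leaked
```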
This is the related thread: http://search-hadoop.com/m/q3RTtO3ReeJ1iF02&subj=Re+partitioning+json+data+in+spark
Michael suggested fixing the doc.
Please review.
Author: tedyu <yuzhihong@gmail.com>
Closes #10499 from ted-yu/master.
Author: Xiu Guo <xguo27@gmail.com>
Closes #10500 from xguo27/SPARK-12512.
also only allocate required buffer size
Author: Pete Robbins <robbinspg@gmail.com>
Closes #10421 from robbinspg/master.
Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection.
In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.
This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).
If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different).
This patch is inspired by a similar patch that I made to the `spark-redshift` library (https://github.com/databricks/spark-redshift/pull/143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10519 from JoshRosen/jdbc-driver-precedence.
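A simplified sketch of the selection logic described above (the helper name and error handling are assumptions):
```
import java.sql.{Driver, DriverManager}
import scala.collection.JavaConverters._

def resolveDriver(userSpecifiedClass: Option[String], url: String): Driver =
  userSpecifiedClass match {
    case Some(cls) =>
      DriverManager.getDrivers.asScala // iterate registered drivers explicitly
        .find(_.getClass.getCanonicalName == cls)
        .getOrElse(sys.error(s"Did not find registered driver with class $cls"))
    case None =>
      DriverManager.getDriver(url) // fall back to subprotocol-based lookup
  }
```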
be called value
Author: Xiu Guo <xguo27@gmail.com>
Closes #10515 from xguo27/SPARK-12562.
quoting mechanism
This provides an option so the JSON parser can be configured to accept quoting of all characters or not.
Author: Cazen <Cazen@korea.com>
Author: Cazen Lee <cazen.lee@samsung.com>
Author: Cazen Lee <Cazen@korea.com>
Author: cazen.lee <cazen.lee@samsung.com>
Closes #10497 from Cazen/master.
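Illustrative usage, assuming a SQLContext named `sqlContext`; the option name follows the PR discussion and should be checked against the final docs:
```
val df = sqlContext.read
  .option("allowBackslashEscapingAnyCharacter", "true")
  .json("/path/to/file.json")
```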
Avoid the "No such table" exception and throw an analysis exception instead, as per the bug: SPARK-12533.
Author: thomastechs <thomas.sebastian@tcs.com>
Closes #10529 from thomastechs/topic-branch.
always output UnsafeRow""
This reverts commit 44ee920fd49d35b421ae562ea99bcc8f2b98ced6.
callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduces a new API for that, and replaces invocations of the deprecated callUDF with it.
Author: Reynold Xin <rxin@databricks.com>
Closes #10547 from rxin/SPARK-12599.
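A sketch of the replacement pattern: declare the output type explicitly instead of relying on type tags (`df` is a hypothetical DataFrame, and the exact overload is an assumption):
```
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

val plusOne = udf((x: Int) => x + 1, IntegerType)
df.select(plusOne(df("value")))
```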
and reflection that supported 1.x
Remove use of deprecated Hadoop APIs now that 2.2+ is required
Author: Sean Owen <sowen@cloudera.com>
Closes #10446 from srowen/SPARK-12481.
This PR follows https://github.com/apache/spark/pull/8391.
The previous PR fixed JDBCRDD to support null-safe equality comparison for the JDBC data source. This PR fixes the problem that the comparison can actually return null, resulting in an error when the value of that comparison is used.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: HyukjinKwon <gurwls223@gmail.com>
Closes #8743 from HyukjinKwon/SPARK-10180.
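An illustrative null-safe comparison of the kind involved (`df` is a hypothetical DataFrame); unlike plain equality against NULL, `<=>` never evaluates to null:
```
df.filter(df("name") <=> null)
```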
This PR inlines the Hive SQL parser in Spark SQL.
The previous (merged) incarnation of this PR passed all tests, but had and still has problems with the build. These problems are caused by the fact that, for some reason, in some cases the ANTLR-generated code is not included in the compilation phase.
This PR is a WIP and should not be merged until we have sorted out the build issues.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>
Closes #10525 from hvanhovell/SPARK-12362.
output UnsafeRow"
This reverts commit 0da7bd50ddf0fb9e0e8aeadb9c7fb3edf6f0ee6e.
UnsafeRow
It's confusing that some operators output UnsafeRow but some do not, which makes it easy to make mistakes.
This PR changes all operators (SparkPlan) to only output UnsafeRow and removes the rule that inserted Unsafe/Safe conversions. For those that can't output UnsafeRow directly, an UnsafeProjection was added.
Closes #10330
cc JoshRosen rxin
Author: Davies Liu <davies@databricks.com>
Closes #10511 from davies/unsafe_row.
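A conceptual sketch of the conversion operators now apply internally: wrap an iterator of safe InternalRows with an UnsafeProjection (the schema here is illustrative):
```
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))
val toUnsafe = UnsafeProjection.create(schema)
def convert(iter: Iterator[InternalRow]): Iterator[InternalRow] = iter.map(toUnsafe)
```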
There's a hack done in `TestHive.reset()`, which was intended to mute noisy Hive loggers. However, Spark testing loggers are also muted.
Author: Cheng Lian <lian@databricks.com>
Closes #10540 from liancheng/spark-12592.dont-mute-spark-loggers.
JDBCRDD and add few filters
This patch refactors the filter pushdown for JDBCRDD and also adds a few filters.
The added filters are basically from #10468 with some refactoring. Test cases are from #10468.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #10470 from viirya/refactor-jdbc-filter.
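A hedged sketch of the pushdown idea: translate data source Filters into SQL predicate strings, returning None for anything that can't be pushed down (quoting/escaping is elided here):
```
import org.apache.spark.sql.sources._

def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value)           => Some(s"$attr = '$value'")
  case GreaterThan(attr, value)       => Some(s"$attr > '$value'")
  case IsNull(attr)                   => Some(s"$attr IS NULL")
  case StringStartsWith(attr, prefix) => Some(s"$attr LIKE '$prefix%'")
  case _                              => None
}
```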
A follow-up PR for #9712. Moves the test for arrayOfUDT.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #10538 from viirya/move-udt-test.
Parquet relation with decimal column
https://issues.apache.org/jira/browse/SPARK-12039
Since we no longer support Hadoop 1, we can re-enable this test in master.
Author: Yin Huai <yhuai@databricks.com>
Closes #10533 from yhuai/SPARK-12039-enable.
Right now, numFields is passed in by pointTo(), and then bitSetWidthInBytes is calculated, making pointTo() a little bit heavy.
It should be part of the constructor of UnsafeRow.
Author: Davies Liu <davies@databricks.com>
Closes #10528 from davies/numFields.
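A conceptual sketch of the refactoring, with fields and names simplified: fix numFields at construction so pointTo() only rebinds the memory region:
```
class UnsafeRowSketch(numFields: Int) {
  private val bitSetWidthInBytes = ((numFields + 63) / 64) * 8 // computed once
  private var baseObject: AnyRef = _
  private var baseOffset: Long = _

  def pointTo(obj: AnyRef, offset: Long): Unit = {
    baseObject = obj // rebind only; no per-call recomputation
    baseOffset = offset
  }
}
```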
(docs & tests)
This PR is a follow-up for PR https://github.com/apache/spark/pull/9819. It adds documentation for the window functions and a couple of NULL tests.
The documentation was largely based on the documentation in (the source of) Hive and Presto:
* https://prestodb.io/docs/current/functions/window.html
* https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
I am not sure if we need to add the licenses of these two projects to the licenses directory. They are both under the ASL. srowen any thoughts?
cc yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #10402 from hvanhovell/SPARK-8641-docs.
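An illustrative windowed query of the kind now documented, assuming a SQLContext named `sqlContext` and a hypothetical table:
```
sqlContext.sql("""
  SELECT depname, salary,
         rank() OVER (PARTITION BY depname ORDER BY salary DESC) AS rnk
  FROM empsalary
""")
```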
push-down filters for JDBC
This is a rework of #10386, adding more tests and LIKE push-down support.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #10468 from maropu/SupportMorePushdownInJdbc.
In most cases we should propagate null when calling `NewInstance`, and so far there is only one case where we should stop null propagation: creating a product/Java bean. So I think it makes more sense to propagate null by default.
This also fixes a bug when encoding a null array/map, which was first discovered in https://github.com/apache/spark/pull/10401
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10443 from cloud-fan/encoder.