| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
| |
fallback mode.
Author: Reynold Xin <rxin@databricks.com>
Closes #7767 from rxin/SPARK-9462 and squashes the following commits:
ef3e2d9 [Reynold Xin] Removed println
713ac3a [Reynold Xin] More unit tests.
bb5c334 [Reynold Xin] [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode.
|
|
|
|
|
|
|
|
|
|
|
|
| |
In our existing sort prefix generation code, we use expression's eval method to generate the prefix, which results in object allocation for every prefix. We can use the specialized getters available on InternalRow directly to avoid the object allocation.
I also removed the FLOAT prefix, opting for converting float directly to double.
Author: Reynold Xin <rxin@databricks.com>
Closes #7763 from rxin/sort-prefix and squashes the following commits:
5dc2f06 [Reynold Xin] [SPARK-9458] Avoid object allocation in prefix generation.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
across instances.
We accidentally moved the list of expressions from the generated code instance to the class wrapper, and as a result, different threads are sharing the same set of expressions, which cause problems for expressions with mutable state.
This pull request fixed that problem, and also added unit tests for all codegen classes, except GeneratedOrdering (which will never need any expressions since sort now only accepts bound references.
Author: Reynold Xin <rxin@databricks.com>
Closes #7759 from rxin/SPARK-9448 and squashes the following commits:
c09b50f [Reynold Xin] [SPARK-9448][SQL] GenerateUnsafeProjection should not share expressions across instances.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We need to make page sizes configurable so we can reduce them in unit tests and increase them in real production workloads. These sizes are now controlled by a new configuration, `spark.buffer.pageSize`. The new default is 64 megabytes.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #7741 from JoshRosen/SPARK-9411 and squashes the following commits:
a43c4db [Josh Rosen] Fix pow
2c0eefc [Josh Rosen] Fix MAXIMUM_PAGE_SIZE_BYTES comment + value
bccfb51 [Josh Rosen] Lower page size to 4MB in TestHive
ba54d4b [Josh Rosen] Make UnsafeExternalSorter's page size configurable
0045aa2 [Josh Rosen] Make UnsafeShuffle's page size configurable
bc734f0 [Josh Rosen] Rename configuration
e614858 [Josh Rosen] Makes BytesToBytesMap page size configurable
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We want to introduce a new IntervalType in 1.6 that is based on only the number of microseoncds,
so interval can be compared.
Renaming the existing IntervalType to CalendarIntervalType so we can do that in the future.
Author: Reynold Xin <rxin@databricks.com>
Closes #7745 from rxin/calendarintervaltype and squashes the following commits:
99f64e8 [Reynold Xin] One more line ...
13466c8 [Reynold Xin] Fixed tests.
e20f24e [Reynold Xin] [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType.
|
|
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #7747 from rxin/SPARK-9127 and squashes the following commits:
e851418 [Reynold Xin] [SPARK-9127][SQL] Rand/Randn codegen fails with long seed.
|
|
|
|
|
|
|
|
|
|
|
|
| |
as an offline discussion with rxin , it's weird to be computing stuff while doing sorting, we should only order by bound reference during execution.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7593 from cloud-fan/sort and squashes the following commits:
7b1bef7 [Wenchen Fan] add test
daf206d [Wenchen Fan] add more comments
289bee0 [Wenchen Fan] do not order by expressions which still need evaluation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Right now, we use double to parse all the float number in SQL. When it's used in expression together with DecimalType, it will turn the decimal into double as well. Also it will loss some precision when using double.
This PR change to parse float number to decimal or double, based on it's using scientific notation or not, see https://msdn.microsoft.com/en-us/library/ms179899.aspx
This is a break change, should we doc it somewhere?
Author: Davies Liu <davies@databricks.com>
Closes #7642 from davies/parse_decimal and squashes the following commits:
1f576d9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
5e142b6 [Davies Liu] fix scala style
eca99de [Davies Liu] fix tests
2afe702 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
f4a320b [Davies Liu] Update SqlParser.scala
1c48e34 [Davies Liu] use decimal or double when parsing SQL
|
|
|
|
|
|
|
|
|
|
|
|
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-9398
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #7725 from yjshen/date_null_check and squashes the following commits:
b4eade1 [Yijie Shen] inline daysToMonthEnd
d09acc1 [Yijie Shen] implement getLastDayOfMonth to avoid repeated evaluation
d857ec3 [Yijie Shen] add null check in DateExpressionSuite
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
nondeterministic expression before evaluation and fix PullOutNondeterministic
We will do local projection for LocalRelation, and thus reuse the same Expression object among multiply evaluations. We should reset the mutable states of Expression before evaluate it.
Fix `PullOutNondeterministic` rule to make it work for `Sort`.
Also got a chance to cleanup the dataframe test suite.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7674 from cloud-fan/show and squashes the following commits:
888934f [Wenchen Fan] fix sort
c0e93e8 [Wenchen Fan] local DataFrame with random columns should return same value when call `show`
|
|
|
|
|
|
|
|
|
|
|
|
| |
buffers
https://issues.apache.org/jira/browse/SPARK-9422
Author: Yin Huai <yhuai@databricks.com>
Closes #7737 from yhuai/removePlaceHolder and squashes the following commits:
ec29b44 [Yin Huai] Remove placeholder attributes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
get(ordinal, dataType)
UnsafeRow.getDouble and getFloat() return NaN when called on columns that are null, which is inconsistent with the behavior of other row classes (which is to return 0.0).
In addition, the generic get(ordinal, dataType) method should always return null for a null literal, but currently it handles nulls by calling the type-specific accessors.
This patch addresses both of these issues and adds a regression test.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #7736 from JoshRosen/unsafe-row-null-fixes and squashes the following commits:
c8eb2ee [Josh Rosen] Fix test in UnsafeRowConverterSuite
6214682 [Josh Rosen] Fixes to null handling in UnsafeRow
|
|
|
|
|
|
|
|
|
|
|
|
| |
Sort-merge join is more robust in Spark since sorting can be made using the Tungsten sort operator.
Author: Reynold Xin <rxin@databricks.com>
Closes #7733 from rxin/smj and squashes the following commits:
61e4d34 [Reynold Xin] Fixed test case.
5ffd731 [Reynold Xin] Fixed JoinSuite.
a137dc0 [Reynold Xin] [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since catalyst package already depends on Spark core, we can move those expressions
into catalyst, and simplify function registry.
This is a followup of #7478.
Author: Reynold Xin <rxin@databricks.com>
Closes #7735 from rxin/SPARK-8003 and squashes the following commits:
2ffbdc3 [Reynold Xin] [SPARK-8003][SQL] Move expressions in sql/core package to catalyst.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
SparkSQL's ScriptTransform operator has several serious bugs which make debugging fairly difficult:
- If exceptions are thrown in the writing thread then the child process will not be killed, leading to a deadlock because the reader thread will block while waiting for input that will never arrive.
- TaskContext is not propagated to the writer thread, which may cause errors in upstream pipelined operators.
- Exceptions which occur in the writer thread are not propagated to the main reader thread, which may cause upstream errors to be silently ignored instead of killing the job. This can lead to silently incorrect query results.
- The writer thread is not a daemon thread, but it should be.
In addition, the code in this file is extremely messy:
- Lots of fields are nullable but the nullability isn't clearly explained.
- Many confusing variable names: for instance, there are variables named `ite` and `iterator` that are defined in the same scope.
- Some code was misindented.
- The `*serdeClass` variables are actually expected to be single-quoted strings, which is really confusing: I feel that this parsing / extraction should be performed in the analyzer, not in the operator itself.
- There were no unit tests for the operator itself, only end-to-end tests.
This pull request addresses these issues, borrowing some error-handling techniques from PySpark's PythonRDD.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #7710 from JoshRosen/script-transform and squashes the following commits:
16c44e2 [Josh Rosen] Update some comments
983f200 [Josh Rosen] Use unescapeSQLString instead of stripQuotes
6a06a8c [Josh Rosen] Clean up handling of quotes in serde class name
494cde0 [Josh Rosen] Propagate TaskContext to writer thread
323bb2b [Josh Rosen] Fix error-swallowing bug
b31258d [Josh Rosen] Rename iterator variables to disambiguate.
88278de [Josh Rosen] Split ScriptTransformation writer thread into own class.
8b162b6 [Josh Rosen] Add failing test which demonstrates exception masking issue
4ee36a2 [Josh Rosen] Kill script transform subprocess when error occurs in input writer.
bd4c948 [Josh Rosen] Skip launching of external command for empty partitions.
b43e4ec [Josh Rosen] Clean up nullability in ScriptTransformation
fa18d26 [Josh Rosen] Add basic unit test for script transform with 'cat' command.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR introduce BytesToBytesMap to UnsafeHashedRelation, use it in executor for better performance.
It serialize all the key and values from java HashMap, put them into a BytesToBytesMap while deserializing. All the values for a same key are stored continuous to have better memory locality.
This PR also address the comments for #7480 , do some clean up.
Author: Davies Liu <davies@databricks.com>
Closes #7592 from davies/unsafe_map2 and squashes the following commits:
42c578a [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_map2
fd09528 [Davies Liu] remove thread local cache and update docs
1c5ad8d [Davies Liu] fix test
5eb1b5a [Davies Liu] address comments in #7480
46f1f22 [Davies Liu] fix style
fc221e0 [Davies Liu] use BytesToBytesMap for broadcast join
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added virtual column support by adding a new resolution role to the query analyzer. Additional virtual columns can be added by adding case expressions to [the new rule](https://github.com/JDrit/spark/blob/virt_columns/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L1026) and my modifying the [logical plan](https://github.com/JDrit/spark/blob/virt_columns/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L216) to resolve them.
This also solves [SPARK-8003](https://issues.apache.org/jira/browse/SPARK-8003)
This allows you to perform queries such as:
```sql
select spark__partition__id, count(*) as c from table group by spark__partition__id;
```
Author: Joseph Batchik <josephbatchik@gmail.com>
Author: JD <jd@csh.rit.edu>
Closes #7478 from JDrit/virt_columns and squashes the following commits:
7932bf0 [Joseph Batchik] adding spark__partition__id to hive as well
f8a9c6c [Joseph Batchik] merging in master
e49da48 [JD] fixes for @rxin's suggestions
60e120b [JD] fixing test in merge
4bf8554 [JD] merging in master
c68bc0f [Joseph Batchik] Adding function register ability to SQLContext and adding a function for spark__partition__id()
|
|
|
|
|
|
|
|
|
|
|
|
| |
current_timestamp.
This test is flaky. https://issues.apache.org/jira/browse/SPARK-9196 will track the fix of it. For now, let's disable this test.
Author: Yin Huai <yhuai@databricks.com>
Closes #7727 from yhuai/SPARK-9196-ignore and squashes the following commits:
f92bded [Yin Huai] Ignore current_timestamp.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
applicable
Certain applications would benefit from being able to inspect DataFrames that are straightforwardly produced by data sources that stem from files, and find out their source data. For example, one might want to display to a user the size of the data underlying a table, or to copy or mutate it.
This PR exposes an `inputFiles` method on DataFrame which attempts to discover the source data in a best-effort manner, by inspecting HadoopFsRelations and JSONRelations.
Author: Aaron Davidson <aaron@databricks.com>
Closes #7717 from aarondav/paths and squashes the following commits:
ff67430 [Aaron Davidson] inputFiles
0acd3ad [Aaron Davidson] [SPARK-9397] DataFrame should provide an API to find source data files if applicable
|
|
|
|
|
|
|
|
|
|
|
|
| |
The original patch didn't handle nulls correctly for next_day.
Author: Reynold Xin <rxin@databricks.com>
Closes #7718 from rxin/next_day and squashes the following commits:
616a425 [Reynold Xin] Merged DatetimeExpressionsSuite into DateFunctionsSuite.
faa78cf [Reynold Xin] Merged DatetimeFunctionsSuite into DateExpressionsSuite.
6c4fb6a [Reynold Xin] [SPARK-8196][SQL] Fix null handling & documentation for next_day.
|
|
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #7720 from rxin/struct-followup and squashes the following commits:
d9757f5 [Reynold Xin] [SPARK-9373][SQL] follow up for StructType support in Tungsten projection.
|
|
|
|
|
|
|
|
|
|
| |
Both expressions already implement code generation.
Author: Reynold Xin <rxin@databricks.com>
Closes #7723 from rxin/abs-formatnum and squashes the following commits:
31ed765 [Reynold Xin] [SPARK-9402][SQL] Remove CodegenFallback from Abs / FormatNumber.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Our CodeFormatter currently does not handle parentheses, and as a result in code dump, we see code formatted this way:
```
foo(
a,
b,
c)
```
With this patch, it is formatted this way:
```
foo(
a,
b,
c)
```
Author: Reynold Xin <rxin@databricks.com>
Closes #7712 from rxin/codeformat-parentheses and squashes the following commits:
c2b1c5f [Reynold Xin] Took square bracket out
3cfb174 [Reynold Xin] Code review feedback.
91f5bb1 [Reynold Xin] [SPARK-9394][SQL] Handle parentheses in CodeFormatter.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is actually contains 3 minor issues:
1) Enable the unit test(codegen) for mutable expressions (FormatNumber, Regexp_Replace/Regexp_Extract)
2) Use the `PlatformDependent.copyMemory` instead of the `System.arrayCopy`
Author: Cheng Hao <hao.cheng@intel.com>
Closes #7566 from chenghao-intel/codegen_ut and squashes the following commits:
24f43ea [Cheng Hao] enable codegen for mutable expression & UTF8String performance
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This pull request updates GenerateUnsafeProjection to support StructType. If an input struct type is backed already by an UnsafeRow, GenerateUnsafeProjection copies the bytes directly into its buffer space without any conversion. However, if the input is not an UnsafeRow, GenerateUnsafeProjection runs the code generated recursively to convert the input into an UnsafeRow and then copies it into the buffer space.
Also create a TungstenProject operator that projects data directly into UnsafeRow. Note that I'm not sure if this is the way we want to structure Unsafe+codegen operators, but we can defer that decision to follow-up pull requests.
Author: Reynold Xin <rxin@databricks.com>
Closes #7689 from rxin/tungsten-struct-type and squashes the following commits:
9162f42 [Reynold Xin] Support IntervalType in UnsafeRow's getter.
be9f377 [Reynold Xin] Fixed tests.
10c4b7c [Reynold Xin] Format generated code.
77e8d0e [Reynold Xin] Fixed NondeterministicSuite.
ac4951d [Reynold Xin] Yay.
ac203bf [Reynold Xin] More comments.
9f36216 [Reynold Xin] Updated comment.
6b781fe [Reynold Xin] Reset the change in DataFrameSuite.
525b95b [Reynold Xin] Merged with master, more documentation & test cases.
321859a [Reynold Xin] [SPARK-9373][SQL] Support StructType in Tungsten projection [WIP]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-8828
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #7667 from yjshen/revert_combinesum_2 and squashes the following commits:
c37ccb1 [Yijie Shen] add test case
8377214 [Yijie Shen] revert spark.sql.useAggregate2 to its default value
e2305ac [Yijie Shen] fix bug - avg on decimal column
7cb0e95 [Yijie Shen] [wip] resolving bugs
1fadb5a [Yijie Shen] remove occurance
17c6248 [Yijie Shen] revert SPARK-5680
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
specialized getters.
As we are adding more and more specialized getters to more classes (coming soon ArrayData), this interface can help us prevent missing a method in some interfaces.
Author: Reynold Xin <rxin@databricks.com>
Closes #7713 from rxin/SpecializedGetters and squashes the following commits:
3b39be1 [Reynold Xin] Added override modifier.
567ba9c [Reynold Xin] [SPARK-9395][SQL] Create a SpecializedGetters interface to track all the specialized getters.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
next_day, returns next certain dayofweek.
last_day, returns the last day of the month which given date belongs to.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #6986 from adrian-wang/udfnlday and squashes the following commits:
ef7e3da [Daoyuan Wang] fix
02b3426 [Daoyuan Wang] address 2 comments
dc69630 [Daoyuan Wang] address comments from rxin
8846086 [Daoyuan Wang] address comments from rxin
d09bcce [Daoyuan Wang] multi fix
1a9de3d [Daoyuan Wang] function next_day and last_day
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since we have been seeing a lot of failures related to this new feature, lets put it behind a flag and turn it off by default.
Author: Michael Armbrust <michael@databricks.com>
Closes #7703 from marmbrus/optionalMetastorePruning and squashes the following commits:
6ad128c [Michael Armbrust] style
8447835 [Michael Armbrust] [SPARK-9386][SQL] Feature flag for metastore partition pruning
fd37b87 [Michael Armbrust] add config flag
|
|
|
|
|
|
|
|
|
|
|
| |
cache code
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7673 from cloud-fan/row-generic-getter-columnar and squashes the following commits:
88b1170 [Wenchen Fan] fix style
eeae712 [Wenchen Fan] Remove Internal.get generic getter call in columnar cache code
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a proper version of PR #7693 authored by viirya
The reason why "CTAS with serde" fails is that the `MetastoreRelation` gets converted to a Parquet data source relation by default.
Author: Cheng Lian <lian@databricks.com>
Closes #7700 from liancheng/spark-9378-fix-ctas-test and squashes the following commits:
4413af0 [Cheng Lian] Fixes test case "CTAS with serde"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://issues.apache.org/jira/browse/SPARK-9349
With this PR, we only expose `UserDefinedAggregateFunction` (an abstract class) and `MutableAggregationBuffer` (an interface). Other internal wrappers and helper classes are moved to `org.apache.spark.sql.execution.aggregate` and marked as `private[sql]`.
Author: Yin Huai <yhuai@databricks.com>
Closes #7687 from yhuai/UDAF-cleanup and squashes the following commits:
db36542 [Yin Huai] Add comments to UDAF examples.
ae17f66 [Yin Huai] Address comments.
9c9fa5f [Yin Huai] UDAF cleanup.
|
|
|
|
|
|
|
|
|
|
| |
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7688 from cloud-fan/interval and squashes the following commits:
5b36b17 [Wenchen Fan] fix codegen
a99ed50 [Wenchen Fan] address comment
9e6d319 [Wenchen Fan] Support IntervalType in UnsafeRow
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
literals in grouping expressions have no effect at all, only make our grouping key bigger, so we should remove them in Optimizer.
I also make old and new aggregation code consistent about literals in grouping here. In old aggregation, actually literals in grouping are already removed but new aggregation is not. So I explicitly make it a rule in Optimizer.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7583 from cloud-fan/minor and squashes the following commits:
471adff [Wenchen Fan] add test
0839925 [Wenchen Fan] use transformDown when rewrite final result expressions
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make this test deterministic, i.e. make sure this test can be passed no matter how many times we run it.
The origin implementation uses a random seed and gives a chance that we may break the null check assertion `assert(Iterator.fill(100)(generator()).contains(null))`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7691 from cloud-fan/seed and squashes the following commits:
eae7281 [Wenchen Fan] use a seed in RandomDataGeneratorSuite
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
UnsafeExternalSorter
This patch fixes two bugs in UnsafeExternalSorter and UnsafeExternalRowSorter:
- UnsafeExternalSorter does not properly update freeSpaceInCurrentPage, which can cause it to write past the end of memory pages and trigger segfaults.
- UnsafeExternalRowSorter has a use-after-free bug when returning the last row from an iterator.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #7680 from JoshRosen/SPARK-9364 and squashes the following commits:
590f311 [Josh Rosen] null out row
f4cf91d [Josh Rosen] Fix use-after-free bug in UnsafeExternalRowSorter.
8abcf82 [Josh Rosen] Properly decrement freeSpaceInCurrentPage in UnsafeExternalSorter
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR is based on #6796 authored by rtreffer.
To support large decimal precisions (> 18), we do the following things in this PR:
1. Making `CatalystSchemaConverter` support large decimal precision
Decimal types with large precision are always converted to fixed-length byte array.
2. Making `CatalystRowConverter` support reading decimal values with large precision
When the precision is > 18, constructs `Decimal` values with an unscaled `BigInteger` rather than an unscaled `Long`.
3. Making `RowWriteSupport` support writing decimal values with large precision
In this PR we always write decimals as fixed-length byte array, because Parquet write path hasn't been refactored to conform Parquet format spec (see SPARK-6774 & SPARK-8848).
Two follow-up tasks should be done in future PRs:
- [ ] Writing decimals as `INT32`, `INT64` when possible while fixing SPARK-8848
- [ ] Adding compatibility tests as part of SPARK-5463
Author: Cheng Lian <lian@databricks.com>
Closes #7455 from liancheng/spark-4176 and squashes the following commits:
a543d10 [Cheng Lian] Fixes errors introduced while rebasing
9e31cdf [Cheng Lian] Supports decimals with precision > 18 for Parquet
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
multi-database support
This PR fixes a set of issues related to multi-database. A new data structure `TableIdentifier` is introduced to identify a table among multiple databases. We should stop using a single `String` (table name without database name), or `Seq[String]` (optional database name plus table name) to identify tables internally.
Author: Cheng Lian <lian@databricks.com>
Closes #7623 from liancheng/spark-8131-multi-db and squashes the following commits:
f3bcd4b [Cheng Lian] Addresses PR comments
e0eb76a [Cheng Lian] Fixes styling issues
41e2207 [Cheng Lian] Fixes multi-database support
d4d1ec2 [Cheng Lian] Adds multi-database test cases
|
|
|
|
|
|
|
|
|
|
| |
context
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7684 from cloud-fan/hive and squashes the following commits:
da21ffe [Wenchen Fan] fix the support for special chars in column names for hive context
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #7682 from rxin/unsaferow-generic-getter and squashes the following commits:
3063788 [Reynold Xin] Reset the change for real this time.
0f57c55 [Reynold Xin] Reset the changes in ExpressionEvalHelper.
fb6ca30 [Reynold Xin] Support BinaryType.
24a3e46 [Reynold Xin] Added support for DateType/TimestampType.
9989064 [Reynold Xin] JoinedRow.
11f80a3 [Reynold Xin] [SPARK-9368][SQL] Support get(ordinal, dataType) generic getter in UnsafeRow.
|
|
|
|
|
|
|
|
|
|
|
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-9306
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #7645 from viirya/smj_unsortable and squashes the following commits:
a240707 [Liang-Chi Hsieh] Use forall instead of exists for readability.
55221fa [Liang-Chi Hsieh] Shouldn't use SortMergeJoin when joining on unsortable columns.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As Hive does, we need to list all of the registered UDF and its usage for user.
We add the annotation to describe a UDF, so we can get the literal description info while registering the UDF.
e.g.
```scala
ExpressionDescription(
usage = "_FUNC_(expr) - Returns the absolute value of the numeric value",
extended = """> SELECT _FUNC_('-1')
1""")
case class Abs(child: Expression) extends UnaryArithmetic {
...
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #7259 from chenghao-intel/desc_function and squashes the following commits:
cf29bba [Cheng Hao] fixing the code style issue
5193855 [Cheng Hao] Add more powerful parser for show functions
c645a6b [Cheng Hao] fix bug in unit test
78d40f1 [Cheng Hao] update the padding issue for usage
48ee4b3 [Cheng Hao] update as feedback
70eb4e9 [Cheng Hao] add show/describe function support
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR removes the old Parquet support:
- Removes the old `ParquetRelation` together with related SQL configuration, plan nodes, strategies, utility classes, and test suites.
- Renames `ParquetRelation2` to `ParquetRelation`
- Renames `RowReadSupport` and `RowRecordMaterializer` to `CatalystReadSupport` and `CatalystRecordMaterializer` respectively, and moved them to separate files.
This follows naming convention used in other Parquet data models implemented in parquet-mr. It should be easier for developers who are familiar with Parquet to follow.
There's still some other code that can be cleaned up. Especially `RowWriteSupport`. But I'd like to leave this part to SPARK-8848.
Author: Cheng Lian <lian@databricks.com>
Closes #7441 from liancheng/spark-9095 and squashes the following commits:
c7b6e38 [Cheng Lian] Removes WriteToFile
2d688d6 [Cheng Lian] Renames ParquetRelation2 to ParquetRelation
ca9e1b7 [Cheng Lian] Removes old Parquet support
|
|
|
|
|
|
|
|
|
|
|
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-9356
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #7671 from yjshen/deprecated_unlimit and squashes the following commits:
c707f56 [Yijie Shen] remove pattern matching in changePrecision
4a1823c [Yijie Shen] remove internal occurrence of Decimal.Unlimited
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
integration code.
Replaced them with get(ordinal, datatype) so we can use UnsafeRow here.
I passed the data types throughout.
Author: Reynold Xin <rxin@databricks.com>
Closes #7669 from rxin/row-generic-getter-hive and squashes the following commits:
3467d8e [Reynold Xin] [SPARK-9354][SQL] Remove Internal.get generic getter call in Hive integration code.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
DataType
Currently UnsafeRow cannot support a generic getter. However, if the data type is known, we can support a generic getter.
Author: Reynold Xin <rxin@databricks.com>
Closes #7666 from rxin/generic-getter-with-datatype and squashes the following commits:
ee2874c [Reynold Xin] Add a default implementation for getStruct.
1e109a0 [Reynold Xin] [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType.
033ee88 [Reynold Xin] Removed getAs in non test code.
|
|
|
|
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #7665 from rxin/remove-row-apply and squashes the following commits:
0b43001 [Reynold Xin] support getString in UnsafeRow.
176d633 [Reynold Xin] apply -> get.
2941324 [Reynold Xin] [SPARK-9348][SQL] Remove apply method on InternalRow.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently nondeterministic expression is broken without a explicit initialization phase.
Let me take `MonotonicallyIncreasingID` as an example. This expression need a mutable state to remember how many times it has been evaluated, so we use `transient var count: Long` there. By being transient, the `count` will be reset to 0 and **only** to 0 when serialize and deserialize it, as deserialize transient variable will result to default value. There is *no way* to use another initial value for `count`, until we add the explicit initialization phase.
Another use case is local execution for `LocalRelation`, there is no serialize and deserialize phase and thus we can't reset mutable states for it.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #7535 from cloud-fan/init and squashes the following commits:
6c6f332 [Wenchen Fan] add test
ef68ff4 [Wenchen Fan] fix comments
9eac85e [Wenchen Fan] move init code to interpreted class
bb7d838 [Wenchen Fan] pulls out nondeterministic expressions into a project
b4a4fc7 [Wenchen Fan] revert a refactor
86fee36 [Wenchen Fan] add initialization phase for nondeterministic expression
|
|
|
|
|
|
|
|
|
|
|
| |
This is a follow-up of #7626. It fixes `Row`/`InternalRow` conversion for data sources extending `HadoopFsRelation` with `needConversion` being `true`.
Author: Cheng Lian <lian@databricks.com>
Closes #7649 from liancheng/spark-9285-conversion-fix and squashes the following commits:
036a50c [Cheng Lian] Addresses PR comment
f6d7c6a [Cheng Lian] Fixes Row/InternalRow conversion for HadoopFsRelation
|