| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now.
Author: Sean Owen <sowen@cloudera.com>
Closes #3157 from srowen/SPARK-4297 and squashes the following commits:
8c9e469 [Sean Owen] Suppress unchecked cast warnings, and several other build warning fixes
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR modifies the python `SchemaRDD` to use `sample()` and `takeSample()` from Scala instead of the slower python implementations from `rdd.py`. This is worthwhile because the `Row`'s are already serialized as Java objects.
In order to use the faster `takeSample()`, a `takeSampleToPython()` method was implemented in `SchemaRDD.scala` following the pattern of `collectToPython()`.
Author: jbencook <jbenjamincook@gmail.com>
Author: J. Benjamin Cook <jbenjamincook@gmail.com>
Closes #3764 from jbencook/master and squashes the following commits:
6fbc769 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing sloppy indentation for takeSampleToPython() arguments
5170da2 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing typo: from RDD to SchemaRDD
de22f70 [jbencook] [SPARK-4860][pyspark][sql] using sample() method from JavaSchemaRDD
b916442 [jbencook] [SPARK-4860][pyspark][sql] adding sample() to JavaSchemaRDD
020cbdf [jbencook] [SPARK-4860][pyspark][sql] using Scala implementations of `sample()` and `takeSample()`
|
|
|
|
|
|
|
|
|
|
| |
Minor fix for an obvious scala doc error.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #3751 from viirya/fix_scaladoc and squashes the following commits:
03fddaa [Liang-Chi Hsieh] Fix scala doc.
|
|
|
|
|
|
|
|
|
|
| |
HiveInspectors.scala failed in compiling with Hadoop 1, as the BytesWritable.copyBytes is not available in Hadoop 1.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3742 from chenghao-intel/settable_oi_hotfix and squashes the following commits:
bb04d1f [Cheng Hao] hot fix for ByteWritables.copyBytes
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Remove ```Command``` and use ```RunnableCommand``` instead.
Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>
Closes #3712 from scwf/cmd and squashes the following commits:
51a82f2 [wangfei] fix test failure
0e03be8 [wangfei] address comments
4033bed [scwf] remove CreateTableAsSelect in hivestrategy
5d20010 [wangfei] address comments
125f542 [scwf] factory command in spark sql
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Hive UDAF may create an customized object constructed by SettableStructObjectInspector, this is critical when integrate Hive UDAF with the refactor-ed UDAF interface.
Performance issue in `wrap/unwrap` since more match cases added, will do it in another PR.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3429 from chenghao-intel/settable_oi and squashes the following commits:
9f0aff3 [Cheng Hao] update code style issues as feedbacks
2b0561d [Cheng Hao] Add more scala doc
f5a40e8 [Cheng Hao] add scala doc
2977e9b [Cheng Hao] remove the timezone setting for test suite
3ed284c [Cheng Hao] fix the date type comparison
f1b6749 [Cheng Hao] Update the comment
932940d [Cheng Hao] Add more unit test
72e4332 [Cheng Hao] Add settable StructObjectInspector support
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adding support to the partial aggregation of SumDistinct
Author: ravipesala <ravindra.pesala@huawei.com>
Closes #3348 from ravipesala/SPARK-2554 and squashes the following commits:
fd28e4d [ravipesala] Fixed review comments
e60e67f [ravipesala] Fixed test cases and made it as nullable
32fe234 [ravipesala] Supporting SumDistinct partial aggregation Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
empty AttributeSet() references
The sql "select * from spark_test::for_test where abs(20141202) is not null" has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and
partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the exception "java.lang.IllegalArgumentException: requirement failed: Partition pruning predicates only supported for partitioned tables." is thrown.
The sql "select * from spark_test::for_test_partitioned_table where abs(20141202) is not null and type_id=11 and platform = 3" with partitioned key insert_date has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).
Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Closes #3556 from YanTangZhai/SPARK-4693 and squashes the following commits:
620ebe3 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
37cfdf5 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
70a3544 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
efa9b03 [YanTangZhai] Update HiveQuerySuite.scala
72accf1 [YanTangZhai] Update HiveQuerySuite.scala
e572b9a [YanTangZhai] Update HiveStrategies.scala
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
expire
**sessionToActivePool** in **SparkSQLOperationManager** grow infinitely, even as sessions expire.
we should remove the pool value when the session closed, even though **sessionToActivePool** would not exist in all of sessions.
Author: guowei2 <guowei2@asiainfo.com>
Closes #3617 from guowei2/SPARK-4756 and squashes the following commits:
e9b97b8 [guowei2] fix compile bug with Shim12
cf0f521 [guowei2] Merge remote-tracking branch 'apache/master' into SPARK-4756
e070998 [guowei2] fix: remove active pool of the session when it expired
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
...arquetFile accept hadoop glob pattern in path.
Author: Thu Kyaw <trk007@gmail.com>
Closes #3407 from tkyaw/master and squashes the following commits:
19115ad [Thu Kyaw] Merge https://github.com/apache/spark
ceded32 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
d322c28 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
ce677c6 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add support for `GROUPING SETS`, `ROLLUP`, `CUBE` and the the virtual column `GROUPING__ID`.
More details on how to use the `GROUPING SETS" can be found at: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf
The generic idea of the implementations are :
1 Replace the `ROLLUP`, `CUBE` with `GROUPING SETS`
2 Explode each of the input row, and then feed them to `Aggregate`
* Each grouping set are represented as the bit mask for the `GroupBy Expression List`, for each bit, `1` means the expression is selected, otherwise `0` (left is the lower bit, and right is the higher bit in the `GroupBy Expression List`)
* Several of projections are constructed according to the grouping sets, and within each projection(Seq[Expression), we replace those expressions with `Literal(null)` if it's not selected in the grouping set (based on the bit mask)
* Output Schema of `Explode` is `child.output :+ grouping__id`
* GroupBy Expressions of `Aggregate` is `GroupBy Expression List :+ grouping__id`
* Keep the `Aggregation expressions` the same for the `Aggregate`
The expressions substitutions happen in Logic Plan analyzing, so we will benefit from the Logical Plan optimization (e.g. expression constant folding, and map side aggregation etc.), Only an `Explosive` operator added for Physical Plan, which will explode the rows according the pre-set projections.
A known issue will be done in the follow up PR:
* Optimization `ColumnPruning` is not supported yet for `Explosive` node.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1567 from chenghao-intel/grouping_sets and squashes the following commits:
fe65fcc [Cheng Hao] Remove the extra space
3547056 [Cheng Hao] Add more doc and Simplify the Expand
a7c869d [Cheng Hao] update code as feedbacks
d23c672 [Cheng Hao] Add GroupingExpression to replace the Seq[Expression]
414b165 [Cheng Hao] revert the unnecessary changes
ec276c6 [Cheng Hao] Support Rollup/Cube/GroupingSets
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
constant inspectors support
Supported passing array to percentile and percentile_approx UDAFs
To support percentile_approx, constant inspectors are supported for GenericUDAF
Constant folding support added to CreateArray expression
Avoided constant udf expression re-evaluation
Author: Venkata Ramana G <ramana.gollamudihuawei.com>
Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
Closes #2802 from gvramana/percentile_array_support and squashes the following commits:
a0182e5 [Venkata Ramana Gollamudi] fixed review comment
a18f917 [Venkata Ramana Gollamudi] avoid constant udf expression re-evaluation - fixes failure due to return iterator and value type mismatch
c46db0f [Venkata Ramana Gollamudi] Removed TestHive reset
4d39105 [Venkata Ramana Gollamudi] Unified inspector creation, style check fixes
f37fd69 [Venkata Ramana Gollamudi] Fixed review comments
47f6365 [Venkata Ramana Gollamudi] fixed test
cb7c61e [Venkata Ramana Gollamudi] Supported ConstantInspector for UDAF Fixed HiveUdaf wrap object issue.
7f94aff [Venkata Ramana Gollamudi] Added foldable support to CreateArray
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
empty string or nul...
```
TestSQLContext.sparkContext.parallelize(
"""{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
"""{"ip":"27.31.100.29","headers":{}}""" ::
"""{"ip":"27.31.100.29","headers":""}""" :: Nil)
```
As empty string (the "headers") will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type "headers" in line 1), and also take the line 1 (the "headers") as String Type, which is not our expected.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3708 from chenghao-intel/json and squashes the following commits:
e7a72e9 [Cheng Hao] add more concise unit test
853de51 [Cheng Hao] NullType instead of StringType when sampling against empty string or null value
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #3727 from marmbrus/parquetNotEq and squashes the following commits:
2157bfc [Michael Armbrust] Fix parquet filter suite
|
|
|
|
|
|
|
|
|
|
|
|
| |
In local mode, Hadoop/Hive will ignore the "mapred.map.tasks", hence for small table file, it's always a single input split, however, SparkSQL doesn't honor that in table scanning, and we will get different result when do the Hive Compatibility test. This PR will fix that.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #2589 from chenghao-intel/source_split and squashes the following commits:
dff38e7 [Cheng Hao] Remove the extra blank line
160a2b6 [Cheng Hao] fix the compiling bug
04d67f7 [Cheng Hao] Keep 1 split for small file in table scanning
|
|
|
|
|
|
|
|
|
| |
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #3616 from adrian-wang/sqrt and squashes the following commits:
d877439 [Daoyuan Wang] fix NULLTYPE
3effa2c [Daoyuan Wang] sqrt(negative value) should return null
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
with nulls for Parquet
Predicates like `a = NULL` and `a < NULL` can't be pushed down since Parquet `Lt`, `LtEq`, `Gt`, `GtEq` doesn't accept null value. Note that `Eq` and `NotEq` can only be used with `null` to represent predicates like `a IS NULL` and `a IS NOT NULL`.
However, normally this issue doesn't cause NPE because any value compared to `NULL` results `NULL`, and Spark SQL automatically optimizes out `NULL` predicate in the `SimplifyFilters` rule. Only testing code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as blocker and I do **NOT** think we need to backport this to branch-1.1
This PR restricts `Lt`, `LtEq`, `Gt` and `GtEq` to non-null values only, and only uses `Eq` with null value to pushdown `IsNull` and `IsNotNull`. Also, added support for Parquet `NotEq` filter for completeness and (tiny) performance gain, it's also used to pushdown `IsNotNull`.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3367)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #3367 from liancheng/filters-with-null and squashes the following commits:
cc41281 [Cheng Lian] Fixes several styling issues
de7de28 [Cheng Lian] Adds stricter rules for Parquet filters with null
|
|
|
|
|
|
|
|
|
|
| |
Based on #2543.
Author: Michael Armbrust <michael@databricks.com>
Closes #3724 from marmbrus/resolveGetField and squashes the following commits:
0a47aae [Michael Armbrust] Fix case insensitive resolution of GetField.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
HiveThriftServer2 can not exit automactic when changing the standy resource manager in Yarn HA mode.
The scheduler backend was aware of the AM had been exited so it call sc.stop to exit the driver process but there was a user thread(HiveThriftServer2 ) which was still alive and cause this problem.
To fix it, make a demo thread to detect the sparkContext is null or not.If the sc is stopped, call the ThriftServer.stop to stop the user thread.
Author: carlmartin <carlmartinmax@gmail.com>
Closes #3576 from SaintBacchus/ThriftServer2ExitBug and squashes the following commits:
2890b4a [carlmartin] Use SparkListener instead of the demo thread to stop the hive server.
c15da0e [carlmartin] HiveThriftServer2 can not exit automactic when changing the standy resource manager in Yarn HA mode
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add `sort by` support for both DSL & SqlParser.
This PR is relevant with #3386, either one merged, will cause the other rebased.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3481 from chenghao-intel/sortby and squashes the following commits:
041004f [Cheng Hao] Add sort by for DSL & SimpleSqlParser
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Using lowercase for ```options``` key to make it case-insensitive, then we should use lower case to get value from parameters.
So flowing cmd work
```
create temporary table normal_parquet
USING org.apache.spark.sql.parquet
OPTIONS (
PATH '/xxx/data'
)
```
Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>
Closes #3470 from scwf/ddl-ulcase and squashes the following commits:
ae78509 [scwf] address comments
8f4f585 [wangfei] address comments
3c132ef [scwf] minor fix
a0fc20b [scwf] Merge branch 'master' of https://github.com/apache/spark into ddl-ulcase
4f86401 [scwf] adding CaseInsensitiveMap
e244e8d [wangfei] using lower case in json
e0cb017 [wangfei] make options in-casesensitive
|
|
|
|
|
|
|
|
|
|
|
| |
This PR brings support of using StructType(and other hashable types) as key in MapType.
Author: Davies Liu <davies@databricks.com>
Closes #3714 from davies/fix_struct_in_map and squashes the following commits:
68585d7 [Davies Liu] fix primitive types in MapType
9601534 [Davies Liu] support StructType as key in MapType
|
|
|
|
|
|
|
|
| |
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3595 from chenghao-intel/udf0 and squashes the following commits:
a858973 [Cheng Hao] Add 0 arguments support for udf
|
|
|
|
|
|
|
|
|
|
| |
This is a follow-up of SPARK-4593 (#3443).
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #3581 from ueshin/issues/SPARK-4720 and squashes the following commits:
c3959d4 [Takuya UESHIN] Make Remainder return null if the divider is 0.
|
|
|
|
|
|
|
|
| |
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3606 from chenghao-intel/codegen_short_circuit and squashes the following commits:
f466303 [Cheng Hao] short circuit for AND & OR
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR provides a set Parquet testing API (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API are added and aim to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, old testing code are not removed yet. The following classes can be safely removed after most Parquet related PRs are handled:
- `ParquetQuerySuite`
- `ParquetTestData`
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3644)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #3644 from liancheng/parquet-tests and squashes the following commits:
800e745 [Cheng Lian] Enforces ordering of test output
3bb8731 [Cheng Lian] Refactors HiveParquetSuite
aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext
7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc
7f07af0 [Cheng Lian] Adds a new set of Parquet test suites
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In BroadcastHashJoin, currently it is using a hard coded value (5 minutes) to wait for the execution and broadcast of the small table.
In my opinion, it should be a configurable value since broadcast may exceed 5 minutes in some case, like in a busy/congested network environment.
Author: Jacky Li <jacky.likun@huawei.com>
Closes #3133 from jackylk/timeout-config and squashes the following commits:
733ac08 [Jacky Li] add spark.sql.broadcastTimeout in SQLConf.scala
557acd4 [Jacky Li] switch to sqlContext.getConf
81a5e20 [Jacky Li] make wait time configurable in BroadcastHashJoin
|
|
|
|
|
|
|
|
|
|
|
|
| |
Project(Star,...)).
Since `AttributeReference` resolution and `*` expansion are currently in separate rules, each pair requires a full iteration instead of being able to resolve in a single pass. Since its pretty easy to construct queries that have many of these in a row, I combine them into a single rule in this PR.
Author: Michael Armbrust <michael@databricks.com>
Closes #3674 from marmbrus/projectStars and squashes the following commits:
d83d6a1 [Michael Armbrust] Fix resolution of deeply nested Project(attr, Project(Star,...)).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In `HashOuterJoin.scala`, spark read data from both side of join operation before zip them together. It is a waste for memory. We are trying to read data from only one side, put them into a hashmap, and then generate the `JoinedRow` with data from other side one by one.
Currently, we could only do this optimization for `left outer join` and `right outer join`. For `full outer join`, we will do something in another issue.
for
table test_csv contains 1 million records
table dim_csv contains 10 thousand records
SQL:
`select * from test_csv a left outer join dim_csv b on a.key = b.key`
the result is:
master:
```
CSV: 12671 ms
CSV: 9021 ms
CSV: 9200 ms
Current Mem Usage:787788984
```
after patch:
```
CSV: 10382 ms
CSV: 7543 ms
CSV: 7469 ms
Current Mem Usage:208145728
```
Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes #3375 from tianyi/SPARK-4483 and squashes the following commits:
72a8aec [tianyi] avoid having mutable state stored inside of the task
99c5c97 [tianyi] performance optimization
d2f94d7 [tianyi] fix bug: missing output when the join-key is null.
2be45d1 [tianyi] fix spell bug
1f2c6f1 [tianyi] remove commented codes
a676de6 [tianyi] optimize some codes
9e7d5b5 [tianyi] remove commented old codes
838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin
|
|
|
|
|
|
|
|
|
|
|
| |
In `JoinSuite` add BroadcastNestedLoopJoin operator selection testsuite
Author: wangxiaojing <u9jing@gmail.com>
Closes #3395 from wangxiaojing/SPARK-4527 and squashes the following commits:
ea0e495 [wangxiaojing] change style
53c3952 [wangxiaojing] Add BroadcastNestedLoopJoin operator selection testsuite
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The problem is `codegenEnabled` is `val`, but it uses a `val` `sqlContext`, which can be override by subclasses. Here is a simple example to show this issue.
```Scala
scala> :paste
// Entering paste mode (ctrl-D to finish)
abstract class Foo {
protected val sqlContext = "Foo"
val codegenEnabled: Boolean = {
println(sqlContext) // it will call subclass's `sqlContext` which has not yet been initialized.
if (sqlContext != null) {
true
} else {
false
}
}
}
class Bar extends Foo {
override val sqlContext = "Bar"
}
println(new Bar().codegenEnabled)
// Exiting paste mode, now interpreting.
null
false
defined class Foo
defined class Bar
```
We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly.
Author: zsxwing <zsxwing@gmail.com>
Closes #3660 from zsxwing/SPARK-4812 and squashes the following commits:
1cbb623 [zsxwing] Make `sqlContext` final to prevent subclasses from overriding it incorrectly
|
|
|
|
|
|
|
|
| |
Author: jerryshao <saisai.shao@intel.com>
Closes #3698 from jerryshao/SPARK-4847 and squashes the following commits:
4741130 [jerryshao] Make later added extraStrategies effect when calling strategies
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add HTTP protocol support and test cases to spark thrift server, so users can deploy thrift server in both TCP and http mode.
Author: Judy Nash <judynash@microsoft.com>
Author: judynash <judynash@microsoft.com>
Closes #3672 from judynash/master and squashes the following commits:
526315d [Judy Nash] correct spacing on startThriftServer method
31a6520 [Judy Nash] fix code style issues and update sql programming guide format issue
47bf87e [Judy Nash] modify withJdbcStatement method definition to meet less than 100 line length
2e9c11c [Judy Nash] add thrift server in http mode documentation on sql programming guide
1cbd305 [Judy Nash] Merge remote-tracking branch 'upstream/master'
2b1d312 [Judy Nash] updated http thrift server support based on feedback
377532c [judynash] add HTTP protocol spark thrift server
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
from Hive's LazyBinaryInteger
This enables assertions for the Maven and SBT build, but overrides the Hive module to not enable assertions.
Author: Sean Owen <sowen@cloudera.com>
Closes #3692 from srowen/SPARK-4814 and squashes the following commits:
caca704 [Sean Owen] Disable assertions just for Hive
f71e783 [Sean Owen] Enable assertions for SBT and Maven build
|
|
|
|
|
|
|
|
| |
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #3676 from adrian-wang/countexpr and squashes the following commits:
dc5765b [Daoyuan Wang] add rule to fold count(expr) if expr is not null
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
AppendingParquetOutputFormat should be zero padded
When I use Parquet File as a output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded while RDD#saveAsText does zero padding.
Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
Closes #3602 from sasakitoa/parquet-zeroPadding and squashes the following commits:
6b0e58f [Sasaki Toru] Merge branch 'master' of git://github.com/apache/spark into parquet-zeroPadding
20dc79d [Sasaki Toru] Fixed the name of Parquet File generated by AppendingParquetOutputFormat
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix bug when query like:
```
test("save join to table") {
val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString))
sql("CREATE TABLE test1 (key INT, value STRING)")
testData.insertInto("test1")
sql("CREATE TABLE test2 (key INT, value STRING)")
testData.insertInto("test2")
testData.insertInto("test2")
sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test")
checkAnswer(
table("test"),
sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
}
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3673 from chenghao-intel/spark_4825 and squashes the following commits:
e8cbd56 [Cheng Hao] alternate the pattern matching order for logical plan:CTAS
e004895 [Cheng Hao] fix bug
|
|
|
|
|
|
|
|
|
|
| |
This is fixed by SPARK-4318 #3184
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #3445 from adrian-wang/emptyaggr and squashes the following commits:
982575e [Daoyuan Wang] enable empty aggr test case
|
|
|
|
|
|
|
|
|
|
|
|
| |
So the optimizations are not valid. Also I think the optimization here is rarely encounter, so removing them will not have influence on performance.
Can we merge #3445 before I add a comparison test case from this?
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #3675 from adrian-wang/sumempty and squashes the following commits:
42df763 [Daoyuan Wang] sum and avg on empty table should always return null
|
|
|
|
|
|
|
|
|
|
|
| |
a follow up of #3547
/cc marmbrus
Author: scwf <wangfei1@huawei.com>
Closes #3563 from scwf/rnc and squashes the following commits:
9395661 [scwf] remove unnecessary condition
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Inserting data of type including `ArrayType.containsNull == false` or `MapType.valueContainsNull == false` or `StructType.fields.exists(_.nullable == false)` into Hive table will fail because `Cast` inserted by `HiveMetastoreCatalog.PreInsertionCasts` rule of `Analyzer` can't handle these types correctly.
Complex type cast rule proposal:
- Cast for non-complex types should be able to cast the same as before.
- Cast for `ArrayType` can evaluate if
- Element type can cast
- Nullability rule doesn't break
- Cast for `MapType` can evaluate if
- Key type can cast
- Nullability for casted key type is `false`
- Value type can cast
- Nullability rule for value type doesn't break
- Cast for `StructType` can evaluate if
- The field size is the same
- Each field can cast
- Nullability rule for each field doesn't break
- The nested structure should be the same.
Nullability rule:
- If the casted type is `nullable == true`, the target nullability should be `true`
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #3150 from ueshin/issues/SPARK-4293 and squashes the following commits:
e935939 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
ba14003 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
8999868 [Takuya UESHIN] Fix a test title.
f677c30 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293
287f410 [Takuya UESHIN] Add tests to insert data of types ArrayType / MapType / StructType with nullability is false into Hive table.
4f71bb8 [Takuya UESHIN] Make Cast be able to handle complex types.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
fix a TODO in Analyzer:
// TODO: pass this in as a parameter
val fixedPoint = FixedPoint(100)
Author: Jacky Li <jacky.likun@huawei.com>
Closes #3499 from jackylk/config and squashes the following commits:
4c1252c [Jacky Li] fix scalastyle
820f460 [Jacky Li] pass maxIterations in as a parameter
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Whitelist more hive unit test:
"create_like_tbl_props"
"udf5"
"udf_java_method"
"decimal_1"
"udf_pmod"
"udf_to_double"
"udf_to_float"
"udf7" (this will fail in Hive 0.12)
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3522 from chenghao-intel/unittest and squashes the following commits:
f54e4c7 [Cheng Hao] work around to clean up the hive.table.parameters.default in reset
16fee22 [Cheng Hao] Whitelist more unittest
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
not persisted
Unpersist a uncached RDD, will not raise exception, for example:
```
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.unpersist(true)
```
But the `SchemaRDD` will raise exception if the `SchemaRDD` is not cached. Since `SchemaRDD` is the subclasses of the `RDD`, we should follow the same behavior.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3572 from chenghao-intel/try_uncache and squashes the following commits:
50a7a89 [Cheng Hao] SchemaRDD.unpersist() should not raise exception if it is not persisted
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now.
Needed for [https://github.com/apache/spark/pull/3637]
CC: marmbrus
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #3646 from jkbradley/sql-reflection and squashes the following commits:
796b2e4 [Joseph K. Bradley] Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
a wrapper
Different from Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with Kryo or XML serializer. Several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created in Hive. Spark Kryo serializer wasn't used because there's no available SparkConf instance.
Author: Cheng Hao <hao.cheng@intel.com>
Author: Cheng Lian <lian@databricks.com>
Closes #3640 from chenghao-intel/udf_serde and squashes the following commits:
8e13756 [Cheng Hao] Update the comment
74466a3 [Cheng Hao] refactor as feedbacks
396c0e1 [Cheng Hao] avoid Simple UDF to be serialized
e9c3212 [Cheng Hao] update the comment
19cbd46 [Cheng Hao] support udf instance ser/de after initialization
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the code refactor and follow ups for #2570
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3336 from chenghao-intel/createtbl and squashes the following commits:
3563142 [Cheng Hao] remove the unused variable
e215187 [Cheng Hao] eliminate the compiling warning
4f97f14 [Cheng Hao] fix bug in unittest
5d58812 [Cheng Hao] revert the API changes
b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS
|
|
|
|
|
|
|
|
| |
Author: Jacky Li <jacky.likun@huawei.com>
Closes #3630 from jackylk/remove and squashes the following commits:
150e7e0 [Jacky Li] remove unnecessary import
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Enables Kryo and disables reference tracking by default in Spark SQL Thrift server. Configurations explicitly defined by users in `spark-defaults.conf` are respected (the Thrift server is started by `spark-submit`, which handles configuration properties properly).
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3621)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #3621 from liancheng/kryo-by-default and squashes the following commits:
70c2775 [Cheng Lian] Enables Kryo by default in Spark SQL Thrift server
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #3613 from marmbrus/parquetPartitionPruning and squashes the following commits:
4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.
|