Adds a new method for evaluating expressions using code that is generated through Scala reflection. This functionality is configured by the SQLConf option `spark.sql.codegen` and is currently turned off by default.
Evaluation can be done in several specialized ways:
- *Projection* - Given an input row, produce a new row from a set of expressions that define each column in terms of the input row. This can either produce a new Row object or perform the projection in-place on an existing Row (MutableProjection).
- *Ordering* - Compares two rows based on a list of `SortOrder` expressions
- *Condition* - Returns `true` or `false` given an input row.
For each of the above operations there is both a Generated and an Interpreted version. When generation for a given expression type is undefined, the code generator falls back on calling the `eval` function of the expression class. Even without custom code, there is still a potential speed-up, as loops are unrolled and code can still be inlined by the JIT.
This PR also contains a new type of Aggregation operator, `GeneratedAggregate`, that performs aggregation by using generated `Projection` code. Currently the required expression rewriting only works for simple aggregations like `SUM` and `COUNT`. This functionality will be extended in a future PR.
This PR also performs several clean ups that simplified the implementation:
- The notion of `Binding` all expressions in a tree automatically before query execution has been removed. Instead it is the responsibility of an operator to provide the input schema when creating one of the specialized evaluators defined above. In cases when the standard eval method is going to be called, binding can still be done manually using `BindReferences`. There are a few reasons for this change: First, there were many operators where it just didn't work before. For example, operators with more than one child, and operators like aggregation that do significant rewriting of the expression. Second, the semantics of equality with `BoundReferences` are broken. Specifically, we have had a few bugs where partitioning breaks because of the binding.
- A copy of the current `SQLContext` is automatically propagated to all `SparkPlan` nodes by the query planner. Previously this was done ad hoc for the nodes that needed it, which required a lot of boilerplate: one always had to remember to mark the field `transient` and to modify `otherCopyArgs`.
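The shapes of the three specialized evaluators can be pictured with a hedged, self-contained sketch (these are not Spark's actual classes; `Seq[Any]` stands in for Spark SQL's internal Row type, and the `ordering` helper assumes Int-typed sort columns for illustration):

```scala
object EvaluatorShapes {
  type Row = Seq[Any]

  // Projection: given an input row, produce a new row from one expression
  // per output column.
  def projection(exprs: Seq[Row => Any]): Row => Row =
    row => exprs.map(e => e(row))

  // Ordering: compare two rows on a list of sort-column indices
  // (illustrative: assumes the sort columns hold Ints).
  def ordering(sortCols: Seq[Int]): Ordering[Row] = new Ordering[Row] {
    def compare(a: Row, b: Row): Int =
      sortCols.iterator
        .map(i => a(i).asInstanceOf[Int] compare b(i).asInstanceOf[Int])
        .find(_ != 0)
        .getOrElse(0)
  }

  // Condition: returns true or false given an input row.
  def condition(pred: Row => Boolean): Row => Boolean = pred
}
```

The generated versions produce the same shapes, but with the per-expression loop unrolled into straight-line code.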
Author: Michael Armbrust <michael@databricks.com>
Closes #993 from marmbrus/newCodeGen and squashes the following commits:
96ef82c [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
f34122d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
67b1c48 [Michael Armbrust] Use conf variable in SQLConf object
4bdc42c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
41a40c9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
de22aac [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
fed3634 [Michael Armbrust] Inspectors are not serializable.
ef8d42b [Michael Armbrust] comments
533fdfd [Michael Armbrust] More logging of expression rewriting for GeneratedAggregate.
3cd773e [Michael Armbrust] Allow codegen for Generate.
64b2ee1 [Michael Armbrust] Implement copy
3587460 [Michael Armbrust] Drop unused string builder function.
9cce346 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
1a61293 [Michael Armbrust] Address review comments.
0672e8a [Michael Armbrust] Address comments.
1ec2d6e [Michael Armbrust] Address comments
033abc6 [Michael Armbrust] off by default
4771fab [Michael Armbrust] Docs, more test coverage.
d30fee2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
d2ad5c5 [Michael Armbrust] Refactor putting SQLContext into SparkPlan. Fix ordering, other test cases.
be2cd6b [Michael Armbrust] WIP: Remove old method for reference binding, more work on configuration.
bc88ecd [Michael Armbrust] Style
6cc97ca [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
4220f1e [Michael Armbrust] Better config, docs, etc.
ca6cc6b [Michael Armbrust] WIP
9d67d85 [Michael Armbrust] Fix hive planner
fc522d5 [Michael Armbrust] Hook generated aggregation in to the planner.
e742640 [Michael Armbrust] Remove unneeded changes and code.
675e679 [Michael Armbrust] Upgrade paradise.
0093376 [Michael Armbrust] Comment / indenting cleanup.
d81f998 [Michael Armbrust] include schema for binding.
0e889e8 [Michael Armbrust] Use typeOf instead tq
f623ffd [Michael Armbrust] Quiet logging from test suite.
efad14f [Michael Armbrust] Remove some half finished functions.
92e74a4 [Michael Armbrust] add overrides
a2b5408 [Michael Armbrust] WIP: Code generation with scala reflection.
---
Author: Michael Armbrust <michael@databricks.com>
Closes #1638 from marmbrus/cachedConfig and squashes the following commits:
2362082 [Michael Armbrust] Use SQLConf to configure in-memory columnar caching
---
For queries like `... HAVING COUNT(*) > 9`, the expression is always resolved since it contains no attributes. This was causing us to skip the HAVING-clause aggregation rewrite.
Author: Michael Armbrust <michael@databricks.com>
Closes #1640 from marmbrus/havingNoRef and squashes the following commits:
92d3901 [Michael Armbrust] Don't check resolved for having filters.
---
logical plans & sample usage.
The idea is that every Catalyst logical plan gets hold of a `Statistics` class, which provides useful estimates of various statistics. See the implementation in `MetastoreRelation` for an example.
This patch also includes several usages of the estimation interface in the planner. For instance, we now use physical table sizes from the estimate interface to convert an equi-join to a broadcast join (when doing so is beneficial, as determined by a size threshold).
Finally, there are a couple minor accompanying changes including:
- Remove the not-in-use `BaseRelation`.
- Make SparkLogicalPlan take a `SQLContext` in the second param list.
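The broadcast-join decision described above can be sketched as follows (a hedged illustration, not the actual Spark classes; `autoConvertJoinSize` is a hypothetical name standing in for the SQLConf threshold):

```scala
// Minimal model of a plan statistic: estimated physical size in bytes.
case class Statistics(sizeInBytes: BigInt)

// If the build side's estimated size is under the threshold, broadcasting it
// to every node is cheaper than shuffling both sides.
def chooseJoin(buildSide: Statistics, autoConvertJoinSize: BigInt): String =
  if (buildSide.sizeInBytes <= autoConvertJoinSize) "BroadcastHashJoin"
  else "ShuffledHashJoin"
```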
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes #1238 from concretevitamin/estimates and squashes the following commits:
329071d [Zongheng Yang] Address review comments; turn config name from string to field in SQLConf.
8663e84 [Zongheng Yang] Use BigInt for stat; for logical leaves, by default throw an exception.
2f2fb89 [Zongheng Yang] Fix statistics for SparkLogicalPlan.
9951305 [Zongheng Yang] Remove childrenStats.
16fc60a [Zongheng Yang] Avoid calling statistics on plans if auto join conversion is disabled.
8bd2816 [Zongheng Yang] Add a note on performance of statistics.
6e594b8 [Zongheng Yang] Get size info from metastore for MetastoreRelation.
01b7a3e [Zongheng Yang] Update scaladoc for a field and move it to @param section.
549061c [Zongheng Yang] Remove numTuples in Statistics for now.
729a8e2 [Zongheng Yang] Update docs to be more explicit.
573e644 [Zongheng Yang] Remove singleton SQLConf and move back `settings` to the trait.
2d99eb5 [Zongheng Yang] {Cleanup, use synchronized in, enrich} StatisticsSuite.
ca5b825 [Zongheng Yang] Inject SQLContext into SparkLogicalPlan, removing SQLConf mixin from it.
43d38a6 [Zongheng Yang] Revert optimization for BroadcastNestedLoopJoin (this fixes tests).
0ef9e5b [Zongheng Yang] Use multiplication instead of sum for default estimates.
4ef0d26 [Zongheng Yang] Make Statistics a case class.
3ba8f3e [Zongheng Yang] Add comment.
e5bcf5b [Zongheng Yang] Fix optimization conditions & update scala docs to explain.
7d9216a [Zongheng Yang] Apply estimation to planning ShuffleHashJoin & BroadcastNestedLoopJoin.
73cde01 [Zongheng Yang] Move SQLConf back. Assign default sizeInBytes to SparkLogicalPlan.
73412be [Zongheng Yang] Move SQLConf to Catalyst & add default val for sizeInBytes.
7a60ab7 [Zongheng Yang] s/Estimates/Statistics, s/cardinality/numTuples.
de3ae13 [Zongheng Yang] Add parquetAfter() properly in test.
dcff9bd [Zongheng Yang] Cleanups.
84301a4 [Zongheng Yang] Refactors.
5bf5586 [Zongheng Yang] Typo.
56a8e6e [Zongheng Yang] Prototype impl of estimations for Catalyst logical plans.
---
Python `datetime` and `time` values are converted into `java.util.Calendar` after serialization, and then into `java.sql.Timestamp` during `inferSchema()`.
In `javaToPython()`, `Timestamp` is converted back into `Calendar`, and then into a Python `datetime` after pickling.
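The two JVM-side conversion steps can be sketched like this (a hedged illustration; the real code lives in Spark's Python serialization path and these helper names are made up):

```scala
import java.util.Calendar
import java.sql.Timestamp

// inferSchema(): Calendar (from the pickled Python datetime) -> Timestamp
def calendarToTimestamp(c: Calendar): Timestamp =
  new Timestamp(c.getTimeInMillis)

// javaToPython(): Timestamp -> Calendar, to be pickled back into a datetime
def timestampToCalendar(t: Timestamp): Calendar = {
  val c = Calendar.getInstance()
  c.setTimeInMillis(t.getTime)
  c
}
```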
Author: Davies Liu <davies.liu@gmail.com>
Closes #1601 from davies/date and squashes the following commits:
f0599b0 [Davies Liu] remove tests for sets and tuple in sql, fix list of list
c9d607a [Davies Liu] convert datetype for runtime
709d40d [Davies Liu] remove brackets
96db384 [Davies Liu] support datetime type for SchemaRDD
---
twice
JIRA: https://issues.apache.org/jira/browse/SPARK-2730
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1637 from yhuai/SPARK-2730 and squashes the following commits:
1a9f24e [Yin Huai] Remove unnecessary key evaluation.
---
1. there is no `hook_context.q`, but there is a `hook_context_cs.q`, in the query folder
2. there is no `compute_stats_table.q` in the query folder
3. there is no `having1.q` in the query folder
4. `udf_E` and `udf_PI` appear twice in the white list
Author: Daoyuan <daoyuan.wang@intel.com>
Closes #1634 from adrian-wang/testcases and squashes the following commits:
d7482ce [Daoyuan] change some test lists
---
Author: Aaron Staple <astaple@gmail.com>
Closes #1630 from staple/minor and squashes the following commits:
6f295a2 [Aaron Staple] Fix typos in comment about ExprId.
8566467 [Aaron Staple] Fix off by one column indentation in SqlParser.
---
Author: Yadong Qi <qiyadong2010@gmail.com>
Closes #1629 from watermen/bug-fix2 and squashes the following commits:
59b7237 [Yadong Qi] Update HiveQl.scala
---
JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
Another try for #1399 & #1600. Those two PRs broke Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module was defined outside the `hive-thriftserver` profile. Thus every pull request, even one that didn't touch SQL code, would also execute the test suites defined in `hive-thriftserver`, and those tests failed because the related .class files were not included in the assembly jar.
In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.
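The fix can be pictured as a POM fragment like the following (a hand-written illustration, not the exact diff from the PR): the module is listed only inside the profile that gates it, so builds that don't activate `-Phive-thriftserver` never compile or test it.

```xml
<!-- Illustrative only: the module definition lives inside its gating profile -->
<profiles>
  <profile>
    <id>hive-thriftserver</id>
    <modules>
      <module>sql/hive-thriftserver</module>
    </modules>
  </profile>
</profiles>
```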
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:
629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
---
In HiveTableScan.scala, a single ObjectInspector was created for all of the partition-based records, which can cause a ClassCastException if the object inspector is not identical among the table & partitions.
This is the follow up with:
https://github.com/apache/spark/pull/1408
https://github.com/apache/spark/pull/1390
I ran a micro-benchmark locally with 15,000,000 records in total, and got the results below:
With This Patch | Partition-Based Table | Non-Partition-Based Table
------------ | ------------- | -------------
No | 1927 ms | 1885 ms
Yes | 1541 ms | 1524 ms
The results show that this patch also improves performance.
PS: the benchmark code is attached below. (Thanks liancheng.)
```
package org.apache.spark.sql.hive

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._

object HiveTableScanPrepare extends App {
  case class Record(key: String, value: String)

  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)
  val rdd = sparkContext.parallelize((1 to 3000000).map(i => Record(s"$i", s"val_$i")))

  import hiveContext._

  hql("SHOW TABLES")
  hql("DROP TABLE if exists part_scan_test")
  hql("DROP TABLE if exists scan_test")
  hql("DROP TABLE if exists records")
  rdd.registerAsTable("records")

  hql("""CREATE TABLE part_scan_test (key STRING, value STRING) PARTITIONED BY (part1 string, part2 STRING)
        | ROW FORMAT SERDE
        | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
        | STORED AS RCFILE
      """.stripMargin)
  hql("""CREATE TABLE scan_test (key STRING, value STRING)
        | ROW FORMAT SERDE
        | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
        | STORED AS RCFILE
      """.stripMargin)

  for (part1 <- 2000 until 2001) {
    for (part2 <- 1 to 5) {
      hql(s"""from records
            | insert into table part_scan_test PARTITION (part1='$part1', part2='2010-01-$part2')
            | select key, value
          """.stripMargin)
      hql(s"""from records
            | insert into table scan_test select key, value
          """.stripMargin)
    }
  }
}

object HiveTableScanTest extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("SHOW TABLES")

  val part_scan_test = hql("select key, value from part_scan_test")
  val scan_test = hql("select key, value from scan_test")

  val r_part_scan_test = (0 to 5).map(i => benchmark(part_scan_test))
  val r_scan_test = (0 to 5).map(i => benchmark(scan_test))

  println("Scanning Partition-Based Table")
  r_part_scan_test.foreach(printResult)
  println("Scanning Non-Partition-Based Table")
  r_scan_test.foreach(printResult)

  def printResult(result: (Long, Long)) {
    println(s"Duration: ${result._1} ms  Result: ${result._2}")
  }

  def benchmark(srdd: SchemaRDD) = {
    val begin = System.currentTimeMillis()
    val result = srdd.count()
    val end = System.currentTimeMillis()
    ((end - begin), result)
  }
}
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1439 from chenghao-intel/hadoop_table_scan and squashes the following commits:
888968f [Cheng Hao] Fix issues in code style
27540ba [Cheng Hao] Fix the TableScan Bug while partition serde differs
40a24a7 [Cheng Hao] Add Unit Test
---
This reverts commit f6ff2a61d00d12481bfb211ae13d6992daacdcc2.
---
(This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)
JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1600 from liancheng/jdbc and squashes the following commits:
ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
---
Author: Michael Armbrust <michael@databricks.com>
Closes #1557 from marmbrus/fixDivision and squashes the following commits:
b85077f [Michael Armbrust] Fix unit tests.
af98f29 [Michael Armbrust] Change DIV to long type
0c29ae8 [Michael Armbrust] Fix division semantics for hive
---
This reverts commit 06dc0d2c6b69c5d59b4d194ced2ac85bfe2e05e2.
#1399 is making Jenkins fail. We should investigate and put this back after it passes tests.
Author: Michael Armbrust <michael@databricks.com>
Closes #1594 from marmbrus/revertJDBC and squashes the following commits:
59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
---
I think it's better to define `hiveQlTable` as a val.
Author: baishuo(白硕) <vc_java@hotmail.com>
Closes #1569 from baishuo/patch-1 and squashes the following commits:
dc2f895 [baishuo(白硕)] Update HiveMetastoreCatalog.scala
a7b32a2 [baishuo(白硕)] Update HiveMetastoreCatalog.scala
---
JIRA issue:
- Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
- Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
(Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)
TODO
- [x] Use `spark-submit` to launch the server, the CLI and beeline
- [x] Migration guideline draft for Shark users
----
Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:
```bash
$ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
```
This actually shows usage information of `SparkSubmit` rather than `BeeLine`.
~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~
**UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1399 from liancheng/thriftserver and squashes the following commits:
090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
---
Hive supports the operator `<=>`, which returns the same result as the EQUAL (`=`) operator for non-null operands, but returns TRUE if both operands are NULL and FALSE if only one of them is NULL.
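These semantics can be modeled with a small hedged sketch (not the Catalyst expression itself), using `Option[Any]` for nullable SQL values, with `None` playing the role of NULL:

```scala
// Null-safe equality: never returns NULL, unlike plain "=".
def equalNullSafe(a: Option[Any], b: Option[Any]): Boolean = (a, b) match {
  case (None, None)          => true   // NULL <=> NULL is TRUE
  case (None, _) | (_, None) => false  // exactly one side NULL: FALSE
  case (Some(x), Some(y))    => x == y // non-null operands: plain equality
}
```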
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1570 from chenghao-intel/equalns and squashes the following commits:
8d6c789 [Cheng Hao] Remove the test case orc_predicate_pushdown
5b2ca88 [Cheng Hao] Add cases into whitelist
8e66cdd [Cheng Hao] Rename the EqualNSTo ==> EqualNullSafe
7af4b0b [Cheng Hao] Add EqualNS & Unit Tests
---
collections to Scala collections JsonRDD.scala
In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a Scala one. These two operations are pretty expensive because they read elements from a Java Map/List and then load them into a Scala Map/List. We can instead use Scala wrappers to wrap those Java collections rather than copying them with toMap/toList.
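A minimal illustration of the difference (not Spark's actual code): `toList` materializes a new Scala collection element by element, while `asScala` returns an O(1) wrapper view over the same underlying Java collection.

```scala
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._

val javaList = new JArrayList[Int]()
(1 to 3).foreach(i => javaList.add(i))

val copied: List[Int] = javaList.asScala.toList // copies every element
val wrapped: Seq[Int] = javaList.asScala        // cheap wrapper, no copy

// Later mutations of the Java list are visible through the wrapper,
// but not through the copy.
javaList.add(4)
```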
I did a quick test to see the performance. I had a 2.9GB cached RDD[String] storing one JSON object per record (twitter dataset). My simple test program is attached below.
```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val jsonData = sc.textFile("...")
jsonData.cache.count
val jsonSchemaRDD = sqlContext.jsonRDD(jsonData)
jsonSchemaRDD.registerAsTable("jt")
sqlContext.sql("select count(*) from jt").collect
```
Stages for the schema inference and the table scan both had 48 tasks. These tasks were executed sequentially. For the current implementation, scanning the JSON dataset will materialize values of all fields of a record. The inferred schema of the dataset can be accessed at https://gist.github.com/yhuai/05fe8a57c638c6666f8d.
From the result, there was no significant difference on running `jsonRDD`. For the simple aggregation query, results are attached below.
```
Original:
Run 1: 26.1s
Run 2: 27.03s
Run 3: 27.035s
With this change:
Run 1: 21.086s
Run 2: 21.035s
Run 3: 21.029s
```
JIRA: https://issues.apache.org/jira/browse/SPARK-2603
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1504 from yhuai/removeToMapToList and squashes the following commits:
6831b77 [Yin Huai] Fix failed tests.
09b9bca [Yin Huai] Merge remote-tracking branch 'upstream/master' into removeToMapToList
d1abdb8 [Yin Huai] Remove unnecessary toMap and toList.
---
Author: Michael Armbrust <michael@databricks.com>
Closes #1556 from marmbrus/fixBooleanEqualsOne and squashes the following commits:
ad8edd4 [Michael Armbrust] Add rule for true = 1 and false = 0.
---
Author: witgo <witgo@qq.com>
Closes #1403 from witgo/hive_compatibility and squashes the following commits:
4e5ecdb [witgo] The default does not run hive compatibility tests
---
resource pool in Spark SQL for Kryo instances.
Author: Ian O Connell <ioconnell@twitter.com>
Closes #1377 from ianoc/feature/SPARK-2102 and squashes the following commits:
5498566 [Ian O Connell] Docs update suggested by Patrick
20e8555 [Ian O Connell] Slight style change
f92c294 [Ian O Connell] Add docs for new KryoSerializer option
f3735c8 [Ian O Connell] Add using a kryo resource pool for the SqlSerializer
4e5c342 [Ian O Connell] Register the SparkConf for kryo, it gets swept into serialization
665805a [Ian O Connell] Add a spark.kryo.registrationRequired option for configuring the Kryo Serializer
---
Instead of shipping just the name and then looking up the info on the workers, we now ship the whole class name. Also, I refactored the file, which was getting pretty large, by moving the type conversion code out to its own file.
Author: Michael Armbrust <michael@databricks.com>
Closes #1552 from marmbrus/fixTempUdfs and squashes the following commits:
b695904 [Michael Armbrust] Make add jar execute with Hive. Ship the whole function class name since sometimes we cannot lookup temporary functions on the workers.
---
aren't in the aggregation list
This change adds an analyzer rule to
1. find expressions in `HAVING` clause filters that depend on unresolved attributes,
2. push these expressions down to the underlying aggregates, and then
3. project them away above the filter.
It also enables the `HAVING` queries in the Hive compatibility suite.
Author: William Benton <willb@redhat.com>
Closes #1497 from willb/spark-2226 and squashes the following commits:
92c9a93 [William Benton] Removed unnecessary import
f1d4f34 [William Benton] Cleanups missed in prior commit
0e1624f [William Benton] Incorporated suggestions from @marmbrus; thanks!
541d4ee [William Benton] Cleanups from review
5a12647 [William Benton] Explanatory comments and stylistic cleanups.
c7f2b2c [William Benton] Whitelist HAVING queries.
29a26e3 [William Benton] Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226)
---
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1491 from ueshin/issues/SPARK-2588 and squashes the following commits:
43d0a46 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2588
1023ea0 [Takuya UESHIN] Modify tests to use DSLs.
2310bf1 [Takuya UESHIN] Add some more DSLs.
---
Currently, using "==" in a HiveQL expression causes an exception to be thrown; this patch fixes it.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1522 from chenghao-intel/equal and squashes the following commits:
f62a0ff [Cheng Hao] Add == Support for HiveQl
---
We need to use the analyzed attributes otherwise we end up with a tree that will never resolve.
Author: Michael Armbrust <michael@databricks.com>
Closes #1470 from marmbrus/fixApplySchema and squashes the following commits:
f968195 [Michael Armbrust] Use analyzed attributes when applying the schema.
4969015 [Michael Armbrust] Add test case.
---
Results may not be returned in the expected order, so relax that constraint.
Author: Aaron Davidson <aaron@databricks.com>
Closes #1514 from aarondav/flakey and squashes the following commits:
e5af823 [Aaron Davidson] Fix flakey HiveQuerySuite test
---
JIRA issue: [SPARK-2190](https://issues.apache.org/jira/browse/SPARK-2190)
Added specialized in-memory column type for `Timestamp`. Whitelisted all timestamp related Hive tests except `timestamp_udf`, which is timezone sensitive.
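One way to picture a specialized columnar encoding for `Timestamp` (a hedged sketch; the actual column type in Spark may lay the bytes out differently): instead of serializing the object, store a `Long` of epoch millis plus an `Int` of nanos, which together reconstruct the value exactly.

```scala
import java.sql.Timestamp

// Pack a Timestamp into primitive fields suitable for a columnar buffer.
def pack(t: Timestamp): (Long, Int) = (t.getTime, t.getNanos)

// Rebuild the Timestamp from the packed fields.
def unpack(millis: Long, nanos: Int): Timestamp = {
  val t = new Timestamp(millis)
  t.setNanos(nanos) // restores full sub-millisecond precision
  t
}
```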
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1440 from liancheng/timestamp-column-type and squashes the following commits:
e682175 [Cheng Lian] Enabled more timezone sensitive Hive tests.
53a358f [Cheng Lian] Fixed failed test suites
01b592d [Cheng Lian] Fixed SimpleDateFormat thread safety issue
2a59343 [Cheng Lian] Removed timezone sensitive Hive timestamp tests
45dd05d [Cheng Lian] Added Timestamp specific in-memory columnar representation
---
This is a follow-up of #1359.
Author: chutium <teng.qiu@gmail.com>
Closes #1442 from chutium/master and squashes the following commits:
b49cc8a [chutium] SPARK-2407: Added Parser of SQL SUBSTRING() #1442
9a60ccf [chutium] SPARK-2407: Added Parser of SQL SUBSTR() #1442
06e933b [chutium] Merge https://github.com/apache/spark
c870172 [chutium] Merge https://github.com/apache/spark
094f773 [chutium] Merge https://github.com/apache/spark
88cb37d [chutium] Merge https://github.com/apache/spark
1de83a7 [chutium] SPARK-2407: Added Parse of SQL SUBSTR()
---
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1436 from chenghao-intel/unwrapdata and squashes the following commits:
34cc21a [Cheng Hao] update the table scan accodringly since the unwrapData function changed
afc39da [Cheng Hao] Polish the code
39d6475 [Cheng Hao] Add HiveDecimal & HiveVarchar support in unwrap data
---
`StringComparison` expressions including `null` literal cases could be added to `NullPropagation`.
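The null-propagation idea can be modeled with a hedged sketch (not the Catalyst rule itself): when either operand of a string comparison is null (`None` here), the whole comparison folds to null, so the optimizer can replace the expression with a null literal.

```scala
// Three-valued "contains": None models SQL NULL propagating through.
def nullSafeContains(a: Option[String], b: Option[String]): Option[Boolean] =
  for (s <- a; sub <- b) yield s.contains(sub)
```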
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1451 from ueshin/issues/SPARK-2535 and squashes the following commits:
e99c237 [Takuya UESHIN] Add some tests.
8f9b984 [Takuya UESHIN] Add StringComparison case to NullPropagation.
---
This is a follow-up of #1428.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1432 from ueshin/issues/SPARK-2518 and squashes the following commits:
37d1ace [Takuya UESHIN] Fix foldability of Substring expression.
---
Moved couple rules out of NullPropagation and added more comments.
Author: Reynold Xin <rxin@apache.org>
Closes #1430 from rxin/sql-folding-rule and squashes the following commits:
7f9a197 [Reynold Xin] Updated documentation for ConstantFolding.
7f8cf61 [Reynold Xin] [SQL] Cleaned up ConstantFolding slightly.
---
Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-2525.
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1444 from yhuai/SPARK-2517 and squashes the following commits:
edbac3f [Yin Huai] Removed some compiler type erasure warnings.
---
JIRA issue: [SPARK-2119](https://issues.apache.org/jira/browse/SPARK-2119)
Essentially this PR fixed three issues to gain much better performance when reading large Parquet file off S3.
1. When reading the schema, fetching Parquet metadata from a part-file rather than the `_metadata` file
The `_metadata` file contains metadata of all row groups, and can be very large if there are many row groups. Since schema information and row group metadata are coupled within a single Thrift object, we have to read the whole `_metadata` to fetch the schema. On the other hand, schema is replicated among footers of all part-files, which are fairly small.
1. Only add the root directory of the Parquet file rather than all the part-files to input paths
The HDFS API can automatically filter out hidden files and underscore files (`_SUCCESS` & `_metadata`), so there's no need to enumerate all the part-files and add them individually to the input paths. What makes it much worse is that `FileInputFormat.listStatus()` calls `FileSystem.globStatus()` on each individual input path sequentially, each call resulting in a blocking remote S3 HTTP request.
1. Worked around [PARQUET-16](https://issues.apache.org/jira/browse/PARQUET-16)
Essentially PARQUET-16 is similar to the above issue, and results in lots of sequential `FileSystem.getFileStatus()` calls, which are further translated into a bunch of remote S3 HTTP requests.
`FilteringParquetRowInputFormat` should be cleaned up once PARQUET-16 is fixed.
Below is the micro-benchmark result. The dataset used is an S3 Parquet file consisting of 3,793 partitions, about 110MB per partition on average. The benchmark was done with a 9-node AWS cluster.
- Creating a Parquet `SchemaRDD` (Parquet schema is fetched)
```scala
val tweets = parquetFile(uri)
```
- Before: 17.80s
- After: 8.61s
- Fetching partition information
```scala
tweets.getPartitions
```
- Before: 700.87s
- After: 21.47s
- Counting the whole file (both steps above are executed altogether)
```scala
parquetFile(uri).count()
```
  - Before: ??? (not tested yet)
- After: 53.26s
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1370 from liancheng/faster-parquet and squashes the following commits:
94a2821 [Cheng Lian] Added comments about schema consistency
d2c4417 [Cheng Lian] Worked around PARQUET-16 to improve Parquet performance
1c0d1b9 [Cheng Lian] Accelerated Parquet schema retrieving
5bd3d29 [Cheng Lian] Fixed Parquet log level
---
This is a follow-up of #1359 with nullability narrowing.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1426 from ueshin/issues/SPARK-2504 and squashes the following commits:
5157832 [Takuya UESHIN] Remove unnecessary white spaces.
80958ac [Takuya UESHIN] Fix nullability of Substring expression.
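The nullability narrowing can be sketched as follows. This is a hypothetical simplification (the names `Expr` and `substringNullable` are illustrative, not the actual Catalyst API): a substring's result can only be null if one of its inputs can be.

```scala
// Sketch: a Substring expression is nullable iff any of its children is
// nullable, rather than being unconditionally nullable.
case class Expr(nullable: Boolean)

def substringNullable(str: Expr, pos: Expr, len: Expr): Boolean =
  str.nullable || pos.nullable || len.nullable
```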
|
|
|
|
|
|
|
|
|
|
| |
Cases where `Substring` has a `null` literal argument could be added to `NullPropagation`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1428 from ueshin/issues/SPARK-2509 and squashes the following commits:
d9eb85f [Takuya UESHIN] Add Substring cases to NullPropagation.
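The idea of the rule can be sketched like this. It is a minimal mock of null propagation, not the actual Catalyst rule (the names `Lit`, `NullLit`, and `propagateNull` are illustrative): if any argument is a null literal, the whole expression folds to a null literal at optimization time.

```scala
// Sketch of NullPropagation for Substring: a null literal anywhere in the
// argument list makes the whole expression statically null.
sealed trait Expr
case object NullLit extends Expr
case class Lit(v: Any) extends Expr
case class Substring(str: Expr, pos: Expr, len: Expr) extends Expr

def propagateNull(e: Expr): Expr = e match {
  case Substring(a, b, c) if Seq(a, b, c).contains(NullLit) => NullLit
  case other => other
}
```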
|
|
|
|
|
|
|
|
|
|
| |
SchemaRDD implementations.
Author: Aaron Staple <aaron.staple@gmail.com>
Closes #1421 from staple/SPARK-2314 and squashes the following commits:
73e04dc [Aaron Staple] [SPARK-2314] Override collect and take in JavaSchemaRDD, forwarding to SchemaRDD implementations.
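The forwarding pattern can be sketched as follows. This is a hedged mock, not the actual Spark classes: the point is that the Java wrapper delegates to the underlying Scala implementation (which can use the optimized SQL execution path) instead of inheriting the generic RDD behavior.

```scala
import scala.collection.JavaConverters._

// Stand-in for the Scala-side SchemaRDD with its specialized collect/take.
class SchemaRDD(rows: Seq[Int]) {
  def collect(): Array[Int] = rows.toArray
  def take(n: Int): Array[Int] = rows.take(n).toArray
}

// The Java wrapper overrides collect/take and forwards to the Scala
// implementation, converting results to Java collections.
class JavaSchemaRDD(val baseSchemaRDD: SchemaRDD) {
  def collect(): java.util.List[Int] = baseSchemaRDD.collect().toSeq.asJava
  def take(n: Int): java.util.List[Int] = baseSchemaRDD.take(n).toSeq.asJava
}
```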
|
|
|
|
|
|
|
|
|
|
|
|
| |
data type objects.
JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2498
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes #1423 from concretevitamin/scala-ref-catalyst and squashes the following commits:
325a149 [Zongheng Yang] Synchronize on a lock when initializing data type objects in Catalyst.
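The synchronization pattern can be sketched as follows; this is a generic illustration, not the actual Catalyst code. Lazy initialization of a shared singleton is guarded by a lock the object owns, so two threads racing on first access cannot observe a partially constructed value.

```scala
// Sketch: initialize a shared, lazily-built structure under a private lock
// so concurrent first accesses are safe.
object DataTypeCache {
  private val initLock = new Object
  private var cache: Map[String, Int] = null

  def get(name: String): Option[Int] = initLock.synchronized {
    if (cache == null) cache = Map("IntegerType" -> 4, "LongType" -> 8)
    cache.get(name)
  }
}
```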
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #1414 from marmbrus/exprIdResolution and squashes the following commits:
97b47bc [Michael Armbrust] Attribute equality comparisons should be done by exprId.
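The distinction can be sketched as follows (a hypothetical simplification; `AttributeRef` and `sameRef` are illustrative names): two attributes that happen to share a name are still different columns unless their expression ids match.

```scala
// Sketch: attribute identity is the unique exprId, not the display name.
case class AttributeRef(name: String, exprId: Long) {
  def sameRef(other: AttributeRef): Boolean = exprId == other.exprId
}
```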
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst
and adds tests to verify correct operation.
Author: William Benton <willb@redhat.com>
Closes #1359 from willb/internalSqlSubstring and squashes the following commits:
ccedc47 [William Benton] Fixed too-long line.
a30a037 [William Benton] replace view bounds with implicit parameters
ec35c80 [William Benton] Adds fixes from review:
4f3bfdb [William Benton] Added internal implementation of SQL SUBSTR()
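The semantics being implemented can be sketched as follows. This is a hedged standalone sketch of SQL `SUBSTR` behavior (1-based positions, negative positions counting from the end), not the Catalyst expression itself:

```scala
// Sketch of SQL SUBSTR semantics: positions are 1-based; a negative position
// counts back from the end of the string; out-of-range requests are clamped.
def sqlSubstr(s: String, pos: Int, len: Int): String = {
  val start =
    if (pos > 0) pos - 1
    else if (pos < 0) math.max(s.length + pos, 0)
    else 0
  val end = math.min(start + math.max(len, 0), s.length)
  if (start >= s.length) "" else s.substring(start, end)
}
```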
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
failed to resolve references in the format of "tableName.fieldName"
Please refer to JIRA (https://issues.apache.org/jira/browse/SPARK-2474) for how to reproduce the problem and my understanding of the root cause.
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1406 from yhuai/SPARK-2474 and squashes the following commits:
96b1627 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2474
af36d65 [Yin Huai] Fix comment.
be86ba9 [Yin Huai] Correct SQL console settings.
c43ad00 [Yin Huai] Wrap the relation in a Subquery named by the table name in OverrideCatalog.lookupRelation.
a5c2145 [Yin Huai] Support sql/console.
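The fix's key idea, wrapping the relation in a `Subquery` named after the table, can be sketched as follows. This is a hypothetical mock of qualified-name resolution (`Relation`, `Subquery`, and `resolve` here are illustrative, not the actual Catalyst classes):

```scala
// Sketch: wrapping a relation in a Subquery aliased by its table name lets a
// qualified reference like "tableName.fieldName" resolve against the alias.
case class Relation(fields: Seq[String])
case class Subquery(alias: String, child: Relation) {
  def resolve(ref: String): Option[String] = ref.split('.') match {
    case Array(tbl, field) if tbl == alias && child.fields.contains(field) => Some(field)
    case Array(field) if child.fields.contains(field) => Some(field)
    case _ => None
  }
}
```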
|
|
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #1396 from marmbrus/moreTests and squashes the following commits:
6660b60 [Michael Armbrust] Blacklist a test that requires DFS command.
8b6001c [Michael Armbrust] Add golden files.
ccd8f97 [Michael Armbrust] Whitelist more tests.
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #1411 from marmbrus/nestedRepeated and squashes the following commits:
044fa09 [Michael Armbrust] Fix parsing of repeated, nested data access.
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #1412 from marmbrus/lockHiveClient and squashes the following commits:
4bc9d5a [Michael Armbrust] protected[hive]
22e9177 [Michael Armbrust] Add comments.
7aa8554 [Michael Armbrust] Don't lock on hive's object.
a6edc5f [Michael Armbrust] Lock usage of hive client.
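The "don't lock on hive's object" point can be sketched generically as follows (a hedged illustration, not the actual wrapper class): synchronize on a lock the wrapper owns, since locking on the shared client object itself can collide with the library's own internal locking.

```scala
// Sketch: serialize access to a non-thread-safe client through a private
// lock we own, rather than synchronizing on the client object itself.
class ClientWrapper[C](client: C) {
  private val lock = new Object
  def withClient[A](f: C => A): A = lock.synchronized(f(client))
}
```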
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Note that this commit changes the semantics when loading in data that was created with prior versions of Spark SQL. Before, we were writing out strings as Binary data without adding any other annotations. Thus, when data is read in from prior versions, data that was StringType will now become BinaryType. Users that need strings can CAST that column to a String. It was decided that while this breaks compatibility, it does make us compatible with other systems (Hive, Thrift, etc) and adds support for Binary data, so this is the right decision long term.
To support `BinaryType`, the following changes are needed:
- Make `StringType` use `OriginalType.UTF8`
- Add `BinaryType` using `PrimitiveTypeName.BINARY` without `OriginalType`
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1373 from ueshin/issues/SPARK-2446 and squashes the following commits:
ecacb92 [Takuya UESHIN] Add BinaryType support to Parquet I/O.
616e04a [Takuya UESHIN] Make StringType use OriginalType.UTF8.
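The two bullet points above amount to the following mapping, sketched here with illustrative stand-in types (the real code uses Parquet's `PrimitiveTypeName` and `OriginalType` enums):

```scala
// Sketch of the Catalyst-to-Parquet primitive mapping described above:
// strings become BINARY annotated with UTF8; raw binary stays plain BINARY.
sealed trait CatalystType
case object StringType extends CatalystType
case object BinaryType extends CatalystType

// Returns (parquet primitive, optional OriginalType annotation).
def toParquet(t: CatalystType): (String, Option[String]) = t match {
  case StringType => ("BINARY", Some("UTF8")) // PrimitiveTypeName.BINARY + OriginalType.UTF8
  case BinaryType => ("BINARY", None)         // no OriginalType annotation
}
```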
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This fix obtains a performance boost comparable to [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by yhuai. The results below are updated for this fix.
## Benchmarks
Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.
Without the fix:
Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)
With this fix:
Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes #1408 from concretevitamin/slow-read-2 and squashes the following commits:
d86e437 [Zongheng Yang] Move update & initialization out of potentially long loop.
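The hoisting pattern behind the fix can be sketched as follows (a generic illustration with made-up names, not the actual scan code): expensive setup runs once per partition instead of once per row.

```scala
// Sketch: deserializer construction is hoisted out of the per-row loop, so
// it happens once per partition rather than once per row.
var initCount = 0
def makeDeserializer(): String => Int = { initCount += 1; s => s.length }

def processPartition(rows: Iterator[String]): Int = {
  val deser = makeDeserializer() // hoisted: one initialization per partition
  var total = 0
  rows.foreach(r => total += deser(r))
  total
}
```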
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
InMemoryRelation
Reuse byte buffers when creating unique attributes for multiple instances of an InMemoryRelation in a single query plan.
Author: Michael Armbrust <michael@databricks.com>
Closes #1332 from marmbrus/doubleCache and squashes the following commits:
4a19609 [Michael Armbrust] Clean up concurrency story by calculating buffers in the constructor.
b39c931 [Michael Armbrust] Allocations are kind of a side effect.
f67eff7 [Michael Armbrust] Reuse same byte buffers when creating new instance of InMemoryRelation
|
|
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #1366 from marmbrus/partialDistinct and squashes the following commits:
12a31ab [Michael Armbrust] Add more efficient distinct operator.
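The efficiency gain comes from performing distinct partially, per partition, before the shuffle. A minimal standalone sketch of the per-partition step (illustrative, not the actual operator):

```scala
// Sketch: drop duplicates within a partition with a hash set before the
// shuffle, so far less data crosses the network for the final distinct.
def partialDistinct[A](partition: Iterator[A]): Iterator[A] = {
  val seen = scala.collection.mutable.LinkedHashSet.empty[A]
  partition.foreach(seen += _)
  seen.iterator
}
```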
|