| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://issues.apache.org/jira/browse/SPARK-13260
This is a quicky fix for `count(*)`.
When the `requiredColumns` is empty, currently it returns `sqlContext.sparkContext.emptyRDD[Row]` which does not have the count.
Just like JSON datasource, this PR lets the CSV datasource count the rows but do not parse each set of tokens.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #11169 from HyukjinKwon/SPARK-13260.
|
|
|
|
|
|
|
|
| |
Previously we were using Option[String] and None to indicate the case when Spark fails to generate SQL. It is easier to just use exceptions to propagate error cases, rather than having for comprehension everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not).
Author: Reynold Xin <rxin@databricks.com>
Closes #11171 from rxin/SPARK-13282.
|
|
|
|
|
|
|
|
| |
The current implementation of ResolveSortReferences can only push one missing attributes into it's child, it failed to analyze TPCDS Q98, because of there are two missing attributes in that (one from Window, another from Aggregate).
Author: Davies Liu <davies@databricks.com>
Closes #11153 from davies/resolve_sort.
|
|
|
|
|
|
|
|
|
|
| |
This PR add SQL metrics (numOutputRows) for generated operators (same as non-generated), the cost is about 0.2 nano seconds per row.
<img width="806" alt="gen metrics" src="https://cloud.githubusercontent.com/assets/40902/12994694/47f5881e-d0d7-11e5-9d47-78229f559ab0.png">
Author: Davies Liu <davies@databricks.com>
Closes #11170 from davies/gen_metric.
|
|
|
|
|
|
|
|
| |
Add the table name validation at the temp table creation
Author: jayadevanmurali <jayadevan.m@tcs.com>
Closes #11051 from jayadevanmurali/branch-0.2-SPARK-12982.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-13277
There is an ANTLR warning during compilation:
warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7:
Decision can match input such as "KW_USING Identifier" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
This patch is to fix it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #11168 from viirya/fix-parser-using.
|
|
|
|
|
|
|
|
| |
In spark-env.sh.template, there are multi-byte characters, this PR will remove it.
Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
Closes #11149 from sasakitoa/remove_multibyte_in_sparkenv.
|
|
|
|
|
|
|
|
| |
pipeline plan in comments.
Author: Nong Li <nong@databricks.com>
Closes #11155 from nongli/spark-13270.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
in SQL
Currently, the parser added two `Distinct` operators in the plan if we are using `Union` or `Union Distinct` in the SQL. This PR is to remove the extra `Distinct` from the plan.
For example, before the fix, the following query has a plan with two `Distinct`
```scala
sql("select * from t0 union select * from t0").explain(true)
```
```
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_2
+- 'Distinct
+- 'Project [unresolvedalias(*,None)]
+- 'Subquery u_1
+- 'Distinct
+- 'Union
:- 'Project [unresolvedalias(*,None)]
: +- 'UnresolvedRelation `t0`, None
+- 'Project [unresolvedalias(*,None)]
+- 'UnresolvedRelation `t0`, None
== Analyzed Logical Plan ==
id: bigint
Project [id#16L]
+- Subquery u_2
+- Distinct
+- Project [id#16L]
+- Subquery u_1
+- Distinct
+- Union
:- Project [id#16L]
: +- Subquery t0
: +- Relation[id#16L] ParquetRelation
+- Project [id#16L]
+- Subquery t0
+- Relation[id#16L] ParquetRelation
== Optimized Logical Plan ==
Aggregate [id#16L], [id#16L]
+- Aggregate [id#16L], [id#16L]
+- Union
:- Project [id#16L]
: +- Relation[id#16L] ParquetRelation
+- Project [id#16L]
+- Relation[id#16L] ParquetRelation
```
After the fix, the plan is changed without the extra `Distinct` as follows:
```
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_1
+- 'Distinct
+- 'Union
:- 'Project [unresolvedalias(*,None)]
: +- 'UnresolvedRelation `t0`, None
+- 'Project [unresolvedalias(*,None)]
+- 'UnresolvedRelation `t0`, None
== Analyzed Logical Plan ==
id: bigint
Project [id#17L]
+- Subquery u_1
+- Distinct
+- Union
:- Project [id#16L]
: +- Subquery t0
: +- Relation[id#16L] ParquetRelation
+- Project [id#16L]
+- Subquery t0
+- Relation[id#16L] ParquetRelation
== Optimized Logical Plan ==
Aggregate [id#17L], [id#17L]
+- Union
:- Project [id#16L]
: +- Relation[id#16L] ParquetRelation
+- Project [id#16L]
+- Relation[id#16L] ParquetRelation
```
Author: gatorsmile <gatorsmile@gmail.com>
Closes #11120 from gatorsmile/unionDistinct.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Identifier/Expression string
The parser currently parses the following strings without a hitch:
* Table Identifier:
* `a.b.c` should fail, but results in the following table identifier `a.b`
* `table!#` should fail, but results in the following table identifier `table`
* Expression
* `1+2 r+e` should fail, but results in the following expression `1 + 2`
This PR fixes this by adding terminated rules for both expression parsing and table identifier parsing.
cc cloud-fan (we discussed this in https://github.com/apache/spark/pull/10649) jayadevanmurali (this causes your PR https://github.com/apache/spark/pull/11051 to fail)
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #11159 from hvanhovell/SPARK-13276.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For lots of SQL operators, we have metrics for both of input and output, the number of input rows should be exactly the number of output rows of child, we could only have metrics for output rows.
After we improved the performance using whole stage codegen, the overhead of SQL metrics are not trivial anymore, we should avoid that if it's not necessary.
This PR remove all the SQL metrics for number of input rows, add SQL metric of number of output rows for all LeafNode. All remove the SQL metrics from those operators that have the same number of rows from input and output (for example, Projection, we may don't need that).
The new SQL UI will looks like:
![metrics](https://cloud.githubusercontent.com/assets/40902/12965227/63614e5e-d009-11e5-88b3-84fea04f9c20.png)
Author: Davies Liu <davies@databricks.com>
Closes #11163 from davies/remove_metrics.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Grouping() returns a column is aggregated or not, grouping_id() returns the aggregation levels.
grouping()/grouping_id() could be used with window function, but does not work in having/sort clause, will be fixed by another PR.
The GROUPING__ID/grouping_id() in Hive is wrong (according to docs), we also did it wrongly, this PR change that to match the behavior in most databases (also the docs of Hive).
Author: Davies Liu <davies@databricks.com>
Closes #10677 from davies/grouping.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR addresses two issues:
- Self join does not work in SQL Generation
- When creating new instances for `LogicalRelation`, `metastoreTableIdentifier` is lost.
liancheng Could you please review the code changes? Thank you!
Author: gatorsmile <gatorsmile@gmail.com>
Closes #11084 from gatorsmile/selfJoinInSQLGen.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Ambiguity Caused by Internally Generated Expressions
Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.
This is OK for normal query execution since these attribute references get expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings since expression IDs are erased.
Here's an example Spark 1.6.0 snippet for illustration:
```scala
sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
```
The above code produces the following resolved plan:
```
== Analyzed Logical Plan ==
_c0: bigint
Project [_c0#101L]
+- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
+- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
+- Subquery t
+- Project [id#46L AS a#47L,id#46L AS b#48L]
+- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
```
Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.
The solution is to automatically add the expression IDs into the attribute name for the Alias and AttributeReferences that are generated by Analyzer in SQL Generation.
In this PR, it also resolves another issue. Users could use the same name as the internally generated names. The duplicate names should not cause name ambiguity. When resolving the column, Catalyst should not pick the column that is internally generated.
Could you review the solution? marmbrus liancheng
I did not set the newly added flag for all the alias and attribute reference generated by Analyzers. Please let me know if I should do it? Thank you!
Author: gatorsmile <gatorsmile@gmail.com>
Closes #11050 from gatorsmile/namingConflicts.
|
|
|
|
|
|
|
|
| |
Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator
Author: raela <raela@databricks.com>
Closes #11158 from raelawang/master.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
### Management API for Continuous Queries
**API for getting status of each query**
- Whether active or not
- Unique name of each query
- Status of the sources and sinks
- Exceptions
**API for managing each query**
- Immediately stop an active query
- Waiting for a query to be terminated, correctly or with error
**API for managing multiple queries**
- Listing all active queries
- Getting an active query by name
- Waiting for any one of the active queries to be terminated
**API for listening to query life cycle events**
- ContinuousQueryListener API for query start, progress and termination events.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #11030 from tdas/streaming-df-management-api.
|
|
|
|
|
|
|
|
|
|
| |
implemented compression schemes for InMemoryRelation
This pr adds benchmark codes for in-memory cache compression to make future developments and discussions more smooth.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #10965 from maropu/ImproveColumnarCache.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit" / #7334) inadvertently broke the planning of the TakeOrderedAndProject operator: because ReturnAnswer was the new root of the query plan, the TakeOrderedAndProject rule was unable to match before BasicOperators.
This patch fixes this by moving the `TakeOrderedAndCollect` and `CollectLimit` rules into the same strategy.
In addition, I made changes to the TakeOrderedAndProject operator in order to make its `doExecute()` method lazy and added a new TakeOrderedAndProjectSuite which tests the new code path.
/cc davies and marmbrus for review.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #11145 from JoshRosen/take-ordered-and-project-fix.
|
|
|
|
|
|
| |
Author: Gábor Lipták <gliptak@gmail.com>
Closes #9532 from gliptak/SPARK-11565.
|
|
|
|
|
|
|
|
|
|
| |
`FileStreamSource` is an implementation of `org.apache.spark.sql.execution.streaming.Source`. It takes advantage of the existing `HadoopFsRelationProvider` to support various file formats. It remembers files in each batch and stores it into the metadata files so as to recover them when restarting. The metadata files are stored in the file system. There will be a further PR to clean up the metadata files periodically.
This is based on the initial work from marmbrus.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11034 from zsxwing/stream-df-file-source.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
unnecessary Spark Filter
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
Current plan:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
+- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
== Physical Plan ==
+- Filter (col0#0 = xxx)
+- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```
This patch enables a plan below;
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
+- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
== Physical Plan ==
Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #10427 from maropu/RemoveFilterInJdbcScan.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR improve the lookup of BytesToBytesMap by:
1. Generate code for calculate the hash code of grouping keys.
2. Do not use MemoryLocation, fetch the baseObject and offset for key and value directly (remove the indirection).
Author: Davies Liu <davies@databricks.com>
Closes #11010 from davies/gen_map.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds the benchmark results as comments.
The codegen version is slower than the interpreted version for `simple` case becasue of 3 reasons:
1. codegen version use a more complex hash algorithm than interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153).
2. codegen version will write the hash value to a row first and then read it out. I tried to create a `GenerateHasher` that can generate code to return hash value directly and got about 60% speed up for the `simple` case, does it worth?
3. the row in `simple` case only has one int field, so the runtime reflection may be removed because of branch prediction, which makes the interpreted version faster.
The `array` case is also slow for similar reasons, e.g. array elements are of same type, so interpreted version can probably get rid of runtime reflection by branch prediction.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10917 from cloud-fan/hash-benchmark.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
WIP: running tests. Code needs a bit of clean up.
This patch completes the vectorized decoding with the goal of passing the existing
tests. There is still more patches to support the rest of the format spec, even
just for flat schemas.
This patch adds a new flag to enable the vectorized decoding. Tests were updated
to try with both modes where applicable.
Once this is working well, we can remove the previous code path.
Author: Nong Li <nong@databricks.com>
Closes #11055 from nongli/spark-12992-2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR improve the performance for Broadcast join with dimension tables, which is common in data warehouse.
If the join key can fit in a long, we will use a special api `get(Long)` to get the rows from HashedRelation.
If the HashedRelation only have unique keys, we will use a special api `getValue(Long)` or `getValue(InternalRow)`.
If the keys can fit within a long, also the keys are dense, we will use a array of UnsafeRow, instead a hash map.
TODO: will do cleanup
Author: Davies Liu <davies@databricks.com>
Closes #11065 from davies/gen_dim.
|
|
|
|
|
|
|
|
|
|
| |
analysis of encoder
nullability should only be considered as an optimization rather than part of the type system, so instead of failing analysis for mismatch nullability, we should pass analysis and add runtime null check.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11035 from cloud-fan/ignore-nullability.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch changes the implementation of the physical `Limit` operator so that it relies on the `Exchange` operator to perform data movement rather than directly using `ShuffledRDD`. In addition to improving efficiency, this lays the necessary groundwork for further optimization of limit, such as limit pushdown or whole-stage codegen.
At a high-level, this replaces the old physical `Limit` operator with two new operators, `LocalLimit` and `GlobalLimit`. `LocalLimit` performs per-partition limits, while `GlobalLimit` applies the final limit to a single partition; `GlobalLimit`'s declares that its `requiredInputDistribution` is `SinglePartition`, which will cause the planner to use an `Exchange` to perform the appropriate shuffles. Thus, a logical `Limit` appearing in the middle of a query plan will be expanded into `LocalLimit -> Exchange to one partition -> GlobalLimit`.
In the old code, calling `someDataFrame.limit(100).collect()` or `someDataFrame.take(100)` would actually skip the shuffle and use a fast-path which used `executeTake()` in order to avoid computing all partitions in case only a small number of rows were requested. This patch preserves this optimization by treating logical `Limit` operators specially when they appear as the terminal operator in a query plan: if a `Limit` is the final operator, then we will plan a special `CollectLimit` physical operator which implements the old `take()`-based logic.
In order to be able to match on operators only at the root of the query plan, this patch introduces a special `ReturnAnswer` logical operator which functions similar to `BroadcastHint`: this dummy operator is inserted at the root of the optimized logical plan before invoking the physical planner, allowing the planner to pattern-match on it.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #7334 from JoshRosen/remove-copy-in-limit.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
structures
rxin srowen
I work out note message for rdd.take function, please help to review.
If it's fine, I can apply to all other function later.
Author: Tommy YU <tummyyu@163.com>
Closes #10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.
|
|
|
|
|
|
|
|
|
| |
Trivial search-and-replace to eliminate deprecation warnings in Scala 2.11.
Also works with 2.10
Author: Jakob Odersky <jakob@odersky.com>
Closes #11085 from jodersky/SPARK-13171.
|
|
|
|
|
|
|
|
| |
Since we remove the configuration for codegen, we are heavily reply on codegen (also TungstenAggregate require the generated MutableProjection to update UnsafeRow), should remove the fallback, which could make user confusing, see the discussion in SPARK-13116.
Author: Davies Liu <davies@databricks.com>
Closes #11097 from davies/remove_fallback.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://issues.apache.org/jira/browse/SPARK-12939
Now we will catch `ObjectOperator` in `Analyzer` and resolve the `fromRowExpression/deserializer` inside it. Also update the `MapGroups` and `CoGroup` to pass in `dataAttributes`, so that we can correctly resolve value deserializer(the `child.output` contains both groupking key and values, which may mess things up if they have same-name attribtues). End-to-end tests are added.
follow-ups:
* remove encoders from typed aggregate expression.
* completely remove resolve/bind in `ExpressionEncoder`
Author: Wenchen Fan <wenchen@databricks.com>
Closes #10852 from cloud-fan/bug.
|
|
|
|
|
|
|
|
|
|
| |
DataFrameReaderWriterSuite
A follow up PR for #11062 because it didn't rename the test suite.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11096 from zsxwing/rename.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds option function for boolean, long, and double types. This makes it slightly easier for Spark users to specify options without turning them into strings. Using the JSON data source as an example.
Before this patch:
```scala
sqlContext.read.option("primitivesAsString", "true").json("/path/to/json")
```
After this patch:
Before this patch:
```scala
sqlContext.read.option("primitivesAsString", true).json("/path/to/json")
```
Author: Reynold Xin <rxin@databricks.com>
Closes #11072 from rxin/SPARK-13187.
|
|
|
|
|
|
|
|
| |
Another trivial deprecation fix for Scala 2.11
Author: Jakob Odersky <jakob@odersky.com>
Closes #11089 from jodersky/SPARK-13208.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Tables)
JIRA: https://issues.apache.org/jira/browse/SPARK-12850
This PR is to support bucket pruning when the predicates are `EqualTo`, `EqualNullSafe`, `IsNull`, `In`, and `InSet`.
Like HIVE, in this PR, the bucket pruning works when the bucketing key has one and only one column.
So far, I do not find a way to verify how many buckets are actually scanned. However, I did verify it when doing the debug. Could you provide a suggestion how to do it properly? Thank you! cloud-fan yhuai rxin marmbrus
BTW, we can add more cases to support complex predicate including `Or` and `And`. Please let me know if I should do it in this PR.
Maybe we also need to add test cases to verify if bucket pruning works well for each data type.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10942 from gatorsmile/pruningBuckets.
|
|
|
|
|
|
|
|
| |
This patch incorporates review feedback from #11069, which is already merged.
Author: Andrew Or <andrew@databricks.com>
Closes #11080 from andrewor14/catalog-follow-ups.
|
|
|
|
|
|
|
|
| |
Spark SQL should collapse adjacent `Repartition` operators and only keep the last one.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #11064 from JoshRosen/collapse-repartition.
|
|
|
|
|
|
|
|
| |
This is a small addendum to #10762 to make the code more robust again future changes.
Author: Reynold Xin <rxin@databricks.com>
Closes #11070 from rxin/SPARK-12828-natural-join.
|
|
|
|
|
|
|
|
|
| |
Jira:
https://issues.apache.org/jira/browse/SPARK-12828
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #10762 from adrian-wang/naturaljoin.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a step towards consolidating `SQLContext` and `HiveContext`.
This patch extends the existing Catalog API added in #10982 to include methods for handling table partitions. In particular, a partition is identified by `PartitionSpec`, which is just a `Map[String, String]`. The Catalog is still not used by anything yet, but its API is now more or less complete and an implementation is fully tested.
About 200 lines are test code.
Author: Andrew Or <andrew@databricks.com>
Closes #11069 from andrewor14/catalog.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make an internal non-deprecated version of incBytesRead and incRecordsRead so we don't have unecessary deprecation warnings in our build.
Right now incBytesRead and incRecordsRead are marked as deprecated and for internal use only. We should make private[spark] versions which are not deprecated and switch to those internally so as to not clutter up the warning messages when building.
cc andrewor14 who did the initial deprecation
Author: Holden Karau <holden@us.ibm.com>
Closes #11056 from holdenk/SPARK-13152-fix-task-metrics-deprecation-warnings.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Best time is stabler than average time, also added a column for nano seconds per row (which could be used to estimate contributions of each components in a query).
Having best time and average time together for more information (we can see kind of variance).
rate, time per row and relative are all calculated using best time.
The result looks like this:
```
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
rang/filter/sum: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
rang/filter/sum codegen=false 14332 / 16646 36.0 27.8 1.0X
rang/filter/sum codegen=true 845 / 940 620.0 1.6 17.0X
```
Author: Davies Liu <davies@databricks.com>
Closes #11018 from davies/gen_bench.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
They seem redundant and we can simply use DataFrameReader/Writer. The new usage looks like:
```scala
val df = sqlContext.read.stream("...")
val handle = df.write.stream("...")
handle.stop()
```
Author: Reynold Xin <rxin@databricks.com>
Closes #11062 from rxin/SPARK-13166.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ```SparkSqlLexer``` currently swallows characters which have not been defined in the grammar. This causes problems with SQL commands, such as: ```add jar file:///tmp/ab/TestUDTF.jar```. In this example the `````` is swallowed.
This PR adds an extra Lexer rule to handle such input, and makes a tiny modification to the ```ASTNode```.
cc davies liancheng
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #11052 from hvanhovell/SPARK-13157.
|
|
|
|
|
|
|
|
| |
A row from stream side could match multiple rows on build side, the loop for these matched rows should not be interrupted when emitting a row, so we buffer the output rows in a linked list, check the termination condition on producer loop (for example, Range or Aggregate).
Author: Davies Liu <davies@databricks.com>
Closes #10989 from davies/gen_join.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Based on the semantics of a query, we can derive a number of data constraints on output of each (logical or physical) operator. For instance, if a filter defines `‘a > 10`, we know that the output data of this filter satisfies 2 constraints:
1. `‘a > 10`
2. `isNotNull(‘a)`
This PR proposes a possible way of keeping track of these constraints and propagating them in the logical plan, which can then help us build more advanced optimizations (such as pruning redundant filters, optimizing joins, among others). We define constraints as a set of (implicitly conjunctive) expressions. For e.g., if a filter operator has constraints = `Set(‘a > 10, ‘b < 100)`, it’s implied that the outputs satisfy both individual constraints (i.e., `‘a > 10` AND `‘b < 100`).
Design Document: https://docs.google.com/a/databricks.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit?usp=sharing
Author: Sameer Agarwal <sameer@databricks.com>
Closes #10844 from sameeragarwal/constraints.
|
|
|
|
|
|
|
|
|
|
|
| |
1. try to avoid the suffix (unique id)
2. remove the comment if there is no code generated.
3. re-arrange the order of functions
4. trop the new line for inlined blocks.
Author: Davies Liu <davies@databricks.com>
Closes #11032 from davies/better_suffix.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR add spilling support for generated TungstenAggregate.
If spilling happened, it's not that bad to do the iterator based sort-merge-aggregate (not generated).
The changes will be covered by TungstenAggregationQueryWithControlledFallbackSuite
Author: Davies Liu <davies@databricks.com>
Closes #10998 from davies/gen_spilling.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
decoding to ColumnarBatch.
This patch implements support for more types when doing the vectorized decode. There are
a few more types remaining but they should be very straightforward after this. This code
has a few copy and paste pieces but they are difficult to eliminate due to performance
considerations.
Specifically, this patch adds support for:
- String, Long, Byte types
- Dictionary encoding for those types.
Author: Nong Li <nong@databricks.com>
Closes #10908 from nongli/spark-12992.
|