path: root/sql
Commit message (author, date, files changed, lines -/+)
* [SPARK-13260][SQL] count(*) does not work with CSV data source (hyukjinkwon, 2016-02-12, 2 files, -45/+41)

  https://issues.apache.org/jira/browse/SPARK-13260

  This is a quick fix for `count(*)`. When `requiredColumns` is empty, the CSV data source currently returns `sqlContext.sparkContext.emptyRDD[Row]`, which loses the count. Just like the JSON data source, this PR lets the CSV data source count the rows without parsing each set of tokens.

  Author: hyukjinkwon <gurwls223@gmail.com>

  Closes #11169 from HyukjinKwon/SPARK-13260.

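  Illustration (not part of the PR; the path and options are placeholders): with this fix, a count-only query over a CSV-backed table returns the row count even though no columns are required.

  ```scala
  // Hedged usage sketch against the built-in CSV source of this branch.
  val df = sqlContext.read
    .format("csv")
    .option("header", "true")
    .load("/path/to/data.csv")
  df.registerTempTable("csv_table")

  // Previously this returned 0 because the empty projection produced an empty RDD;
  // now the rows are counted without parsing each set of tokens.
  sqlContext.sql("SELECT count(*) FROM csv_table").show()
  ```
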
* [SPARK-13282][SQL] LogicalPlan toSql should just return a String (Reynold Xin, 2016-02-12, 6 files, -156/+141)

  Previously we were using Option[String] and None to indicate the case when Spark fails to generate SQL. It is easier to just use exceptions to propagate error cases, rather than having for-comprehensions everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not).

  Author: Reynold Xin <rxin@databricks.com>

  Closes #11171 from rxin/SPARK-13282.

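  A minimal sketch of what such a concatenation helper might look like (an illustration, not the actual helper added in the PR):

  ```scala
  // Joins non-empty SQL fragments with single spaces, so callers never have to
  // reason about leading or trailing whitespace.
  def build(segments: String*): String =
    segments.map(_.trim).filter(_.nonEmpty).mkString(" ")

  // Yields "SELECT id FROM t WHERE id > 10"
  build("SELECT", "id", "FROM t", "", "WHERE id > 10")
  ```
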
* [SPARK-12705][SQL] push missing attributes for Sort (Davies Liu, 2016-02-12, 3 files, -83/+67)

  The current implementation of ResolveSortReferences can only push one missing attribute into its child. It failed to analyze TPC-DS Q98 because there are two missing attributes in that query (one from Window, another from Aggregate).

  Author: Davies Liu <davies@databricks.com>

  Closes #11153 from davies/resolve_sort.

* [SPARK-12915][SQL] add SQL metrics of numOutputRows for whole stage codegen (Davies Liu, 2016-02-11, 9 files, -31/+71)

  This PR adds the SQL metric numOutputRows for generated operators (same as for non-generated ones); the cost is about 0.2 nanoseconds per row.

  <img width="806" alt="gen metrics" src="https://cloud.githubusercontent.com/assets/40902/12994694/47f5881e-d0d7-11e5-9d47-78229f559ab0.png">

  Author: Davies Liu <davies@databricks.com>

  Closes #11170 from davies/gen_metric.

* [SPARK-12982][SQL] Add table name validation in temp table registration (jayadevanmurali, 2016-02-11, 2 files, -1/+13)

  Adds table name validation at temp table creation.

  Author: jayadevanmurali <jayadevan.m@tcs.com>

  Closes #11051 from jayadevanmurali/branch-0.2-SPARK-12982.

* [SPARK-13277][SQL] ANTLR ignores other rule using the USING keyword (Liang-Chi Hsieh, 2016-02-11, 1 file, -1/+1)

  JIRA: https://issues.apache.org/jira/browse/SPARK-13277

  There is an ANTLR warning during compilation:

  warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7: Decision can match input such as "KW_USING Identifier" using multiple alternatives: 2, 3
  As a result, alternative(s) 3 were disabled for that input

  This patch fixes it.

  Author: Liang-Chi Hsieh <viirya@gmail.com>

  Closes #11168 from viirya/fix-parser-using.

* [SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.template (Sasaki Toru, 2016-02-11, 1 file, -1/+1)

  spark-env.sh.template contains multi-byte characters; this PR removes them.

  Author: Sasaki Toru <sasakitoa@nttdata.co.jp>

  Closes #11149 from sasakitoa/remove_multibyte_in_sparkenv.

* [SPARK-13270][SQL] Remove extra new lines in whole stage codegen and include pipeline plan in comments (Nong Li, 2016-02-10, 2 files, -2/+20)

  Author: Nong Li <nong@databricks.com>

  Closes #11155 from nongli/spark-13270.

* [SPARK-13235][SQL] Removed an Extra Distinct from the Plan when Using Union in SQL (gatorsmile, 2016-02-11, 2 files, -29/+32)

  Currently, the parser adds two `Distinct` operators to the plan if we use `Union` or `Union Distinct` in SQL. This PR removes the extra `Distinct` from the plan.

  For example, before the fix, the following query has a plan with two `Distinct`:

  ```scala
  sql("select * from t0 union select * from t0").explain(true)
  ```

  ```
  == Parsed Logical Plan ==
  'Project [unresolvedalias(*,None)]
  +- 'Subquery u_2
  +- 'Distinct
  +- 'Project [unresolvedalias(*,None)]
  +- 'Subquery u_1
  +- 'Distinct
  +- 'Union
  :- 'Project [unresolvedalias(*,None)]
  :  +- 'UnresolvedRelation `t0`, None
  +- 'Project [unresolvedalias(*,None)]
  +- 'UnresolvedRelation `t0`, None

  == Analyzed Logical Plan ==
  id: bigint
  Project [id#16L]
  +- Subquery u_2
  +- Distinct
  +- Project [id#16L]
  +- Subquery u_1
  +- Distinct
  +- Union
  :- Project [id#16L]
  :  +- Subquery t0
  :     +- Relation[id#16L] ParquetRelation
  +- Project [id#16L]
  +- Subquery t0
  +- Relation[id#16L] ParquetRelation

  == Optimized Logical Plan ==
  Aggregate [id#16L], [id#16L]
  +- Aggregate [id#16L], [id#16L]
  +- Union
  :- Project [id#16L]
  :  +- Relation[id#16L] ParquetRelation
  +- Project [id#16L]
  +- Relation[id#16L] ParquetRelation
  ```

  After the fix, the plan no longer contains the extra `Distinct`:

  ```
  == Parsed Logical Plan ==
  'Project [unresolvedalias(*,None)]
  +- 'Subquery u_1
  +- 'Distinct
  +- 'Union
  :- 'Project [unresolvedalias(*,None)]
  :  +- 'UnresolvedRelation `t0`, None
  +- 'Project [unresolvedalias(*,None)]
  +- 'UnresolvedRelation `t0`, None

  == Analyzed Logical Plan ==
  id: bigint
  Project [id#17L]
  +- Subquery u_1
  +- Distinct
  +- Union
  :- Project [id#16L]
  :  +- Subquery t0
  :     +- Relation[id#16L] ParquetRelation
  +- Project [id#16L]
  +- Subquery t0
  +- Relation[id#16L] ParquetRelation

  == Optimized Logical Plan ==
  Aggregate [id#17L], [id#17L]
  +- Union
  :- Project [id#16L]
  :  +- Relation[id#16L] ParquetRelation
  +- Project [id#16L]
  +- Relation[id#16L] ParquetRelation
  ```

  Author: gatorsmile <gatorsmile@gmail.com>

  Closes #11120 from gatorsmile/unionDistinct.

* [SPARK-13276] Catch bad characters at the end of a Table Identifier/Expression string (Herman van Hovell, 2016-02-11, 3 files, -4/+17)

  The parser currently parses the following strings without a hitch:

  * Table Identifier:
    * `a.b.c` should fail, but results in the following table identifier `a.b`
    * `table!#` should fail, but results in the following table identifier `table`
  * Expression:
    * `1+2 r+e` should fail, but results in the following expression `1 + 2`

  This PR fixes this by adding terminated rules for both expression parsing and table identifier parsing.

  cc cloud-fan (we discussed this in https://github.com/apache/spark/pull/10649)
  jayadevanmurali (this causes your PR https://github.com/apache/spark/pull/11051 to fail)

  Author: Herman van Hovell <hvanhovell@questtec.nl>

  Closes #11159 from hvanhovell/SPARK-13276.

* [SPARK-13234][SQL] remove duplicated SQL metrics (Davies Liu, 2016-02-10, 24 files, -208/+80)

  For lots of SQL operators we have metrics for both input and output. Since the number of input rows should be exactly the number of output rows of the child, we only need metrics for output rows. After we improved performance using whole stage codegen, the overhead of SQL metrics is no longer trivial, so we should avoid it when it is not necessary.

  This PR removes all the SQL metrics for number of input rows, and adds a SQL metric for number of output rows to all LeafNodes. It also removes the SQL metrics from operators that have the same number of rows in input and output (for example Projection, where we may not need them).

  The new SQL UI will look like:

  ![metrics](https://cloud.githubusercontent.com/assets/40902/12965227/63614e5e-d009-11e5-88b3-84fea04f9c20.png)

  Author: Davies Liu <davies@databricks.com>

  Closes #11163 from davies/remove_metrics.

* [SPARK-12706][SQL] grouping() and grouping_id() (Davies Liu, 2016-02-10, 15 files, -52/+254)

  grouping() returns whether a column is aggregated or not; grouping_id() returns the aggregation levels.

  grouping()/grouping_id() can be used with window functions, but do not yet work in having/sort clauses; that will be fixed by another PR.

  The GROUPING__ID/grouping_id() in Hive is wrong (according to the docs), and we also implemented it wrongly. This PR changes that to match the behavior of most databases (and the Hive docs).

  Author: Davies Liu <davies@databricks.com>

  Closes #10677 from davies/grouping.

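  An illustrative query (table and column names here are hypothetical) showing how the two functions are typically used with ROLLUP:

  ```scala
  // grouping(col) is 1 when `col` is aggregated away at that grouping level and
  // 0 when it is part of the grouping; grouping_id() packs those bits together.
  sqlContext.sql("""
    SELECT course, year,
           grouping(course), grouping(year), grouping_id(course, year),
           sum(earnings)
    FROM courseSales
    GROUP BY course, year WITH ROLLUP
  """).show()
  ```
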
* [SPARK-13205][SQL] SQL Generation Support for Self Join (gatorsmile, 2016-02-11, 3 files, -2/+22)

  This PR addresses two issues:
  - Self join does not work in SQL generation.
  - When creating new instances of `LogicalRelation`, `metastoreTableIdentifier` is lost.

  liancheng Could you please review the code changes? Thank you!

  Author: gatorsmile <gatorsmile@gmail.com>

  Closes #11084 from gatorsmile/selfJoinInSQLGen.

* [SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Name Ambiguity Caused by Internally Generated Expressions (gatorsmile, 2016-02-11, 9 files, -36/+78)

  Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.

  This is OK for normal query execution since these attribute references get expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings, since expression IDs are erased.

  Here's an example Spark 1.6.0 snippet for illustration:

  ```scala
  sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
  sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
  ```

  The above code produces the following resolved plan:

  ```
  == Analyzed Logical Plan ==
  _c0: bigint
  Project [_c0#101L]
  +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
  +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
  +- Subquery t
  +- Project [id#46L AS a#47L,id#46L AS b#48L]
  +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
  ```

  Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.

  The solution is to automatically add the expression IDs into the attribute names of the Aliases and AttributeReferences that are generated by the Analyzer, for use in SQL generation.

  This PR also resolves another issue: users could use the same names as the internally generated names. The duplicate names should not cause name ambiguity, so when resolving a column, Catalyst should not pick the column that is internally generated.

  Could you review the solution? marmbrus liancheng

  I did not set the newly added flag for all the aliases and attribute references generated by Analyzers. Please let me know if I should do it. Thank you!

  Author: gatorsmile <gatorsmile@gmail.com>

  Closes #11050 from gatorsmile/namingConflicts.

* [SPARK-13274] Fix Aggregator Links on GroupedDataset Scala API (raela, 2016-02-10, 1 file, -4/+8)

  Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator

  Author: raela <raela@databricks.com>

  Closes #11158 from raelawang/master.

* [SPARK-13146][SQL] Management API for continuous queries (Tathagata Das, 2016-02-10, 17 files, -109/+1680)

  ### Management API for Continuous Queries

  **API for getting status of each query**
  - Whether active or not
  - Unique name of each query
  - Status of the sources and sinks
  - Exceptions

  **API for managing each query**
  - Immediately stop an active query
  - Waiting for a query to be terminated, correctly or with error

  **API for managing multiple queries**
  - Listing all active queries
  - Getting an active query by name
  - Waiting for any one of the active queries to be terminated

  **API for listening to query life cycle events**
  - ContinuousQueryListener API for query start, progress and termination events.

  Author: Tathagata Das <tathagata.das1565@gmail.com>

  Closes #11030 from tdas/streaming-df-management-api.

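  A hedged usage sketch of how such a management API might be exercised; the accessor and method names below are assumptions for illustration, not necessarily the exact API introduced by this PR (see also the `write.stream(...)` form from SPARK-13166 further down this log):

  ```scala
  // Sketch only: start a continuous query and manage it through the SQLContext.
  val df = sqlContext.read.stream("/path/to/input")    // hypothetical source path
  val query = df.write.stream("/path/to/output")       // handle to the active query

  // List all active queries and inspect them by name (assumed accessors).
  sqlContext.streams.active.foreach(q => println(q.name))

  // Block until this query terminates (normally or with an error), or stop it now.
  // query.awaitTermination()
  query.stop()
  ```
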
* [SPARK-13057][SQL] Add benchmark codes and the performance results for implemented compression schemes for InMemoryRelation (Takeshi YAMAMURO, 2016-02-10, 1 file, -0/+240)

  This PR adds benchmark code for in-memory cache compression to make future development and discussion smoother.

  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

  Closes #10965 from maropu/ImproveColumnarCache.

* [HOTFIX] Fix Scala 2.10 build break in TakeOrderedAndProjectSuite (Josh Rosen, 2016-02-10, 1 file, -2/+2)

* [SPARK-13254][SQL] Fix planning of TakeOrderedAndProject operator (Josh Rosen, 2016-02-10, 6 files, -44/+159)

  The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit" / #7334) inadvertently broke the planning of the TakeOrderedAndProject operator: because ReturnAnswer was the new root of the query plan, the TakeOrderedAndProject rule was unable to match before BasicOperators.

  This patch fixes this by moving the `TakeOrderedAndCollect` and `CollectLimit` rules into the same strategy. In addition, I made changes to the TakeOrderedAndProject operator in order to make its `doExecute()` method lazy and added a new TakeOrderedAndProjectSuite which tests the new code path.

  /cc davies and marmbrus for review.

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #11145 from JoshRosen/take-ordered-and-project-fix.

* [SPARK-11565] Replace deprecated DigestUtils.shaHex call (Gábor Lipták, 2016-02-10, 2 files, -2/+6)

  Author: Gábor Lipták <gliptak@gmail.com>

  Closes #9532 from gliptak/SPARK-11565.

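  For context, a small sketch of the kind of change involved: in Apache Commons Codec, `DigestUtils.shaHex` is deprecated in favor of `DigestUtils.sha1Hex` (the input string here is just a placeholder):

  ```scala
  import org.apache.commons.codec.digest.DigestUtils

  // Deprecated form (emits a warning on newer commons-codec versions):
  // val digest = DigestUtils.shaHex("some payload")

  // Replacement with identical output: a 40-character hex SHA-1 digest.
  val digest = DigestUtils.sha1Hex("some payload")
  ```
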
* [SPARK-13149][SQL] Add FileStreamSource (Shixiong Zhu, 2016-02-09, 6 files, -7/+710)

  `FileStreamSource` is an implementation of `org.apache.spark.sql.execution.streaming.Source`. It takes advantage of the existing `HadoopFsRelationProvider` to support various file formats. It remembers the files in each batch and stores them in metadata files so as to recover them when restarting. The metadata files are stored in the file system. There will be a further PR to clean up the metadata files periodically.

  This is based on the initial work from marmbrus.

  Author: Shixiong Zhu <shixiong@databricks.com>

  Closes #11034 from zsxwing/stream-df-file-source.

* [SPARK-12476][SQL] Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter (Takeshi YAMAMURO, 2016-02-10, 3 files, -21/+56)

  Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

  Current plan:

  ```
  == Optimized Logical Plan ==
  Project [col0#0,col1#1]
  +- Filter (col0#0 = xxx)
  +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

  == Physical Plan ==
  +- Filter (col0#0 = xxx)
  +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
  ```

  This patch enables the plan below:

  ```
  == Optimized Logical Plan ==
  Project [col0#0,col1#1]
  +- Filter (col0#0 = xxx)
  +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

  == Physical Plan ==
  Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
  ```

  Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

  Closes #10427 from maropu/RemoveFilterInJdbcScan.

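  A minimal sketch of the `unhandledFilters` idea in the data source API (an illustration of the hook, not the actual JDBCRelation implementation): a relation reports which pushed filters it cannot evaluate itself, and Spark keeps a `Filter` node only for those.

  ```scala
  import org.apache.spark.sql.sources.{EqualTo, Filter}

  // Suppose this relation can fully evaluate equality predicates on its own
  // (e.g. by translating them into a WHERE clause). Everything else is returned
  // as "unhandled" and re-evaluated by Spark above the scan.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case _: EqualTo => true
      case _          => false
    }
  ```
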
* [SPARK-12950][SQL] Improve lookup of BytesToBytesMap in aggregate (Davies Liu, 2016-02-09, 7 files, -51/+85)

  This PR improves the lookup of BytesToBytesMap by:

  1. Generating code to calculate the hash code of the grouping keys.
  2. Not using MemoryLocation; fetch the baseObject and offset for key and value directly (remove the indirection).

  Author: Davies Liu <davies@databricks.com>

  Closes #11010 from davies/gen_map.

* [SPARK-12888][SQL][FOLLOW-UP] benchmark the new hash expression (Wenchen Fan, 2016-02-09, 1 file, -7/+33)

  Adds the benchmark results as comments.

  The codegen version is slower than the interpreted version for the `simple` case because of 3 reasons:

  1. The codegen version uses a more complex hash algorithm than the interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153).
  2. The codegen version writes the hash value to a row first and then reads it out. I tried to create a `GenerateHasher` that can generate code to return the hash value directly and got about 60% speed up for the `simple` case; is it worth it?
  3. The row in the `simple` case only has one int field, so the runtime reflection may be removed because of branch prediction, which makes the interpreted version faster.

  The `array` case is also slow for similar reasons, e.g. array elements are of the same type, so the interpreted version can probably get rid of runtime reflection by branch prediction.

  Author: Wenchen Fan <wenchen@databricks.com>

  Closes #10917 from cloud-fan/hash-benchmark.

* [SPARK-12992] [SQL] Support vectorized decoding in UnsafeRowParquetRecordReader.Nong Li2016-02-0816-90/+549
| | | | | | | | | | | | | | | | | WIP: running tests. Code needs a bit of clean up. This patch completes the vectorized decoding with the goal of passing the existing tests. There is still more patches to support the rest of the format spec, even just for flat schemas. This patch adds a new flag to enable the vectorized decoding. Tests were updated to try with both modes where applicable. Once this is working well, we can remove the previous code path. Author: Nong Li <nong@databricks.com> Closes #11055 from nongli/spark-12992-2.
* [SPARK-13095][SQL] improve performance for broadcast join with dimension table (Davies Liu, 2016-02-08, 8 files, -69/+438)

  This PR improves the performance of broadcast joins with dimension tables, which are common in data warehouses.

  If the join key can fit in a long, we will use a special api `get(Long)` to get the rows from HashedRelation. If the HashedRelation only has unique keys, we will use a special api `getValue(Long)` or `getValue(InternalRow)`. If the keys fit within a long and are dense, we will use an array of UnsafeRow instead of a hash map.

  TODO: will do cleanup

  Author: Davies Liu <davies@databricks.com>

  Closes #11065 from davies/gen_dim.

* [SPARK-13101][SQL] nullability of array type element should not fail analysis of encoder (Wenchen Fan, 2016-02-08, 7 files, -104/+64)

  Nullability should only be considered an optimization rather than part of the type system, so instead of failing analysis on mismatched nullability, we should pass analysis and add a runtime null check.

  Author: Wenchen Fan <wenchen@databricks.com>

  Closes #11035 from cloud-fan/ignore-nullability.

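  A hedged illustration of the behavior change (the exact data and schema below are assumptions based on the description, not the PR's test case):

  ```scala
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._
  import sqlContext.implicits._

  case class Rec(xs: Seq[Int]) // Int elements are non-nullable from the encoder's view

  // The data's schema declares nullable array elements.
  val schema = StructType(Seq(StructField("xs", ArrayType(IntegerType, containsNull = true))))
  val rows = sqlContext.sparkContext.parallelize(Seq(Row(Seq(1, 2, 3))))
  val df = sqlContext.createDataFrame(rows, schema)

  // Previously this failed analysis because of the element-nullability mismatch;
  // now it passes, and a null element would only surface as a runtime error.
  val ds = df.as[Rec]
  ```
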
* [SPARK-8964][SQL] Use Exchange to perform shuffle in Limit (Josh Rosen, 2016-02-08, 8 files, -160/+223)

  This patch changes the implementation of the physical `Limit` operator so that it relies on the `Exchange` operator to perform data movement rather than directly using `ShuffledRDD`. In addition to improving efficiency, this lays the necessary groundwork for further optimization of limit, such as limit pushdown or whole-stage codegen.

  At a high level, this replaces the old physical `Limit` operator with two new operators, `LocalLimit` and `GlobalLimit`. `LocalLimit` performs per-partition limits, while `GlobalLimit` applies the final limit to a single partition; `GlobalLimit` declares that its `requiredInputDistribution` is `SinglePartition`, which will cause the planner to use an `Exchange` to perform the appropriate shuffles. Thus, a logical `Limit` appearing in the middle of a query plan will be expanded into `LocalLimit -> Exchange to one partition -> GlobalLimit`.

  In the old code, calling `someDataFrame.limit(100).collect()` or `someDataFrame.take(100)` would actually skip the shuffle and use a fast path which used `executeTake()` in order to avoid computing all partitions in case only a small number of rows were requested. This patch preserves this optimization by treating logical `Limit` operators specially when they appear as the terminal operator in a query plan: if a `Limit` is the final operator, then we will plan a special `CollectLimit` physical operator which implements the old `take()`-based logic.

  In order to be able to match on operators only at the root of the query plan, this patch introduces a special `ReturnAnswer` logical operator which functions similarly to `BroadcastHint`: this dummy operator is inserted at the root of the optimized logical plan before invoking the physical planner, allowing the planner to pattern-match on it. A brief illustration of the two paths is sketched below.

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #7334 from JoshRosen/remove-copy-in-limit.

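  A brief illustration of the two code paths described above (the DataFrame and column names are placeholders):

  ```scala
  // Terminal limit: planned as CollectLimit, which keeps the old take()-style
  // fast path and avoids computing every partition.
  someDataFrame.limit(100).collect()

  // Limit in the middle of a plan: expanded into
  // LocalLimit -> Exchange (to a single partition) -> GlobalLimit.
  someDataFrame.limit(100).groupBy("someColumn").count().collect()
  ```
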
* [SPARK-5865][API DOC] Add doc warnings for methods that return local data structures (Tommy YU, 2016-02-06, 1 file, -0/+4)

  rxin srowen I worked out a note message for the rdd.take function; please help to review. If it's fine, I can apply it to all the other functions later.

  Author: Tommy YU <tummyyu@163.com>

  Closes #10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.

* [SPARK-13171][CORE] Replace future calls with Future (Jakob Odersky, 2016-02-05, 4 files, -6/+6)

  Trivial search-and-replace to eliminate deprecation warnings in Scala 2.11. Also works with 2.10.

  Author: Jakob Odersky <jakob@odersky.com>

  Closes #11085 from jodersky/SPARK-13171.

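  For reference, a small sketch of the kind of replacement involved: `scala.concurrent.future { ... }` is deprecated in Scala 2.11 in favor of `Future { ... }`.

  ```scala
  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global

  // Deprecated in Scala 2.11 (still works in 2.10):
  // val f = scala.concurrent.future { 40 + 2 }

  // Preferred form, identical behavior:
  val f = Future { 40 + 2 }
  ```
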
* [SPARK-13215][SQL] remove fallback in codegen (Davies Liu, 2016-02-05, 3 files, -66/+8)

  Since we removed the configuration for codegen, we rely heavily on codegen (also, TungstenAggregate requires the generated MutableProjection to update UnsafeRow), so we should remove the fallback, which could confuse users. See the discussion in SPARK-13116.

  Author: Davies Liu <davies@databricks.com>

  Closes #11097 from davies/remove_fallback.

* [SPARK-12939][SQL] migrate encoder resolution logic to Analyzer (Wenchen Fan, 2016-02-05, 12 files, -104/+230)

  https://issues.apache.org/jira/browse/SPARK-12939

  Now we catch `ObjectOperator` in `Analyzer` and resolve the `fromRowExpression/deserializer` inside it. Also update `MapGroups` and `CoGroup` to pass in `dataAttributes`, so that we can correctly resolve the value deserializer (the `child.output` contains both grouping keys and values, which may mess things up if they have same-name attributes). End-to-end tests are added.

  Follow-ups:
  * remove encoders from typed aggregate expression.
  * completely remove resolve/bind in `ExpressionEncoder`

  Author: Wenchen Fan <wenchen@databricks.com>

  Closes #10852 from cloud-fan/bug.

* [SPARK-13166][SQL] Rename DataStreamReaderWriterSuite to DataFrameReaderWriterSuite (Shixiong Zhu, 2016-02-05, 1 file, -1/+1)

  A follow-up PR for #11062, because it didn't rename the test suite.

  Author: Shixiong Zhu <shixiong@databricks.com>

  Closes #11096 from zsxwing/rename.

* [SPARK-13187][SQL] Add boolean/long/double options in DataFrameReader/Writer (Reynold Xin, 2016-02-04, 3 files, -0/+67)

  This patch adds option functions for boolean, long, and double types. This makes it slightly easier for Spark users to specify options without turning them into strings, using the JSON data source as an example.

  Before this patch:

  ```scala
  sqlContext.read.option("primitivesAsString", "true").json("/path/to/json")
  ```

  After this patch:

  ```scala
  sqlContext.read.option("primitivesAsString", true).json("/path/to/json")
  ```

  Author: Reynold Xin <rxin@databricks.com>

  Closes #11072 from rxin/SPARK-13187.

* [SPARK-13208][CORE] Replace use of Pairs with Tuple2s (Jakob Odersky, 2016-02-04, 2 files, -4/+4)

  Another trivial deprecation fix for Scala 2.11.

  Author: Jakob Odersky <jakob@odersky.com>

  Closes #11089 from jodersky/SPARK-13208.

* [SPARK-12850][SQL] Support Bucket Pruning (Predicate Pushdown for Bucketed Tables) (gatorsmile, 2016-02-04, 3 files, -10/+245)

  JIRA: https://issues.apache.org/jira/browse/SPARK-12850

  This PR adds support for bucket pruning when the predicates are `EqualTo`, `EqualNullSafe`, `IsNull`, `In`, and `InSet`. Like Hive, in this PR bucket pruning works when the bucketing key has one and only one column.

  So far, I have not found a way to verify how many buckets are actually scanned. However, I did verify it while debugging. Could you provide a suggestion how to do it properly? Thank you!

  cloud-fan yhuai rxin marmbrus

  BTW, we can add more cases to support complex predicates including `Or` and `And`. Please let me know if I should do it in this PR. Maybe we also need to add test cases to verify that bucket pruning works well for each data type.

  Author: gatorsmile <gatorsmile@gmail.com>

  Closes #10942 from gatorsmile/pruningBuckets.

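  An illustrative sketch of the scenario this targets; the table, column, and bucket count are hypothetical, and the bucketed-write API shown is assumed to be available in this development branch:

  ```scala
  // Write a table bucketed on a single column; bucket pruning applies when the
  // bucketing key has exactly one column.
  df.write
    .bucketBy(8, "col0")
    .saveAsTable("bucketed_table")

  // An equality predicate on the bucketing column lets the scan read only the
  // bucket files that can contain 'xxx' instead of all 8 buckets.
  sqlContext.sql("SELECT * FROM bucketed_table WHERE col0 = 'xxx'").show()
  ```
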
* [SPARK-13079][SQL] InMemoryCatalog follow-ups (Andrew Or, 2016-02-04, 2 files, -5/+22)

  This patch incorporates review feedback from #11069, which is already merged.

  Author: Andrew Or <andrew@databricks.com>

  Closes #11080 from andrewor14/catalog-follow-ups.

* [SPARK-13168][SQL] Collapse adjacent repartition operators (Josh Rosen, 2016-02-04, 6 files, -10/+33)

  Spark SQL should collapse adjacent `Repartition` operators and only keep the last one.

  Author: Josh Rosen <joshrosen@databricks.com>

  Closes #11064 from JoshRosen/collapse-repartition.

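  A small illustration of the rule's effect (the DataFrame here is a placeholder):

  ```scala
  // Before optimization the plan contains two Repartition operators; with this
  // rule only the last one, repartition(100), remains.
  val collapsed = df.repartition(10).repartition(100)
  collapsed.explain(true)
  ```
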
* [SPARK-12828][SQL] Natural join follow-up (Reynold Xin, 2016-02-03, 3 files, -12/+17)

  This is a small addendum to #10762 to make the code more robust against future changes.

  Author: Reynold Xin <rxin@databricks.com>

  Closes #11070 from rxin/SPARK-12828-natural-join.

* [SPARK-12828][SQL] add natural join support (Daoyuan Wang, 2016-02-03, 11 files, -11/+198)

  Jira: https://issues.apache.org/jira/browse/SPARK-12828

  Author: Daoyuan Wang <daoyuan.wang@intel.com>

  Closes #10762 from adrian-wang/naturaljoin.

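  For illustration, a natural join implicitly equi-joins on the columns the two relations have in common (table and column names here are hypothetical):

  ```scala
  // If t1 and t2 both have a column `id`, this is equivalent to
  // SELECT ... FROM t1 JOIN t2 ON t1.id = t2.id, with the common
  // column appearing only once in the output.
  sqlContext.sql("SELECT * FROM t1 NATURAL JOIN t2").show()
  ```
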
* [SPARK-13079][SQL] Extend and implement InMemoryCatalog (Andrew Or, 2016-02-03, 3 files, -47/+328)

  This is a step towards consolidating `SQLContext` and `HiveContext`.

  This patch extends the existing Catalog API added in #10982 to include methods for handling table partitions. In particular, a partition is identified by `PartitionSpec`, which is just a `Map[String, String]`.

  The Catalog is still not used by anything yet, but its API is now more or less complete and an implementation is fully tested. About 200 lines are test code.

  Author: Andrew Or <andrew@databricks.com>

  Closes #11069 from andrewor14/catalog.

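  A tiny sketch of what a partition specification of that shape looks like (the partition column names and values are made up):

  ```scala
  // A partition is identified by a map from partition column name to value.
  type PartitionSpec = Map[String, String]

  val spec: PartitionSpec = Map("year" -> "2016", "month" -> "02")
  ```
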
* [SPARK-13152][CORE] Fix task metrics deprecation warning (Holden Karau, 2016-02-03, 1 file, -2/+2)

  Make internal, non-deprecated versions of incBytesRead and incRecordsRead so we don't have unnecessary deprecation warnings in our build.

  Right now incBytesRead and incRecordsRead are marked as deprecated and for internal use only. We should make private[spark] versions which are not deprecated and switch to those internally, so as to not clutter up the warning messages when building.

  cc andrewor14 who did the initial deprecation

  Author: Holden Karau <holden@us.ibm.com>

  Closes #11056 from holdenk/SPARK-13152-fix-task-metrics-deprecation-warnings.

* [SPARK-13131][SQL] Use best and average time in benchmark (Davies Liu, 2016-02-03, 1 file, -89/+65)

  Best time is more stable than average time; this also adds a column for nanoseconds per row (which could be used to estimate the contribution of each component in a query). Having best time and average time together gives more information (we can see a kind of variance). Rate, time per row, and relative are all calculated using best time.

  The result looks like this:

  ```
  Intel(R) Core(TM) i7-4558U CPU  2.80GHz
  rang/filter/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  -------------------------------------------------------------------------------------------
  rang/filter/sum codegen=false          14332 / 16646         36.0          27.8       1.0X
  rang/filter/sum codegen=true             845 /   940        620.0           1.6      17.0X
  ```

  Author: Davies Liu <davies@databricks.com>

  Closes #11018 from davies/gen_bench.

* [SPARK-13166][SQL] Remove DataStreamReader/Writer (Reynold Xin, 2016-02-03, 8 files, -315/+86)

  They seem redundant and we can simply use DataFrameReader/Writer. The new usage looks like:

  ```scala
  val df = sqlContext.read.stream("...")
  val handle = df.write.stream("...")
  handle.stop()
  ```

  Author: Reynold Xin <rxin@databricks.com>

  Closes #11062 from rxin/SPARK-13166.

* [SPARK-13157][SQL] Support any kind of input for SQL commands (Herman van Hovell, 2016-02-03, 4 files, -6/+46)

  The ```SparkSqlLexer``` currently swallows characters which have not been defined in the grammar. This causes problems with SQL commands, such as: ```add jar file:///tmp/ab/TestUDTF.jar```. In this example the `````` is swallowed.

  This PR adds an extra Lexer rule to handle such input, and makes a tiny modification to the ```ASTNode```.

  cc davies liancheng

  Author: Herman van Hovell <hvanhovell@questtec.nl>

  Closes #11052 from hvanhovell/SPARK-13157.

* [SPARK-12798][SQL] generated BroadcastHashJoin (Davies Liu, 2016-02-03, 7 files, -20/+169)

  A row from the stream side could match multiple rows on the build side, and the loop over these matched rows should not be interrupted when emitting a row, so we buffer the output rows in a linked list and check the termination condition in the producer loop (for example, Range or Aggregate).

  Author: Davies Liu <davies@databricks.com>

  Closes #10989 from davies/gen_join.

* [SPARK-12957][SQL] Initial support for constraint propagation in SparkSQL (Sameer Agarwal, 2016-02-02, 4 files, -7/+302)

  Based on the semantics of a query, we can derive a number of data constraints on the output of each (logical or physical) operator. For instance, if a filter defines `'a > 10`, we know that the output data of this filter satisfies 2 constraints:

  1. `'a > 10`
  2. `isNotNull('a)`

  This PR proposes a possible way of keeping track of these constraints and propagating them in the logical plan, which can then help us build more advanced optimizations (such as pruning redundant filters, optimizing joins, among others). We define constraints as a set of (implicitly conjunctive) expressions. For e.g., if a filter operator has constraints = `Set('a > 10, 'b < 100)`, it's implied that the outputs satisfy both individual constraints (i.e., `'a > 10` AND `'b < 100`). A toy sketch of the derivation appears below.

  Design Document: https://docs.google.com/a/databricks.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit?usp=sharing

  Author: Sameer Agarwal <sameer@databricks.com>

  Closes #10844 from sameeragarwal/constraints.

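  A toy, self-contained sketch of the derivation described above, using a minimal expression ADT rather than Catalyst's own classes (all names here are illustrative only):

  ```scala
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr
  case class IsNotNull(child: Expr) extends Expr

  // A comparison constraint on an attribute also implies that the attribute
  // itself is not null, mirroring the `'a > 10` => `isNotNull('a)` example.
  def constraintsOf(predicate: Expr): Set[Expr] = predicate match {
    case GreaterThan(a: Attr, _) => Set(predicate, IsNotNull(a))
    case other                   => Set(other)
  }

  // constraintsOf(GreaterThan(Attr("a"), Lit(10)))
  //   == Set(GreaterThan(Attr("a"), Lit(10)), IsNotNull(Attr("a")))
  ```
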
* [SPARK-13147][SQL] improve readability of generated code (Davies Liu, 2016-02-02, 7 files, -39/+63)

  1. Try to avoid the suffix (unique id).
  2. Remove the comment if there is no code generated.
  3. Re-arrange the order of functions.
  4. Drop the new line for inlined blocks.

  Author: Davies Liu <davies@databricks.com>

  Closes #11032 from davies/better_suffix.

* [SPARK-12951][SQL] support spilling in generated aggregate (Davies Liu, 2016-02-02, 1 file, -30/+142)

  This PR adds spilling support for generated TungstenAggregate. If spilling happens, it's not that bad to fall back to the iterator-based sort-merge aggregate (not generated).

  The changes will be covered by TungstenAggregationQueryWithControlledFallbackSuite.

  Author: Davies Liu <davies@databricks.com>

  Closes #10998 from davies/gen_spilling.

* [SPARK-12992][SQL] Update parquet reader to support more types when decoding to ColumnarBatch (Nong Li, 2016-02-02, 6 files, -21/+424)

  This patch implements support for more types when doing the vectorized decode. There are a few more types remaining, but they should be very straightforward after this. This code has a few copy-and-paste pieces, but they are difficult to eliminate due to performance considerations.

  Specifically, this patch adds support for:
  - String, Long, Byte types
  - Dictionary encoding for those types.

  Author: Nong Li <nong@databricks.com>

  Closes #10908 from nongli/spark-12992.
