spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-14011][CORE][SQL] Enable `LineLength` Java checkstyle rule	Dongjoon Hyun	2016-03-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? [Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables LineLength checkstyle again. To help that, this also introduces RedundantImport and RedundantModifier, too. The following is the diff on `checkstyle.xml`. ```xml - <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places --> - <!-- <module name="LineLength"> <property name="max" value="100"/> <property name="ignorePattern" value="^package.\|^import.\|a href\|href\|http://\|https://\|ftp://"/> </module> - --> <module name="NoLineWrap"/> <module name="EmptyBlock"> <property name="option" value="TEXT"/> -167,5 +164,7 </module> <module name="CommentsIndentation"/> <module name="UnusedImports"/> + <module name="RedundantImport"/> + <module name="RedundantModifier"/> ``` ## How was this patch tested? Currently, `lint-java` is disabled in Jenkins. It needs a manual test. After passing the Jenkins tests, `dev/lint-java` should passes locally. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11831 from dongjoon-hyun/SPARK-14011.
*	[SPARK-14019][SQL] Remove noop SortOrder in Sort	gatorsmile	2016-03-21	2	-0/+78
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#### What changes were proposed in this pull request? This PR is to add a new Optimizer rule for pruning Sort if its SortOrder is no-op. In the phase of Optimizer, if a specific `SortOrder` does not have any reference, it has no effect on the sorting results. If `Sort` is empty, remove the whole `Sort`. For example, in the following SQL query ```SQL SELECT * FROM t ORDER BY NULL + 5 ``` Before the fix, the plan is like ``` == Analyzed Logical Plan == a: int, b: int Sort [(cast(null as int) + 5) ASC], true +- Project [a#92,b#93] +- SubqueryAlias t +- Project [_1#89 AS a#92,_2#90 AS b#93] +- LocalRelation [_1#89,_2#90], [[1,2],[1,2]] == Optimized Logical Plan == Sort [null ASC], true +- LocalRelation [a#92,b#93], [[1,2],[1,2]] == Physical Plan == WholeStageCodegen : +- Sort [null ASC], true, 0 : +- INPUT +- Exchange rangepartitioning(null ASC, 5), None +- LocalTableScan [a#92,b#93], [[1,2],[1,2]] ``` After the fix, the plan is like ``` == Analyzed Logical Plan == a: int, b: int Sort [(cast(null as int) + 5) ASC], true +- Project [a#92,b#93] +- SubqueryAlias t +- Project [_1#89 AS a#92,_2#90 AS b#93] +- LocalRelation [_1#89,_2#90], [[1,2],[1,2]] == Optimized Logical Plan == LocalRelation [a#92,b#93], [[1,2],[1,2]] == Physical Plan == LocalTableScan [a#92,b#93], [[1,2],[1,2]] ``` cc rxin cloud-fan marmbrus Thanks! #### How was this patch tested? Added a test suite for covering this rule Author: gatorsmile <gatorsmile@gmail.com> Closes #11840 from gatorsmile/sortElimination.
*	[SPARK-14004][SQL][MINOR] AttributeReference and Alias should only use the ↵	Cheng Lian	2016-03-19	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	first qualifier to generate SQL strings ## What changes were proposed in this pull request? Current implementations of `AttributeReference.sql` and `Alias.sql` joins all available qualifiers, which is logically wrong. But this implementation mistake doesn't cause any real SQL generation bugs though, since there is always at most one qualifier for any given `AttributeReference` or `Alias`. This PR fixes this issue by only picking the first qualifiers. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Existing tests should be enough. Author: Cheng Lian <lian@databricks.com> Closes #11820 from liancheng/spark-14004-single-qualifier.
*	[SPARK-13838] [SQL] Clear variable code to prevent it to be re-evaluated in ↵	Liang-Chi Hsieh	2016-03-17	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	BoundAttribute JIRA: https://issues.apache.org/jira/browse/SPARK-13838 ## What changes were proposed in this pull request? We should also clear the variable code in `BoundReference.genCode` to prevent it to be evaluated twice, as we did in `evaluateVariables`. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #11674 from viirya/avoid-reevaluate.
*	[SPARK-13427][SQL] Support USING clause in JOIN.	Dilip Biswal	2016-03-17	10	-79/+190
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Support queries that JOIN tables with USING clause. SELECT * from table1 JOIN table2 USING <column_list> USING clause can be used as a means to simplify the join condition when : 1) Equijoin semantics is desired and 2) The column names in the equijoin have the same name. We already have the support for Natural Join in Spark. This PR makes use of the already existing infrastructure for natural join to form the join condition and also the projection list. ## How was the this patch tested? Have added unit tests in SQLQuerySuite, CatalystQlSuite, ResolveNaturalJoinSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #11297 from dilipbiswal/spark-13427.
*	[SPARK-13928] Move org.apache.spark.Logging into ↵	Wenchen Fan	2016-03-17	10	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	org.apache.spark.internal.Logging ## What changes were proposed in this pull request? Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11764 from cloud-fan/logger.
*	Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ↵	Davies Liu	2016-03-16	4	-26/+14
\| \| \| \| \| \|	ColumnPruning and EliminateOperator" This reverts commit 99bd2f0e94657687834c5c59c4270c1484c9f595.
*	[SPARK-13118][SQL] Expression encoding for optional synthetic classes	Jakob Odersky	2016-03-16	2	-3/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Fix expression generation for optional types. Standard Java reflection causes issues when dealing with synthetic Scala objects (things that do not map to Java and thus contain a dollar sign in their name). This patch introduces Scala reflection in such cases. This patch also adds a regression test for Dataset's handling of classes defined in package objects (which was the initial purpose of this PR). ## How was this patch tested? A new test in ExpressionEncoderSuite that tests optional inner classes and a regression test for Dataset's handling of package objects. Author: Jakob Odersky <jakob@odersky.com> Closes #11708 from jodersky/SPARK-13118-package-objects.
*	[SPARK-13873] [SQL] Avoid copy of UnsafeRow when there is no join in whole ↵	Davies Liu	2016-03-16	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	stage codegen ## What changes were proposed in this pull request? We need to copy the UnsafeRow since a Join could produce multiple rows from single input rows. We could avoid that if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen. Updated the benchmark for `collect`, we could see 20-30% speedup. ## How was this patch tested? existing unit tests. Author: Davies Liu <davies@databricks.com> Closes #11740 from davies/avoid_copy2.
*	[SPARK-13923][SQL] Implement SessionCatalog	Andrew Or	2016-03-16	8	-141/+1555
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`. A recent patch #11573 parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands. ## How was this patch tested? 800+ lines of tests in `SessionCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Closes #11750 from andrewor14/temp-catalog.
*	[SPARK-11011][SQL] Narrow type of UDT serialization	Jakob Odersky	2016-03-16	4	-30/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type. ## How was this patch tested? Existing tests were successfully run on local machine. Author: Jakob Odersky <jakob@odersky.com> Closes #11379 from jodersky/SPARK-11011-udt-types.
*	[SPARK-13869][SQL] Remove redundant conditions while combining filters	Sameer Agarwal	2016-03-16	2	-4/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? [I'll link it to the JIRA once ASF JIRA is back online] This PR modifies the existing `CombineFilters` rule to remove redundant conditions while combining individual filter predicates. For instance, queries of the form `table.where('a === 1 && 'b === 1).where('a === 1 && 'c === 1)` will now be optimized to ` table.where('a === 1 && 'b === 1 && 'c === 1)` (instead of ` table.where('a === 1 && 'a === 1 && 'b === 1 && 'c === 1)`) ## How was this patch tested? Unit test in `FilterPushdownSuite` Author: Sameer Agarwal <sameer@databricks.com> Closes #11670 from sameeragarwal/combine-filters.
*	[SPARK-13871][SQL] Support for inferring filters from data constraints	Sameer Agarwal	2016-03-16	2	-62/+63
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR generalizes the `NullFiltering` optimizer rule in catalyst to `InferFiltersFromConstraints` that can automatically infer all relevant filters based on an operator's constraints while making sure of 2 things: (a) no redundant filters are generated, and (b) filters that do not contribute to any further optimizations are not generated. ## How was this patch tested? Extended all tests in `InferFiltersFromConstraintsSuite` (that were initially based on `NullFilteringSuite` to test filter inference in `Filter` and `Join` operators. In particular the 2 tests ( `single inner join with pre-existing filters: filter out values on either side` and `multiple inner joins: filter out values on all sides on equi-join keys` attempts to highlight/test the real potential of this rule for join optimization. Author: Sameer Agarwal <sameer@databricks.com> Closes #11665 from sameeragarwal/infer-filters.
*	[SPARK-13827][SQL] Can't add subquery to an operator with same-name outputs ↵	Wenchen Fan	2016-03-16	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	while generate SQL string ## What changes were proposed in this pull request? This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into SQL string and put it after FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers(e.g. the `Join` operator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs. To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore. For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like: ``` SELECT sq_1.key, sq_1.max FROM ( SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max FROM ( SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y ) AS sq_0 ) AS sq_1 ``` You can see, the `key` columns become ambiguous after `sq_0`. After this PR, it will generate something like: ``` SELECT attr_30 AS key, attr_37 AS max FROM ( SELECT attr_30, attr_37 FROM ( SELECT attr_30, attr_35, MAX(attr_35) AS attr_37 FROM ( SELECT attr_30, attr_35 FROM (SELECT key AS attr_30 FROM t1) AS sq_0 INNER JOIN (SELECT key AS attr_35 FROM t1) AS sq_1 ) AS sq_2 ) AS sq_3 ) AS sq_4 ``` The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore. ## How was this patch tested? existing tests and new tests in LogicalPlanToSQLSuite. Author: Wenchen Fan <wenchen@databricks.com> Closes #11658 from cloud-fan/gensql.
*	[SPARK-13924][SQL] officially support multi-insert	Wenchen Fan	2016-03-16	1	-11/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? There is a feature of hive SQL called multi-insert. For example: ``` FROM src INSERT OVERWRITE TABLE dest1 SELECT key + 1 INSERT OVERWRITE TABLE dest2 SELECT key WHERE key > 2 INSERT OVERWRITE TABLE dest3 SELECT col EXPLODE(arr) exp AS col ... ``` We partially support it currently, with some limitations: 1) WHERE can't reference columns produced by LATERAL VIEW. 2) It's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take place right away like other commands, e.g. CREATE TABLE. This PR removes these limitations and make us fully support multi-insert. ## How was this patch tested? new tests in `SQLQuerySuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11754 from cloud-fan/lateral-view.
*	[SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharset ↵	Sean Owen	2016-03-16	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	follow up ## What changes were proposed in this pull request? Follow up to https://github.com/apache/spark/pull/11657 - Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8` - And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests) - And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11725 from srowen/SPARK-13823.2.
*	[MINOR][TEST][SQL] Remove wrong "expected" parameter in checkNaNWithoutCodegen	Yucai Yu	2016-03-15	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Remove the wrong "expected" parameter in MathFunctionsSuite.scala's checkNaNWithoutCodegen. This function is to check NaN value, so the "expected" parameter is useless. The Callers do not pass "expected" value and the similar function like checkNaNWithGeneratedProjection and checkNaNWithOptimization do not use it also. Author: Yucai Yu <yucai.yu@intel.com> Closes #11718 from yucai/unused_expected.
*	[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and ↵	gatorsmile	2016-03-15	4	-14/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	EliminateOperator #### What changes were proposed in this pull request? Before this PR, two Optimizer rules `ColumnPruning` and `PushPredicateThroughProject` reverse each other's effects. Optimizer always reaches the max iteration when optimizing some queries. Extra `Project` are found in the plan. For example, below is the optimized plan after reaching 100 iterations: ``` Join Inner, Some((cast(id1#16 as bigint) = id1#18L)) :- Project [id1#16] : +- Filter isnotnull(cast(id1#16 as bigint)) : +- Project [id1#16] : +- Relation[id1#16,newCol#17] JSON part: struct<>, data: struct<id1:int,newCol:int> +- Filter isnotnull(id1#18L) +- Relation[id1#18L] JSON part: struct<>, data: struct<id1:bigint> ``` This PR splits the optimizer rule `ColumnPruning` to `ColumnPruning` and `EliminateOperators` The issue becomes worse when having another rule `NullFiltering`, which could add extra Filters for `IsNotNull`. We have to be careful when introducing extra `Filter` if the benefit is not large enough. Another PR will be submitted by sameeragarwal to handle this issue. cc sameeragarwal marmbrus In addition, `ColumnPruning` should not push `Project` through non-deterministic `Filter`. This could cause wrong results. This will be put in a separate PR. cc davies cloud-fan yhuai #### How was this patch tested? Modified the existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #11682 from gatorsmile/viewDuplicateNames.
*	[SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed ↵	Michael Armbrust	2016-03-14	2	-0/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	scans of files This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed. Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties: - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns in the public API of `org.apache.spark.sql.sources.FileFormat` - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf) - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning. - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm. Currently only a testing source is planned / tested using this strategy. In follow-up PRs we will port the existing formats to this API. A stub for `FileScanRDD` is also added, but most methods remain unimplemented. Other minor cleanups: - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic. This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore) - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out. - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes. Author: Michael Armbrust <michael@databricks.com> Closes #11646 from marmbrus/fileStrategy.
*	[SPARK-13658][SQL] BooleanSimplification rule is slow with large boolean ↵	Liang-Chi Hsieh	2016-03-14	2	-29/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	expressions JIRA: https://issues.apache.org/jira/browse/SPARK-13658 ## What changes were proposed in this pull request? Quoted from JIRA description: When run TPCDS Q3 [1] with lots predicates to filter out the partitions, the optimizer rule BooleanSimplification take about 2 seconds (it use lots of sematicsEqual, which require copy the whole tree). It will great if we could speedup it. [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql How to speed up it: When we ask the canonicalized expression in `Expression`, it calls `Canonicalize.execute` on itself. `Canonicalize.execute` basically transforms up all expressions included in this expression. However, we don't keep the canonicalized versions for these children expressions. So in next time we ask the canonicalized expressions for the children expressions (e.g., `BooleanSimplification`), we will rerun `Canonicalize.execute` on each of them. It wastes much time. By forcing the children expressions to get and keep their canonicalized versions first, we can avoid re-canonicalize these expressions. I simply benchmark it with an expression which is part of the where clause in TPCDS Q3: val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int, 'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int) val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk === 'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) && (('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) \|\| ('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) \|\| ('ss_sold_date_sk > 2416085 && 'ss_sold_date_sk < 2416115) \|\| ('ss_sold_date_sk > 2416450 && 'ss_sold_date_sk < 2416480) \|\| ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk < 2416846) \|\| ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) \|\| ('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) \|\| ('ss_sold_date_sk > 2417911 && 'ss_sold_date_sk < 2417941) \|\| ('ss_sold_date_sk > 2418277 && 'ss_sold_date_sk < 2418307) \|\| ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk < 2418672) \|\| ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) \|\| ('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) \|\| ('ss_sold_date_sk > 2419738 && 'ss_sold_date_sk < 2419768) \|\| ('ss_sold_date_sk > 2420103 && 'ss_sold_date_sk < 2420133) \|\| ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) \|\| ('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) \|\| ('ss_sold_date_sk > 2421199 && 'ss_sold_date_sk < 2421229) \|\| ('ss_sold_date_sk > 2421564 && 'ss_sold_date_sk < 2421594) \|\| ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk < 2421959) \|\| ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) \|\| ('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) \|\| ('ss_sold_date_sk > 2423025 && 'ss_sold_date_sk < 2423055) \|\| ('ss_sold_date_sk > 2423390 && 'ss_sold_date_sk < 2423420) \|\| ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk < 2423785) \|\| ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) \|\| ('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) \|\| ('ss_sold_date_sk > 2424851 && 'ss_sold_date_sk < 2424881) \|\| ('ss_sold_date_sk > 2425216 && 'ss_sold_date_sk < 2425246) \|\| ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk < 2425612) \|\| ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) \|\| ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) \|\| ('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) \|\| ('ss_sold_date_sk > 2427043 && 'ss_sold_date_sk < 2427073) \|\| ('ss_sold_date_sk > 2427408 && 'ss_sold_date_sk < 2427438) \|\| ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk < 2427803) \|\| ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) \|\| ('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) \|\| ('ss_sold_date_sk > 2428869 && 'ss_sold_date_sk < 2428899) \|\| ('ss_sold_date_sk > 2429234 && 'ss_sold_date_sk < 2429264) \|\| ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk < 2429629) \|\| ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) \|\| ('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) \|\| ('ss_sold_date_sk > 2430695 && 'ss_sold_date_sk < 2430725) \|\| ('ss_sold_date_sk > 2431060 && 'ss_sold_date_sk < 2431090) \|\| ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk < 2431456) \|\| ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) \|\| ('ss_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) \|\| ('ss_sold_date_sk > 2432521 && 'ss_sold_date_sk < 2432551) \|\| ('ss_sold_date_sk > 2432887 && 'ss_sold_date_sk < 2432917) \|\| ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk < 2433282) \|\| ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) \|\| ('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) \|\| ('ss_sold_date_sk > 2434348 && 'ss_sold_date_sk < 2434378) \|\| ('ss_sold_date_sk > 2434713 && 'ss_sold_date_sk < 2434743))) val plan = testRelation.where(input).analyze val actual = Optimize.execute(plan) With this patch: 352 milliseconds 346 milliseconds 340 milliseconds Without this patch: 585 milliseconds 880 milliseconds 677 milliseconds ## How was this patch tested? Existing tests should pass. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11647 from viirya/improve-expr-canonicalize.
*	[MINOR][DOCS] Fix more typos in comments/strings.	Dongjoon Hyun	2016-03-14	17	-18/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR fixes 135 typos over 107 files: * 121 typos in comments * 11 typos in testcase name * 3 typos in log messages ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11689 from dongjoon-hyun/fix_more_typos.
*	[SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> ↵	Sean Owen	2016-03-13	8	-19/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	byte[] conversions (and remaining Coverity items) ## What changes were proposed in this pull request? - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8 - Same for `InputStreamReader` and `OutputStreamWriter` constructors - Standardizes on UTF-8 everywhere - Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`) - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit https://github.com/srowen/spark/commit/1deecd8d9ca986d8adb1a42d315890ce5349d29c ) ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11657 from srowen/SPARK-13823.
*	[SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and ↵	Davies Liu	2016-03-12	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	data sources ## What changes were proposed in this pull request? This PR split the PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames that is created from existing RDD. PhysicalScan is used for DataFrame that is created from data sources. This enable use to apply different optimization on both of them. Also fix the problem for sameResult() on two DataSourceScan. Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad). ## How was this patch tested? Existing tests. Manually tested with TPCDS query Q59 and Q64, all those duplicated exchanges can be re-used now, also saw there are 40+% performance improvement (saving half of the scan). Author: Davies Liu <davies@databricks.com> Closes #11514 from davies/existing_rdd.
*	[SPARK-13139][SQL] Parse Hive DDL commands ourselves	Andrew Or	2016-03-11	1	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch is ported over from viirya's changes in #11048. Currently for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a pretext to merging `SQLContext` and `HiveContext`. Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog. ## How was this patch tested? Jenkins, new `DDLCommandSuite`, which comprises of about 40% of the changes here. Author: Andrew Or <andrew@databricks.com> Closes #11573 from andrewor14/parser-plus-plus.
*	[SPARK-12718][SPARK-13720][SQL] SQL generation support for window functions	Wenchen Fan	2016-03-11	3	-9/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add SQL generation support for window functions. The idea is simple, just treat `Window` operator like `Project`, i.e. add subquery to its child when necessary, generate a `SELECT ... FROM ...` SQL string, implement `sql` method for window related expressions, e.g. `WindowSpecDefinition`, `WindowFrame`, etc. This PR also fixed SPARK-13720 by improving the process of adding extra `SubqueryAlias`(the `RecoverScopingInfo` rule). Before this PR, we update the qualifiers in project list while adding the subquery. However, this is incomplete as we need to update qualifiers in all ancestors that refer attributes here. In this PR, we split `RecoverScopingInfo` into 2 rules: `AddSubQuery` and `UpdateQualifier`. `AddSubQuery` only add subquery if necessary, and `UpdateQualifier` will re-propagate and update qualifiers bottom up. Ideally we should put the bug fix part in an individual PR, but this bug also blocks the window stuff, so I put them together here. Many thanks to gatorsmile for the initial discussion and test cases! ## How was this patch tested? new tests in `LogicalPlanToSQLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11555 from cloud-fan/window.
*	[SPARK-13732][SPARK-13797][SQL] Remove projectList from Window and Eliminate ↵	gatorsmile	2016-03-11	5	-21/+89
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	useless Window #### What changes were proposed in this pull request? `projectList` is useless. Its value is always the same as the child.output. Remove it from the class `Window`. Removal can simplify the codes in Analyzer and Optimizer. This PR is based on the discussion started by cloud-fan in a separate PR: https://github.com/apache/spark/pull/5604#discussion_r55140466 This PR also eliminates useless `Window`. cloud-fan yhuai #### How was this patch tested? Existing test cases cover it. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11565 from gatorsmile/removeProjListWindow.
*	[SPARK-13789] Infer additional constraints from attribute equality	Sameer Agarwal	2016-03-10	2	-0/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR adds support for inferring an additional set of data constraints based on attribute equality. For e.g., if an operator has constraints of the form (`a = 5`, `a = b`), we can now automatically infer an additional constraint of the form `b = 5` ## How was this patch tested? Tested that new constraints are properly inferred for filters (by adding a new test) and equi-joins (by modifying an existing test) Author: Sameer Agarwal <sameer@databricks.com> Closes #11618 from sameeragarwal/infer-isequal-constraints.
*	[SPARK-13244][SQL] Migrates DataFrame to Dataset	Cheng Lian	2016-03-10	3	-11/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`. Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`). There are several noticeable API changes related to those returning arrays: 1. `collect`/`take` - Old APIs in class `DataFrame`: ```scala def collect(): Array[Row] def take(n: Int): Array[Row] ``` - New APIs in class `Dataset[T]`: ```scala def collect(): Array[T] def take(n: Int): Array[T] def collectRows(): Array[Row] def takeRows(n: Int): Array[Row] ``` Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side. Normally, Java users may fall back to `collectAsList` and `takeAsList`. The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here). 1. `randomSplit` - Old APIs in class `DataFrame`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] def randomSplit(weights: Array[Double]): Array[DataFrame] ``` - New APIs in class `Dataset[T]`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]] def randomSplit(weights: Array[Double]): Array[Dataset[T]] ``` Similar problem as above, but hasn't been addressed for Java API yet. We can probably add `randomSplitAsList` to fix this one. 1. `groupBy` Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods. To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`. Other noticeable changes: 1. Dataset always do eager analysis now We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure. However, Dataset encoders requires eager analysi during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`. ## How was this patch tested? Existing tests do the work. ## TODO - [ ] Fix all tests - [ ] Re-enable MiMA check - [ ] Update ScalaDoc (`since`, `group`, and example code) Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Author: Cheng Lian <liancheng@users.noreply.github.com> Closes #11443 from liancheng/ds-to-df.
*	[SPARK-3854][BUILD] Scala style: require spaces before `{`.	Dongjoon Hyun	2016-03-10	4	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule to prevent '){' pattern for the following majority pattern and fixes the code accordingly. If we enforce this in ScalaStyle from now, it will improve the Scala code quality and reduce review time. ``` // Correct: if (true) { println("Wow!") } // Incorrect: if (true){ println("Wow!") } ``` IntelliJ also shows new warnings based on this. ## How was this patch tested? Pass the Jenkins ScalaStyle test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11637 from dongjoon-hyun/SPARK-3854.
*	[SPARK-13790] Speed up ColumnVector's getDecimal	Nong Li	2016-03-10	2	-1/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We should reuse an object similar to the other non-primitive type getters. For a query that computes averages over decimal columns, this shows a 10% speedup on overall query times. ## How was this patch tested? Existing tests and this benchmark ``` TPCDS Snappy: Best/Avg Time(ms) Rate(M/s) Per Row(ns) -------------------------------------------------------------------------------- q27-agg (master) 10627 / 11057 10.8 92.3 q27-agg (this patch) 9722 / 9832 11.8 84.4 ``` Author: Nong Li <nong@databricks.com> Closes #11624 from nongli/spark-13790.
*	[SPARK-13759][SQL] Add IsNotNull constraints for expressions with an inequality	Sameer Agarwal	2016-03-10	3	-4/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR adds support for inferring `IsNotNull` constraints from expressions with an `!==`. More specifically, if an operator has a condition on `a !== b`, we know that both `a` and `b` in the operator output can no longer be null. ## How was this patch tested? 1. Modified a test in `ConstraintPropagationSuite` to test for expressions with an inequality. 2. Added a test in `NullFilteringSuite` for making sure an Inner join with a "non-equal" condition appropriately filters out null from their input. cc nongli Author: Sameer Agarwal <sameer@databricks.com> Closes #11594 from sameeragarwal/isnotequal-constraints.
*	Revert "[SPARK-13760][SQL] Fix BigDecimal constructor for FloatType"	Yin Huai	2016-03-09	1	-1/+1
\| \| \| \|	This reverts commit 926e9c45a21c5b71ef0832d63b8dae7d4f3d8826.
*	[SPARK-13760][SQL] Fix BigDecimal constructor for FloatType	Sameer Agarwal	2016-03-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? A very minor change for using `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: float)`. The latter is deprecated and can result in inconsistencies due to an implicit conversion to `Double`. ## How was this patch tested? N/A cc yhuai Author: Sameer Agarwal <sameer@databricks.com> Closes #11597 from sameeragarwal/bigdecimal.
*	[SPARK-13781][SQL] Use ExpressionSets in ConstraintPropagationSuite	Sameer Agarwal	2016-03-09	1	-25/+25
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR is a small follow up on https://github.com/apache/spark/pull/11338 (https://issues.apache.org/jira/browse/SPARK-13092) to use `ExpressionSet` as part of the verification logic in `ConstraintPropagationSuite`. ## How was this patch tested? No new tests added. Just changes the verification logic in `ConstraintPropagationSuite`. Author: Sameer Agarwal <sameer@databricks.com> Closes #11611 from sameeragarwal/expression-set.
*	[SPARK-13527][SQL] Prune Filters based on Constraints	gatorsmile	2016-03-09	4	-7/+159
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#### What changes were proposed in this pull request? Remove all the deterministic conditions in a [[Filter]] that are contained in the Child's Constraints. For example, the first query can be simplified to the second one. ```scala val queryWithUselessFilter = tr1 .where("tr1.a".attr > 10 \|\| "tr1.c".attr < 10) .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr)) .where( ("tr1.a".attr > 10 \|\| "tr1.c".attr < 10) && 'd.attr < 100 && "tr2.a".attr === "tr1.a".attr) ``` ```scala val query = tr1 .where("tr1.a".attr > 10 \|\| "tr1.c".attr < 10) .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr)) ``` #### How was this patch tested? Six test cases are added. Author: gatorsmile <gatorsmile@gmail.com> Closes #11406 from gatorsmile/FilterRemoval.
*	[SPARK-13523] [SQL] Reuse exchanges in a query	Davies Liu	2016-03-09	3	-55/+72
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? It’s possible to have common parts in a query, for example, self join, it will be good to avoid the duplicated part to same CPUs and memory (Broadcast or cache). Exchange will materialize the underlying RDD by shuffle or collect, it’s a great point to check duplicates and reuse them. Duplicated exchanges means they generate exactly the same result inside a query. In order to find out the duplicated exchanges, we should be able to compare SparkPlan to check that they have same results or not. We already have that for LogicalPlan, so we should move that into QueryPlan to make it available for SparkPlan. Once we can find the duplicated exchanges, we should replace all of them with same SparkPlan object (could be wrapped by ReusedExchage for explain), then the plan tree become a DAG. Since all the planner only work with tree, so this rule should be the last one for the entire planning. After the rule, the plan will looks like: ``` WholeStageCodegen : +- Project [id#0L] : +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None : :- Project [id#0L] : : +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None : : :- Range 0, 1, 4, 1024, [id#0L] : : +- INPUT : +- INPUT :- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L)) : +- WholeStageCodegen : : +- Range 0, 1, 4, 1024, [id#1L] +- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L)) ``` ![bjoin](https://cloud.githubusercontent.com/assets/40902/13414787/209e8c5c-df0a-11e5-8a0f-edff69d89e83.png) For three ways SortMergeJoin, ``` == Physical Plan == WholeStageCodegen : +- Project [id#0L] : +- SortMergeJoin [id#0L], [id#4L], None : :- INPUT : +- INPUT :- WholeStageCodegen : : +- Project [id#0L] : : +- SortMergeJoin [id#0L], [id#3L], None : : :- INPUT : : +- INPUT : :- WholeStageCodegen : : : +- Sort [id#0L ASC], false, 0 : : : +- INPUT : : +- Exchange hashpartitioning(id#0L, 200), None : : +- WholeStageCodegen : : : +- Range 0, 1, 4, 33554432, [id#0L] : +- WholeStageCodegen : : +- Sort [id#3L ASC], false, 0 : : +- INPUT : +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200), None +- WholeStageCodegen : +- Sort [id#4L ASC], false, 0 : +- INPUT +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200), None ``` ![sjoin](https://cloud.githubusercontent.com/assets/40902/13414790/27aea61c-df0a-11e5-8cbf-fbc985c31d95.png) If the same ShuffleExchange or BroadcastExchange, execute()/executeBroadcast() will be called by different parents, they should cached the RDD/Broadcast, return the same one for all the parents. ## How was this patch tested? Added some unit tests for this. Had done some manual tests on TPCDS query Q59 and Q64, we can see some exchanges are re-used (this requires a change in PhysicalRDD to for sameResult, is be done in #11514 ). Author: Davies Liu <davies@databricks.com> Closes #11403 from davies/dedup.
*	[SPARK-13763][SQL] Remove Project when its Child's Output is Nil	gatorsmile	2016-03-09	2	-3/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#### What changes were proposed in this pull request? As shown in another PR: https://github.com/apache/spark/pull/11596, we are using `SELECT 1` as a dummy table, when the table is used for SQL statements in which a table reference is required, but the contents of the table are not important. For example, ```SQL SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value ``` Before the PR, the optimized plan contains a useless `Project` after Optimizer executing the `ColumnPruning` rule, as shown below: ``` == Analyzed Logical Plan == value: int Project [value#22] +- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22] +- SubqueryAlias dummyTable +- Project [1 AS 1#21] +- OneRowRelation$ == Optimized Logical Plan == Generate explode([1,2,3]), false, false, Some(adtable), [value#22] +- Project +- OneRowRelation$ ``` After the fix, the optimized plan removed the useless `Project`, as shown below: ``` == Optimized Logical Plan == Generate explode([1,2,3]), false, false, Some(adtable), [value#22] +- OneRowRelation$ ``` This PR is to remove `Project` when its Child's output is Nil #### How was this patch tested? Added a new unit test case into the suite `ColumnPruningSuite.scala` Author: gatorsmile <gatorsmile@gmail.com> Closes #11599 from gatorsmile/projectOneRowRelation.
*	Revert "[SPARK-13668][SQL] Reorder filter/join predicates to short-circuit ↵	Davies Liu	2016-03-09	1	-23/+1
\| \| \| \| \| \|	isNotNull checks" This reverts commit e430614eae53c8864b31a1dc64db83e27100d1d9.
*	[SPARK-13242] [SQL] codegen fallback in case-when if there many branches	Davies Liu	2016-03-09	3	-4/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? If there are many branches in a CaseWhen expression, the generated code could go above the 64K limit for single java method, will fail to compile. This PR change it to fallback to interpret mode if there are more than 20 branches. This PR is based on #11243 and #11221, thanks to joehalliwell Closes #11243 Closes #11221 ## How was this patch tested? Add a test with 50 branches. Author: Davies Liu <davies@databricks.com> Closes #11592 from davies/fix_when.
*	[SPARK-13698][SQL] Fix Analysis Exceptions when Using Backticks in Generate	Dilip Biswal	2016-03-09	1	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Analysis exception occurs while running the following query. ``` SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints` ``` ``` Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7 'Project ['ints] +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8] +- SubqueryAlias nestedarray +- LocalRelation [a#0], [[[[1,2,3]]]] ``` ## How was this patch tested? Added new unit tests in SQLQuerySuite and HiveQlSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #11538 from dilipbiswal/SPARK-13698.
*	[SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance ↵	Dongjoon Hyun	2016-03-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	creation in Java code. ## What changes were proposed in this pull request? In order to make `docs/examples` (and other related code) more simple/readable/user-friendly, this PR replaces existing codes like the followings by using `diamond` operator. ``` - final ArrayList<Product2<Object, Object>> dataToWrite = - new ArrayList<Product2<Object, Object>>(); + final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>(); ``` Java 7 or higher supports diamond operator which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark Java code use mixed usage of this. ## How was this patch tested? Manual. Pass the existing tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11541 from dongjoon-hyun/SPARK-13702.
*	[SPARK-13640][SQL] Synchronize ScalaReflection.mirror method.	Takuya UESHIN	2016-03-09	2	-2/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `ScalaReflection.mirror` method should be synchronized when scala version is `2.10` because `universe.runtimeMirror` is not thread safe. ## How was this patch tested? I added a test to check thread safety of `ScalaRefection.mirror` method in `ScalaReflectionSuite`, which will throw the following Exception in Scala `2.10` without this patch: ``` [info] - thread safety of mirror * FAILED * (49 milliseconds) [info] java.lang.UnsupportedOperationException: tail of empty list [info] at scala.collection.immutable.Nil$.tail(List.scala:339) [info] at scala.collection.immutable.Nil$.tail(List.scala:334) [info] at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172) [info] at scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477) [info] at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777) [info] at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235) [info] at scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34) [info] at scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61) [info] at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12) [info] at scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12) [info] at org.apache.spark.sql.catalyst.ScalaReflection$.mirror(ScalaReflection.scala:36) [info] at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:256) [info] at org.apache.spark.sql.catalyst.ScalaReflectionSuite$$anonfun$12$$anonfun$apply$mcV$sp$1$$anonfun$apply$1$$anonfun$apply$2.apply(ScalaReflectionSuite.scala:252) [info] at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) [info] at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) [info] at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) [info] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [info] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [info] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [info] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ``` Notice that the test will pass when Scala version is `2.11`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #11487 from ueshin/issues/SPARK-13640.
*	[SPARK-7286][SQL] Deprecate !== in favour of =!=	Jakob Odersky	2016-03-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR replaces #9925 which had issues with CI. Please see the original PR for any previous discussions. ## What changes were proposed in this pull request? Deprecate the SparkSQL column operator !== and use =!= as an alternative. Fixes subtle issues related to operator precedence (basically, !== does not have the same priority as its logical negation, ===). ## How was this patch tested? All currently existing tests. Author: Jakob Odersky <jodersky@gmail.com> Closes #11588 from jodersky/SPARK-7286.
*	[SPARK-13668][SQL] Reorder filter/join predicates to short-circuit isNotNull ↵	Sameer Agarwal	2016-03-08	1	-1/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	checks ## What changes were proposed in this pull request? If a filter predicate or a join condition consists of `IsNotNull` checks, we should reorder these checks such that these non-nullability checks are evaluated before the rest of the predicates. For e.g., if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite this as `isNotNull(b) && a > 5` during physical plan generation. ## How was this patch tested? new unit tests that verify the physical plan for both filters and joins in `ReorderedPredicateSuite` Author: Sameer Agarwal <sameer@databricks.com> Closes #11511 from sameeragarwal/reorder-isnotnull.
*	[SPARK-13400] Stop using deprecated Octal escape literals	Dongjoon Hyun	2016-03-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This removes the remaining deprecated Octal escape literals. The followings are the warnings on those two lines. ``` LiteralExpressionSuite.scala:99: Octal escape literals are deprecated, use \u0000 instead. HiveQlSuite.scala:74: Octal escape literals are deprecated, use \u002c instead. ``` ## How was this patch tested? Manual. During building, there should be no warning on `Octal escape literals`. ``` mvn -DskipTests clean install ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11584 from dongjoon-hyun/SPARK-13400.
*	[SPARK-12727][SQL] support SQL generation for aggregate with multi-distinct	Wenchen Fan	2016-03-08	3	-9/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR add SQL generation support for aggregate with multi-distinct, by simply moving the `DistinctAggregationRewriter` rule to optimizer. More discussions are needed as this breaks an import contract: analyzed plan should be able to run without optimization. However, the `ComputeCurrentTime` rule has kind of broken it already, and I think maybe we should add a new phase for this kind of rules, because strictly speaking they don't belong to analysis and is coupled with the physical plan implementation. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #11579 from cloud-fan/distinct.
*	[SPARK-13657] [SQL] Support parsing very long AND/OR expressions	Davies Liu	2016-03-08	2	-2/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In order to avoid StackOverflow when parse a expression with hundreds of ORs, we should use loop instead of recursive functions to flatten the tree as list. This PR also build a balanced tree to reduce the depth of generated And/Or expression, to avoid StackOverflow in analyzer/optimizer. ## How was this patch tested? Add new unit tests. Manually tested with TPCDS Q3 with hundreds predicates in it [1]. These predicates help to reduce the number of partitions, then the query time went from 60 seconds to 8 seconds. [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql Author: Davies Liu <davies@databricks.com> Closes #11501 from davies/long_or.
*	[SPARK-13637][SQL] use more information to simplify the code in Expand builder	Wenchen Fan	2016-03-08	2	-29/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The code in `Expand.apply` can be simplified by existing information: * the `groupByExprs` parameter are all `Attribute`s * the `child` parameter is a `Project` that append aliased group by expressions to its child's output ## How was this patch tested? by existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11485 from cloud-fan/expand.
*	[SPARK-13404] [SQL] Create variables for input row when it's actually used	Davies Liu	2016-03-07	2	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR change the way how we generate the code for the output variables passing from a plan to it's parent. Right now, they are generated before call consume() of it's parent. It's not efficient, if the parent is a Filter or Join, which could filter out most the rows, the time to access some of the columns that are not used by the Filter or Join are wasted. This PR try to improve this by defering the access of columns until they are actually used by a plan. After this PR, a plan does not need to generate code to evaluate the variables for output, just passing the ExprCode to its parent by `consume()`. In `parent.consumeChild()`, it will check the output from child and `usedInputs`, generate the code for those columns that is part of `usedInputs` before calling `doConsume()`. This PR also change the `if` from ``` if (cond) { xxx } ``` to ``` if (!cond) continue; xxx ``` The new one could help to reduce the nested indents for multiple levels of Filter and BroadcastHashJoin. It also added some comments for operators. ## How was the this patch tested? Unit tests. Manually ran TPCDS Q55, this PR improve the performance about 30% (scale=10, from 2.56s to 1.96s) Author: Davies Liu <davies@databricks.com> Closes #11274 from davies/gen_defer.
*	[SPARK-13689][SQL] Move helper things in CatalystQl to new utils object	Andrew Or	2016-03-07	2	-141/+171
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object. This is based on viirya's changes in #11048. It prefaces the bigger fix for SPARK-13139 to make the diff of that patch smaller. ## How was this patch tested? No change in functionality, so just Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #11529 from andrewor14/parser-utils.