spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-10747][SQL] Support NULLS FIRST\|LAST clause in ORDER BY	Xin Wu	2016-09-14	8	-26/+91
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Currently, ORDER BY clause returns nulls value according to sorting order (ASC\|DESC), considering null value is always smaller than non-null values. However, SQL2003 standard support NULLS FIRST or NULLS LAST to allow users to specify whether null values should be returned first or last, regardless of sorting order (ASC\|DESC). This PR is to support this new feature. ## How was this patch tested? New test cases are added to test NULLS FIRST\|LAST for regular select queries and windowing queries. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Xin Wu <xinwu@us.ibm.com> Closes #14842 from xwu0226/SPARK-10747.
*	[SPARK-17409][SQL] Do Not Optimize Query in CTAS More Than Once	gatorsmile	2016-09-14	2	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	### What changes were proposed in this pull request? As explained in https://github.com/apache/spark/pull/14797: >Some analyzer rules have assumptions on logical plans, optimizer may break these assumption, we should not pass an optimized query plan into QueryExecution (will be analyzed again), otherwise we may some weird bugs. For example, we have a rule for decimal calculation to promote the precision before binary operations, use PromotePrecision as placeholder to indicate that this rule should not apply twice. But a Optimizer rule will remove this placeholder, that break the assumption, then the rule applied twice, cause wrong result. We should not optimize the query in CTAS more than once. For example, ```Scala spark.range(99, 101).createOrReplaceTempView("tab1") val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1" sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt") checkAnswer(spark.table("tab2"), sql(sqlStmt)) ``` Before this PR, the results do not match ``` == Results == !== Correct Answer - 2 == == Spark Answer - 2 == ![100,100.000000000000000000] [100,null] [99,99.000000000000000000] [99,99.000000000000000000] ``` After this PR, the results match. ``` +---+----------------------+ \|id \|num \| +---+----------------------+ \|99 \|99.000000000000000000 \| \|100\|100.000000000000000000\| +---+----------------------+ ``` In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing CTAS statement. However, we still need to analyze it for normalizing and verifying the CTAS in the Analyzer. Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this rule needs the analyzed plan of the `query`. ### How was this patch tested? Added a test Author: gatorsmile <gatorsmile@gmail.com> Closes #15048 from gatorsmile/ctasOptimized.
*	[SPARK-17530][SQL] Add Statistics into DESCRIBE FORMATTED	gatorsmile	2016-09-14	2	-8/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	### What changes were proposed in this pull request? Statistics is missing in the output of `DESCRIBE FORMATTED`. This PR is to add it. After the PR, the output will be like: ``` +----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+ \|col_name \|data_type \|comment\| +----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+ \|key \|string \|null \| \|value \|string \|null \| \| \| \| \| \|# Detailed Table Information\| \| \| \|Database: \|default \| \| \|Owner: \|xiaoli \| \| \|Create Time: \|Tue Sep 13 14:36:57 PDT 2016 \| \| \|Last Access Time: \|Wed Dec 31 16:00:00 PST 1969 \| \| \|Location: \|file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-9982e1db-df17-4376-a140-dbbee0203d83/texttable\| \| \|Table Type: \|MANAGED \| \| \|Statistics: \|sizeInBytes=5812, rowCount=500, isBroadcastable=false \| \| \|Table Parameters: \| \| \| \| rawDataSize \|-1 \| \| \| numFiles \|1 \| \| \| transient_lastDdlTime \|1473802620 \| \| \| totalSize \|5812 \| \| \| COLUMN_STATS_ACCURATE \|false \| \| \| numRows \|-1 \| \| \| \| \| \| \|# Storage Information \| \| \| \|SerDe Library: \|org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe \| \| \|InputFormat: \|org.apache.hadoop.mapred.TextInputFormat \| \| \|OutputFormat: \|org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat \| \| \|Compressed: \|No \| \| \|Storage Desc Parameters: \| \| \| \| serialization.format \|1 \| \| +----------------------------+----------------------------------------------------------------------------------------------------------------------+-------+ ``` Also improve the output of statistics in `DESCRIBE EXTENDED` by removing duplicate `Statistics`. Below is the example after the PR: ``` +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+ \|col_name \|data_type \|comment\| +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+ \|key \|string \|null \| \|value \|string \|null \| \| \| \| \| \|# Detailed Table Information\|CatalogTable( Table: `default`.`texttable` Owner: xiaoli Created: Tue Sep 13 14:38:43 PDT 2016 Last Access: Wed Dec 31 16:00:00 PST 1969 Type: MANAGED Schema: [StructField(key,StringType,true), StructField(value,StringType,true)] Provider: hive Properties: [rawDataSize=-1, numFiles=1, transient_lastDdlTime=1473802726, totalSize=5812, COLUMN_STATS_ACCURATE=false, numRows=-1] Statistics: sizeInBytes=5812, rowCount=500, isBroadcastable=false Storage(Location: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-8ea5c5a0-5680-4778-91cb-c6334cf8a708/texttable, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [serialization.format=1]))\| \| +----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+ ``` ### How was this patch tested? Manually tested. Author: gatorsmile <gatorsmile@gmail.com> Closes #15083 from gatorsmile/descFormattedStats.
*	[SPARK-17142][SQL] Complex query triggers binding error in HashAggregateExec	jiangxingbo	2016-09-13	2	-8/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In `ReorderAssociativeOperator` rule, we extract foldable expressions with Add/Multiply arithmetics, and replace with eval literal. For example, `(a + 1) + (b + 2)` is optimized to `(a + b + 3)` by this rule. For aggregate operator, output expressions should be derived from groupingExpressions, current implemenation of `ReorderAssociativeOperator` rule may break this promise. A instance could be: ``` SELECT ((t1.a + 1) + (t2.a + 2)) AS out_col FROM testdata2 AS t1 INNER JOIN testdata2 AS t2 ON (t1.a = t2.a) GROUP BY (t1.a + 1), (t2.a + 2) ``` `((t1.a + 1) + (t2.a + 2))` is optimized to `(t1.a + t2.a + 3)`, which could not be derived from `ExpressionSet((t1.a +1), (t2.a + 2))`. Maybe we should improve the rule of `ReorderAssociativeOperator` by adding a GroupingExpressionSet to keep Aggregate.groupingExpressions, and respect these expressions during the optimize stage. ## How was this patch tested? Add new test case in `ReorderAssociativeOperatorSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #14917 from jiangxb1987/rao.
*	[SPARK-17439][SQL] Fixing compression issues with approximate quantiles and ↵	Timothy Hunter	2016-09-11	2	-6/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	adding more tests ## What changes were proposed in this pull request? This PR build on #14976 and fixes a correctness bug that would cause the wrong quantile to be returned for small target errors. ## How was this patch tested? This PR adds 8 unit tests that were failing without the fix. Author: Timothy Hunter <timhunter@databricks.com> Author: Sean Owen <sowen@cloudera.com> Closes #15002 from thunterdb/ml-1783.
*	[SPARK-17405] RowBasedKeyValueBatch should use default page size to prevent OOMs	Eric Liang	2016-09-08	1	-8/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Before this change, we would always allocate 64MB per aggregation task for the first-level hash map storage, even when running in low-memory situations such as local mode. This changes it to use the memory manager default page size, which is automatically reduced from 64MB in these situations. cc ooq JoshRosen ## How was this patch tested? Tested manually with `bin/spark-shell --master=local[32]` and verifying that `(1 to math.pow(10, 3).toInt).toDF("n").withColumn("m", 'n % 2).groupBy('m).agg(sum('n)).show` does not crash. Author: Eric Liang <ekl@databricks.com> Closes #15016 from ericl/sc-4483.
*	[MINOR][SQL] Fixing the typo in unit test	Srinivasa Reddy Vundela	2016-09-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Fixing the typo in the unit test of CodeGenerationSuite.scala ## How was this patch tested? Ran the unit test after fixing the typo and it passes Author: Srinivasa Reddy Vundela <vsr@cloudera.com> Closes #14989 from vundela/typo_fix.
*	[SPARK-17427][SQL] function SIZE should return -1 when parameter is null	Daoyuan Wang	2016-09-07	2	-8/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `select size(null)` returns -1 in Hive. In order to be compatible, we should return `-1`. ## How was this patch tested? unit test in `CollectionFunctionsSuite` and `DataFrameFunctionsSuite`. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #14991 from adrian-wang/size.
*	[SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ↵	Liwei Lin	2016-09-07	4	-13/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ArrayBuffer.append(A) in performance critical paths ## What changes were proposed in this pull request? We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing. ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #14914 from lw-lin/append_to_plus_eq_v2.
*	[SPARK-17296][SQL] Simplify parser join processing.	Herman van Hovell	2016-09-07	4	-58/+102
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Join processing in the parser relies on the fact that the grammar produces a right nested trees, for instance the parse tree for `select * from a join b join c` is expected to produce a tree similar to `JOIN(a, JOIN(b, c))`. However there are cases in which this (invariant) is violated, like: ```sql SELECT COUNT(1) FROM test T1 CROSS JOIN test T2 JOIN test T3 ON T3.col = T1.col JOIN test T4 ON T4.col = T1.col ``` In this case the parser returns a tree in which Joins are located on both the left and the right sides of the parent join node. This PR introduces a different grammar rule which does not make this assumption. The new rule takes a relation and searches for zero or more joined relations. As a bonus processing is much easier. ## How was this patch tested? Existing tests and I have added a regression test to the plan parser suite. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #14867 from hvanhovell/SPARK-17296.
*	[SPARK-17356][SQL] Fix out of memory issue when generating JSON for TreeNode	Sean Zhong	2016-09-06	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? class `org.apache.spark.sql.types.Metadata` is widely used in mllib to store some ml attributes. `Metadata` is commonly stored in `Alias` expression. ``` case class Alias(child: Expression, name: String)( val exprId: ExprId = NamedExpression.newExprId, val qualifier: Option[String] = None, val explicitMetadata: Option[Metadata] = None, override val isGenerated: java.lang.Boolean = false) ``` The `Metadata` can take a big memory footprint since the number of attributes is big ( in scale of million). When `toJSON` is called on `Alias` expression, the `Metadata` will also be converted to a big JSON string. If a plan contains many such kind of `Alias` expressions, it may trigger out of memory error when `toJSON` is called, since converting all `Metadata` references to JSON will take huge memory. With this PR, we will skip scanning Metadata when doing JSON conversion. For a reproducer of the OOM, and analysis, please look at jira https://issues.apache.org/jira/browse/SPARK-17356. ## How was this patch tested? Existing tests. Author: Sean Zhong <seanzhong@databricks.com> Closes #14915 from clockfly/json_oom.
*	[SPARK-17361][SQL] file-based external table without path should not be created	Wenchen Fan	2016-09-06	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Using the public `Catalog` API, users can create a file-based data source table, without giving the path options. For this case, currently we can create the table successfully, but fail when we read it. Ideally we should fail during creation. This is because when we create data source table, we resolve the data source relation without validating path: `resolveRelation(checkPathExist = false)`. Looking back to why we add this trick(`checkPathExist`), it's because when we call `resolveRelation` for managed table, we add the path to data source options but the path is not created yet. So why we add this not-yet-created path to data source options? This PR fix the problem by adding path to options after we call `resolveRelation`. Then we can remove the `checkPathExist` parameter in `DataSource.resolveRelation` and do some related cleanups. ## How was this patch tested? existing tests and new test in `CatalogSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #14921 from cloud-fan/check-path.
*	[SPARK-17279][SQL] better error message for exceptions during ScalaUDF execution	Wenchen Fan	2016-09-06	2	-14/+78
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? If `ScalaUDF` throws exceptions during executing user code, sometimes it's hard for users to figure out what's wrong, especially when they use Spark shell. An example ``` org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 325.0 failed 4 times, most recent failure: Lost task 12.3 in stage 325.0 (TID 35622, 10.0.207.202): java.lang.NullPointerException at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40) at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) ... ``` We should catch these exceptions and rethrow them with better error message, to say that the exception is happened in scala udf. This PR also does some clean up for `ScalaUDF` and add a unit test suite for it. ## How was this patch tested? the new test suite Author: Wenchen Fan <wenchen@databricks.com> Closes #14850 from cloud-fan/npe.
*	[SPARK-17072][SQL] support table-level statistics generation and storing ↵	wangzhenhua	2016-09-05	2	-2/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	into/loading from metastore ## What changes were proposed in this pull request? 1. Support generation table-level statistics for - hive tables in HiveExternalCatalog - data source tables in HiveExternalCatalog - data source tables in InMemoryCatalog. 2. Add a property "catalogStats" in CatalogTable to hold statistics in Spark side. 3. Put logics of statistics transformation between Spark and Hive in HiveClientImpl. 4. Extend Statistics class by adding rowCount (will add estimatedSize when we have column stats). ## How was this patch tested? add unit tests Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #14712 from wzhfy/tableStats.
*	[SPARK-17394][SQL] should not allow specify database in table/view name ↵	Wenchen Fan	2016-09-05	2	-33/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	after RENAME TO ## What changes were proposed in this pull request? It's really weird that we allow users to specify database in both from table name and to table name in `ALTER TABLE RENAME TO`, while logically we can't support rename a table to a different database. Both postgres and MySQL disallow this syntax, it's reasonable to follow them and simply our code. ## How was this patch tested? new test in `DDLCommandSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #14955 from cloud-fan/rename.
*	[SPARK-17308] Improved the spark core code by replacing all pattern match on ↵	Shivansh	2016-09-04	3	-11/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	boolean value by if/else block. ## What changes were proposed in this pull request? Improved the code quality of spark by replacing all pattern match on boolean value by if/else block. ## How was this patch tested? By running the tests Author: Shivansh <shiv4nsh@gmail.com> Closes #14873 from shiv4nsh/SPARK-17308.
*	[SPARK-17324][SQL] Remove Direct Usage of HiveClient in InsertIntoHiveTable	gatorsmile	2016-09-04	3	-8/+44
\| \| \| \| \| \| \| \| \| \| \| \|	### What changes were proposed in this pull request? This is another step to get rid of HiveClient from `HiveSessionState`. All the metastore interactions should be through `ExternalCatalog` interface. However, the existing implementation of `InsertIntoHiveTable ` still requires Hive clients. This PR is to remove HiveClient by moving the metastore interactions into `ExternalCatalog`. ### How was this patch tested? Existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #14888 from gatorsmile/removeClientFromInsertIntoHiveTable.
*	[SPARK-17335][SQL] Fix ArrayType and MapType CatalogString.	Herman van Hovell	2016-09-03	3	-0/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? the `catalogString` for `ArrayType` and `MapType` currently calls the `simpleString` method on its children. This is a problem when the child is a struct, the `struct.simpleString` implementation truncates the number of fields it shows (25 at max). This breaks the generation of a proper `catalogString`, and has shown to cause errors while writing to Hive. This PR fixes this by providing proper `catalogString` implementations for `ArrayData` or `MapData`. ## How was this patch tested? Added testing for `catalogString` to `DataTypeSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #14938 from hvanhovell/SPARK-17335.
*	[SPARK-17298][SQL] Require explicit CROSS join for cartesian products	Srinath Shankar	2016-09-03	16	-53/+169
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the "spark.sql.crossJoin.enabled" configuration flag will disable this check and allow cartesian products without an explicit CROSS join. The new crossJoin DataFrame API must be used to specify explicit cross joins. The existing join(DataFrame) method will produce a INNER join that will require a subsequent join condition. That is df1.join(df2) is equivalent to select * from df1, df2. ## How was this patch tested? Added cross-join.sql to the SQLQueryTestSuite to test the check for cartesian products. Added a couple of tests to the DataFrameJoinSuite to test the crossJoin API. Modified various other test suites to explicitly specify a cross join where an INNER join or a comma-separated list was previously used. Author: Srinath Shankar <srinath@databricks.com> Closes #14866 from srinathshankar/crossjoin.
*	[SPARK-16935][SQL] Verification of Function-related ExternalCatalog APIs	gatorsmile	2016-09-02	3	-28/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	### What changes were proposed in this pull request? Function-related `HiveExternalCatalog` APIs do not have enough verification logics. After the PR, `HiveExternalCatalog` and `InMemoryCatalog` become consistent in the error handling. For example, below is the exception we got when calling `renameFunction`. ``` 15:13:40.369 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db1, returning NoSuchObjectException 15:13:40.377 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database db2, returning NoSuchObjectException 15:13:40.739 ERROR DataNucleus.Datastore.Persist: Update of object "org.apache.hadoop.hive.metastore.model.MFunction205629e9" using statement "UPDATE FUNCS SET FUNC_NAME=? WHERE FUNC_ID=?" failed : org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUEFUNCTION' defined on 'FUNCS'. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) ``` ### How was this patch tested? Improved the existing test cases to check whether the messages are right. Author: gatorsmile <gatorsmile@gmail.com> Closes #14521 from gatorsmile/functionChecking.
*	[SPARK-16525] [SQL] Enable Row Based HashMap in HashAggregateExec	Qifan Pu	2016-09-01	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR is the second step for the following feature: For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap is backed by a `ColumnarBatch`. This has performance issues when we have wide schema for the aggregation table (large number of key fields or value fields). In this JIRA, we support another implementation of fast hashmap, which is backed by a `RowBatch`. We then automatically pick between the two implementations based on certain knobs. In this second-step PR, we enable `RowBasedHashMapGenerator` in `HashAggregateExec`. ## How was this patch tested? Added tests: `RowBasedAggregateHashMapSuite` and ` VectorizedAggregateHashMapSuite` Additional micro-benchmarks tests and TPCDS results will be added in a separate PR in the series. Author: Qifan Pu <qifan.pu@gmail.com> Author: ooq <qifan.pu@gmail.com> Closes #14176 from ooq/rowbasedfastaggmap-pr2.
*	[SPARK-16732][SQL] Remove unused codes in ↵	Yucai Yu	2016-09-01	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	subexpressionEliminationForWholeStageCodegen ## What changes were proposed in this pull request? Some codes in subexpressionEliminationForWholeStageCodegen are never used actually. Remove them using this PR. ## How was this patch tested? Local unit tests. Author: Yucai Yu <yucai.yu@intel.com> Closes #14366 from yucai/subExpr_unused_codes.
*	[SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays	Sean Owen	2016-09-01	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]() ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #14895 from srowen/SPARK-17331.
*	[SPARK-17263][SQL] Add hexadecimal literal parsing	Herman van Hovell	2016-09-01	3	-20/+48
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR adds the ability to parse SQL (hexadecimal) binary literals (AKA bit strings). It follows the following syntax `X'[Hexadecimal Characters]+'`, for example: `X'01AB'` would create a binary the following binary array `0x01AB`. If an uneven number of hexadecimal characters is passed, then the upper 4 bits of the initial byte are kept empty, and the lower 4 bits are filled using the first character. For example `X'1C7'` would create the following binary array `0x01C7`. Binary data (Array[Byte]) does not have a proper `hashCode` and `equals` functions. This meant that comparing `Literal`s containing binary data was a pain. I have updated Literal.hashCode and Literal.equals to deal properly with binary data. ## How was this patch tested? Added tests to the `ExpressionParserSuite`, `SQLQueryTestSuite` and `ExpressionSQLBuilderSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #14832 from hvanhovell/SPARK-17263.
*	[SPARK-17271][SQL] Remove redundant `semanticEquals()` from `SortOrder`	Tejas Patil	2016-09-01	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Removing `semanticEquals()` from `SortOrder` because it can use the `semanticEquals()` provided by its parent class (`Expression`). This was as per suggestion by cloud-fan at https://github.com/apache/spark/pull/14841/files/7192418b3a26a14642fc04fc92bf496a954ffa5d#r77106801 ## How was this patch tested? Ran the test added in https://github.com/apache/spark/pull/14841 Author: Tejas Patil <tejasp@fb.com> Closes #14910 from tejasapatil/SPARK-17271_remove_semantic_ordering.
*	[SPARK-16283][SQL] Implements percentile_approx aggregation function which ↵	Sean Zhong	2016-09-01	3	-0/+661
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	supports partial aggregation. ## What changes were proposed in this pull request? This PR implements aggregation function `percentile_approx`. Function `percentile_approx` returns the approximate percentile(s) of a column at the given percentage(s). A percentile is a watermark value below which a given percentage of the column values fall. For example, the percentile of column `col` at percentage 50% is the median value of column `col`. ### Syntax: ``` # Returns percentile at a given percentage value. The approximation error can be reduced by increasing parameter accuracy, at the cost of memory. percentile_approx(col, percentage [, accuracy]) # Returns percentile value array at given percentage value array percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) ``` ### Features: 1. This function supports partial aggregation. 2. The memory consumption is bounded. The larger `accuracy` parameter we choose, we smaller error we get. The default accuracy value is 10000, to match with Hive default setting. Choose a smaller value for smaller memory footprint. 3. This function supports window function aggregation. ### Example usages: ``` ## Returns the 25th percentile value, with default accuracy SELECT percentile_approx(col, 0.25) FROM table ## Returns an array of percentile value (25th, 50th, 75th), with default accuracy SELECT percentile_approx(col, array(0.25, 0.5, 0.75)) FROM table ## Returns 25th percentile value, with custom accuracy value 100, larger accuracy parameter yields smaller approximation error SELECT percentile_approx(col, 0.25, 100) FROM table ## Returns the 25th, and 50th percentile values, with custom accuracy value 100 SELECT percentile_approx(col, array(0.25, 0.5), 100) FROM table ``` ### NOTE: 1. The `percentile_approx` implementation is different from Hive, so the result returned on same query maybe slightly different with Hive. This implementation uses `QuantileSummaries` as the underlying probabilistic data structure, and mainly follows paper `Space-efficient Online Computation of Quantile Summaries` by Greenwald, Michael and Khanna, Sanjeev. (http://dx.doi.org/10.1145/375663.375670)` 2. The current implementation of `QuantileSummaries` doesn't support automatic compression. This PR has a rule to do compression automatically at the caller side, but it may not be optimal. ## How was this patch tested? Unit test, and Sql query test. ## Acknowledgement 1. This PR's work in based on lw-lin's PR https://github.com/apache/spark/pull/14298, with improvements like supporting partial aggregation, fixing out of memory issue. Author: Sean Zhong <seanzhong@databricks.com> Closes #14868 from clockfly/appro_percentile_try_2.
*	[SPARK-15985][SQL] Eliminate redundant cast from an array without null or a ↵	Kazuaki Ishizaki	2016-08-31	3	-0/+76
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	map without null ## What changes were proposed in this pull request? This PR eliminates redundant cast from an `ArrayType` with `containsNull = false` or a `MapType` with `containsNull = false`. For example, in `ArrayType` case, current implementation leaves a cast `cast(value#63 as array<double>).toDoubleArray`. However, we can eliminate `cast(value#63 as array<double>)` if we know `value#63` does not include `null`. This PR apply this elimination for `ArrayType` and `MapType` in `SimplifyCasts` at a plan optimization phase. In summary, we got 1.2-1.3x performance improvements over the code before applying this PR. Here are performance results of benchmark programs: ``` test("Read array in Dataset") { import sparkSession.implicits._ val iters = 5 val n = 1024 * 1024 val rows = 15 val benchmark = new Benchmark("Read primnitive array", n) val rand = new Random(511) val intDS = sparkSession.sparkContext.parallelize(0 until rows, 1) .map(i => Array.tabulate(n)(i => i)).toDS() intDS.count() // force to create ds val lastElement = n - 1 val randElement = rand.nextInt(lastElement) benchmark.addCase(s"Read int array in Dataset", numIters = iters)(iter => { val idx0 = randElement val idx1 = lastElement intDS.map(a => a(0) + a(idx0) + a(idx1)).collect }) val doubleDS = sparkSession.sparkContext.parallelize(0 until rows, 1) .map(i => Array.tabulate(n)(i => i.toDouble)).toDS() doubleDS.count() // force to create ds benchmark.addCase(s"Read double array in Dataset", numIters = iters)(iter => { val idx0 = randElement val idx1 = lastElement doubleDS.map(a => a(0) + a(idx0) + a(idx1)).collect }) benchmark.run() } Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.4 Intel(R) Core(TM) i5-5257U CPU 2.70GHz without this PR Read primnitive array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Read int array in Dataset 525 / 690 2.0 500.9 1.0X Read double array in Dataset 947 / 1209 1.1 902.7 0.6X with this PR Read primnitive array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Read int array in Dataset 400 / 492 2.6 381.5 1.0X Read double array in Dataset 788 / 870 1.3 751.4 0.5X ``` An example program that originally caused this performance issue. ``` val ds = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)).toDS() val ds2 = ds.map(p => { var s = 0.0 for (i <- 0 to 2) { s += p(i) } s }) ds2.show ds2.explain(true) ``` Plans before this PR ``` == Parsed Logical Plan == 'SerializeFromObject [input[0, double, true] AS value#68] +- 'MapElements <function1>, obj#67: double +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#66: [D +- LocalRelation [value#63] == Analyzed Logical Plan == value: double SerializeFromObject [input[0, double, true] AS value#68] +- MapElements <function1>, obj#67: double +- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D +- LocalRelation [value#63] == Optimized Logical Plan == SerializeFromObject [input[0, double, true] AS value#68] +- MapElements <function1>, obj#67: double +- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D +- LocalRelation [value#63] == Physical Plan == SerializeFromObject [input[0, double, true] AS value#68] +- MapElements <function1>, obj#67: double +- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D +- LocalTableScan [value#63] ``` Plans after this PR ``` == Parsed Logical Plan == 'SerializeFromObject [input[0, double, true] AS value#6] +- 'MapElements <function1>, obj#5: double +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#4: [D +- LocalRelation [value#1] == Analyzed Logical Plan == value: double SerializeFromObject [input[0, double, true] AS value#6] +- MapElements <function1>, obj#5: double +- DeserializeToObject cast(value#1 as array<double>).toDoubleArray, obj#4: [D +- LocalRelation [value#1] == Optimized Logical Plan == SerializeFromObject [input[0, double, true] AS value#6] +- MapElements <function1>, obj#5: double +- DeserializeToObject value#1.toDoubleArray, obj#4: [D +- LocalRelation [value#1] == Physical Plan == SerializeFromObject [input[0, double, true] AS value#6] +- MapElements <function1>, obj#5: double +- DeserializeToObject value#1.toDoubleArray, obj#4: [D +- LocalTableScan [value#1] ``` ## How was this patch tested? Tested by new test cases in `SimplifyCastsSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #13704 from kiszk/SPARK-15985.
*	[SPARK-17234][SQL] Table Existence Checking when Index Table with the Same ↵	gatorsmile	2016-08-30	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Name Exists ### What changes were proposed in this pull request? Hive Index tables are not supported by Spark SQL. Thus, we issue an exception when users try to access Hive Index tables. When the internal function `tableExists` tries to access Hive Index tables, it always gets the same error message: ```Hive index table is not supported```. This message could be confusing to users, since their SQL operations could be completely unrelated to Hive Index tables. For example, when users try to alter a table to a new name and there exists an index table with the same name, the expected exception should be a `TableAlreadyExistsException`. This PR made the following changes: - Introduced a new `AnalysisException` type: `SQLFeatureNotSupportedException`. When users try to access an `Index Table`, we will issue a `SQLFeatureNotSupportedException`. - `tableExists` returns `true` when hitting a `SQLFeatureNotSupportedException` and the feature is `Hive index table`. - Add a checking `requireTableNotExists` for `SessionCatalog`'s `createTable` API; otherwise, the current implementation relies on the Hive's internal checking. ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #14801 from gatorsmile/tableExists.
*	[SPARK-17301][SQL] Remove unused classTag field from AtomicType base class	Josh Rosen	2016-08-30	1	-9/+1
\| \| \| \| \| \| \| \|	There's an unused `classTag` val in the AtomicType base class which is causing unnecessary slowness in deserialization because it needs to grab ScalaReflectionLock and create a new runtime reflection mirror. Removing this unused code gives a small but measurable performance boost in SQL task deserialization. Author: Josh Rosen <joshrosen@databricks.com> Closes #14869 from JoshRosen/remove-unused-classtag.
*	[SPARK-17063] [SQL] Improve performance of MSCK REPAIR TABLE with Hive metastore	Davies Liu	2016-08-29	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR split the the single `createPartitions()` call into smaller batches, which could prevent Hive metastore from OOM (caused by millions of partitions). It will also try to gather all the fast stats (number of files and total size of all files) in parallel to avoid the bottle neck of listing the files in metastore sequential, which is controlled by spark.sql.gatherFastStats (enabled by default). ## How was this patch tested? Tested locally with 10000 partitions and 100 files with embedded metastore, without gathering fast stats in parallel, adding partitions took 153 seconds, after enable that, gathering the fast stats took about 34 seconds, adding these partitions took 25 seconds (most of the time spent in object store), 59 seconds in total, 2.5X faster (with larger cluster, gathering will much faster). Author: Davies Liu <davies@databricks.com> Closes #14607 from davies/repair_batch.
*	[SPARK-17271][SQL] Planner adds un-necessary Sort even if child ordering is ↵	Tejas Patil	2016-08-28	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	semantically same as required ordering ## What changes were proposed in this pull request? Jira : https://issues.apache.org/jira/browse/SPARK-17271 Planner is adding un-needed SORT operation due to bug in the way comparison for `SortOrder` is done at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253 `SortOrder` needs to be compared semantically because `Expression` within two `SortOrder` can be "semantically equal" but not literally equal objects. eg. In case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")` Expression in required SortOrder: ``` AttributeReference( name = "col1", dataType = LongType, nullable = false ) (exprId = exprId, qualifier = Some("a") ) ``` Expression in child SortOrder: ``` AttributeReference( name = "col1", dataType = LongType, nullable = false ) (exprId = exprId) ``` Notice that the output column has a qualifier but the child attribute does not but the inherent expression is the same and hence in this case we can say that the child satisfies the required sort order. This PR includes following changes: - Added a `semanticEquals` method to `SortOrder` so that it can compare underlying child expressions semantically (and not using default Object.equals) - Fixed `EnsureRequirements` to use semantic comparison of SortOrder ## How was this patch tested? - Added a test case to `PlannerSuite`. Ran rest tests in `PlannerSuite` Author: Tejas Patil <tejasp@fb.com> Closes #14841 from tejasapatil/SPARK-17271_sort_order_equals_bug.
*	[SPARK-17274][SQL] Move join optimizer rules into a separate file	Reynold Xin	2016-08-27	2	-106/+134
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As part of breaking Optimizer.scala apart, this patch moves various join rules into a single file. ## How was this patch tested? This should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #14846 from rxin/SPARK-17274.
*	[SPARK-17273][SQL] Move expression optimizer rules into a separate file	Reynold Xin	2016-08-27	2	-460/+507
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As part of breaking Optimizer.scala apart, this patch moves various expression optimization rules into a single file. ## How was this patch tested? This should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #14845 from rxin/SPARK-17273.
*	[SPARK-17272][SQL] Move subquery optimizer rules into its own file	Reynold Xin	2016-08-27	2	-323/+356
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As part of breaking Optimizer.scala apart, this patch moves various subquery rules into a single file. ## How was this patch tested? This should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #14844 from rxin/SPARK-17272.
*	[SPARK-17269][SQL] Move finish analysis optimization stage into its own file	Reynold Xin	2016-08-26	3	-39/+66
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As part of breaking Optimizer.scala apart, this patch moves various finish analysis optimization stage rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports. ## How was this patch tested? This should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #14838 from rxin/SPARK-17269.
*	[SPARK-17270][SQL] Move object optimization rules into its own file	Reynold Xin	2016-08-26	2	-71/+98
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As part of breaking Optimizer.scala apart, this patch moves various Dataset object optimization rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports. ## How was this patch tested? This should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #14839 from rxin/SPARK-17270.
*	[SPARK-17244] Catalyst should not pushdown non-deterministic join conditions	Sameer Agarwal	2016-08-26	2	-7/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that. ## How was this patch tested? A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions. Author: Sameer Agarwal <sameerag@cs.berkeley.edu> Closes #14815 from sameeragarwal/constraint-inputfile.
*	[SPARK-17246][SQL] Add BigDecimal literal	Herman van Hovell	2016-08-26	4	-2/+29
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR adds parser support for `BigDecimal` literals. If you append the suffix `BD` to a valid number then this will be interpreted as a `BigDecimal`, for example `12.0E10BD` will interpreted into a BigDecimal with scale -9 and precision 3. This is useful in situations where you need exact values. ## How was this patch tested? Added tests to `ExpressionParserSuite`, `ExpressionSQLBuilderSuite` and `SQLQueryTestSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #14819 from hvanhovell/SPARK-17246.
*	[SPARK-17187][SQL][FOLLOW-UP] improve document of TypedImperativeAggregate	Wenchen Fan	2016-08-26	1	-40/+61
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? improve the document to make it easier to understand and also mention window operator. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #14822 from cloud-fan/object-agg.
*	[SPARK-17212][SQL] TypeCoercion supports widening conversion between ↵	hyukjinkwon	2016-08-26	2	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	DateType and TimestampType ## What changes were proposed in this pull request? Currently, type-widening does not work between `TimestampType` and `DateType`. This applies to `SetOperation`, `Union`, `In`, `CaseWhen`, `Greatest`, `Leatest`, `CreateArray`, `CreateMap`, `Coalesce`, `NullIf`, `IfNull`, `Nvl` and `Nvl2`, . This PR adds the support for widening `DateType` to `TimestampType` for them. For a simple example, Before ```scala Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", "b").selectExpr("greatest(a, b)").show() ``` shows below: ``` cannot resolve 'greatest(`a`, `b`)' due to data type mismatch: The expressions should all have the same type, got GREATEST(timestamp, date) ``` or union as below: ```scala val a = Seq(Tuple1(new Timestamp(0))).toDF() val b = Seq(Tuple1(new Date(0))).toDF() a.union(b).show() ``` shows below: ``` Union can only be performed on tables with the compatible column types. DateType <> TimestampType at the first column of the second table; ``` After ```scala Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", "b").selectExpr("greatest(a, b)").show() ``` shows below: ``` +----------------------------------------------------+ \|greatest(CAST(a AS TIMESTAMP), CAST(b AS TIMESTAMP))\| +----------------------------------------------------+ \| 1969-12-31 16:00:...\| +----------------------------------------------------+ ``` or union as below: ```scala val a = Seq(Tuple1(new Timestamp(0))).toDF() val b = Seq(Tuple1(new Date(0))).toDF() a.union(b).show() ``` shows below: ``` +--------------------+ \| _1\| +--------------------+ \|1969-12-31 16:00:...\| \|1969-12-31 00:00:...\| +--------------------+ ``` ## How was this patch tested? Unit tests in `TypeCoercionSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Author: HyukjinKwon <gurwls223@gmail.com> Closes #14786 from HyukjinKwon/SPARK-17212.
*	[SPARK-17187][SQL] Supports using arbitrary Java object as internal ↵	Sean Zhong	2016-08-25	1	-0/+141
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	aggregation buffer object ## What changes were proposed in this pull request? This PR introduces an abstract class `TypedImperativeAggregate` so that an aggregation function of TypedImperativeAggregate can use arbitrary user-defined Java object as intermediate aggregation buffer object. This has advantages like: 1. It now can support larger category of aggregation functions. For example, it will be much easier to implement aggregation function `percentile_approx`, which has a complex aggregation buffer definition. 2. It can be used to avoid doing serialization/de-serialization for every call of `update` or `merge` when converting domain specific aggregation object to internal Spark-Sql storage format. 3. It is easier to integrate with other existing monoid libraries like algebird, and supports more aggregation functions with high performance. Please see `org.apache.spark.sql.TypedImperativeAggregateSuite.TypedMaxAggregate` to find an example of how to defined a `TypedImperativeAggregate` aggregation function. Please see Java doc of `TypedImperativeAggregate` and Jira ticket SPARK-17187 for more information. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzhong@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #14753 from clockfly/object_aggregation_buffer_try_2.
*	[SPARK-17205] Literal.sql should handle Infinity and NaN	Josh Rosen	2016-08-26	1	-2/+15
\| \| \| \| \| \| \| \|	This patch updates `Literal.sql` to properly generate SQL for `NaN` and `Infinity` float and double literals: these special values need to be handled differently from regular values, since simply appending a suffix to the value's `toString()` representation will not work for these values. Author: Josh Rosen <joshrosen@databricks.com> Closes #14777 from JoshRosen/SPARK-17205.
*	[SPARK-16991][SPARK-17099][SPARK-17120][SQL] Fix Outer Join Elimination when ↵	gatorsmile	2016-08-25	2	-12/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Filter's isNotNull Constraints Unable to Filter Out All Null-supplying Rows ### What changes were proposed in this pull request? This PR is to fix an incorrect outer join elimination when filter's `isNotNull` constraints is unable to filter out all null-supplying rows. For example, `isnotnull(coalesce(b#227, c#238))`. Users can hit this error when they try to use `using/natural outer join`, which is converted to a normal outer join with a `coalesce` expression on the `using columns`. For example, ```Scala val a = Seq((1, 2), (2, 3)).toDF("a", "b") val b = Seq((2, 5), (3, 4)).toDF("a", "c") val c = Seq((3, 1)).toDF("a", "d") val ab = a.join(b, Seq("a"), "fullouter") ab.join(c, "a").explain(true) ``` The dataframe `ab` is doing `using full-outer join`, which is converted to a normal outer join with a `coalesce` expression. Constraints inference generates a `Filter` with constraints `isnotnull(coalesce(b#227, c#238))`. Then, it triggers a wrong outer join elimination and generates a wrong result. ``` Project [a#251, b#227, c#237, d#247] +- Join Inner, (a#251 = a#246) :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237] : +- Join FullOuter, (a#226 = a#236) : :- Project [_1#223 AS a#226, _2#224 AS b#227] : : +- LocalRelation [_1#223, _2#224] : +- Project [_1#233 AS a#236, _2#234 AS c#237] : +- LocalRelation [_1#233, _2#234] +- Project [_1#243 AS a#246, _2#244 AS d#247] +- LocalRelation [_1#243, _2#244] == Optimized Logical Plan == Project [a#251, b#227, c#237, d#247] +- Join Inner, (a#251 = a#246) :- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237] : +- Filter isnotnull(coalesce(a#226, a#236)) : +- Join FullOuter, (a#226 = a#236) : :- LocalRelation [a#226, b#227] : +- LocalRelation [a#236, c#237] +- LocalRelation [a#246, d#247] ``` A note to the `Committer`, please also give the credit to dongjoon-hyun who submitted another PR for fixing this issue. https://github.com/apache/spark/pull/14580 ### How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #14661 from gatorsmile/fixOuterJoinElimination.
*	[SPARK-17061][SPARK-17093][SQL] MapObjects` should make copies of ↵	Liwei Lin	2016-08-25	3	-2/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	unsafe-backed data ## What changes were proposed in this pull request? Currently `MapObjects` does not make copies of unsafe-backed data, leading to problems like [SPARK-17061](https://issues.apache.org/jira/browse/SPARK-17061) [SPARK-17093](https://issues.apache.org/jira/browse/SPARK-17093). This patch makes `MapObjects` make copies of unsafe-backed data. Generated code - prior to this patch: ```java ... /* 295 / if (isNull12) { / 296 / convertedArray1[loopIndex1] = null; / 297 / } else { / 298 / convertedArray1[loopIndex1] = value12; / 299 / } ... ``` Generated code - after this patch: ```java ... / 295 / if (isNull12) { / 296 / convertedArray1[loopIndex1] = null; / 297 / } else { / 298 / convertedArray1[loopIndex1] = value12 instanceof UnsafeRow? value12.copy() : value12; / 299 */ } ... ``` ## How was this patch tested? Add a new test case which would fail without this patch. Author: Liwei Lin <lwlin7@gmail.com> Closes #14698 from lw-lin/mapobjects-copy.
*	[SPARK-17190][SQL] Removal of HiveSharedState	gatorsmile	2016-08-25	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	### What changes were proposed in this pull request? Since `HiveClient` is used to interact with the Hive metastore, it should be hidden in `HiveExternalCatalog`. After moving `HiveClient` into `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes straightforward. After removal of `HiveSharedState`, the reflection logic is directly applied on the choice of `ExternalCatalog` types, based on the configuration of `CATALOG_IMPLEMENTATION`. ~~`HiveClient` is also used/invoked by the other entities besides HiveExternalCatalog, we defines the following two APIs: getClient and getNewClient~~ ### How was this patch tested? The existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #14757 from gatorsmile/removeHiveClient.
*	[SPARK-17228][SQL] Not infer/propagate non-deterministic constraints	Sameer Agarwal	2016-08-24	2	-1/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inferring logic by simply ignoring them. ## How was this patch tested? Added a new test in `ConstraintPropagationSuite`. Author: Sameer Agarwal <sameerag@cs.berkeley.edu> Closes #14795 from sameeragarwal/deterministic-constraints.
*	[SPARK-16983][SQL] Add `prettyName` for row_number, dense_rank, ↵	Dongjoon Hyun	2016-08-24	1	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	percent_rank, cume_dist ## What changes were proposed in this pull request? Currently, two-word window functions like `row_number`, `dense_rank`, `percent_rank`, and `cume_dist` are expressed without `_` in error messages. We had better show the correct names. Before ```scala scala> sql("select row_number()").show java.lang.UnsupportedOperationException: Cannot evaluate expression: rownumber() ``` After ```scala scala> sql("select row_number()").show java.lang.UnsupportedOperationException: Cannot evaluate expression: row_number() ``` ## How was this patch tested? Pass the Jenkins and manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14571 from dongjoon-hyun/SPARK-16983.
*	[SPARK-17186][SQL] remove catalog table type INDEX	Wenchen Fan	2016-08-23	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Actually Spark SQL doesn't support index, the catalog table type `INDEX` is from Hive. However, most operations in Spark SQL can't handle index table, e.g. create table, alter table, etc. Logically index table should be invisible to end users, and Hive also generates special table name for index table to avoid users accessing it directly. Hive has special SQL syntax to create/show/drop index tables. At Spark SQL side, although we can describe index table directly, but the result is unreadable, we should use the dedicated SQL syntax to do it(e.g. `SHOW INDEX ON tbl`). Spark SQL can also read index table directly, but the result is always empty.(Can hive read index table directly?) This PR remove the table type `INDEX`, to make it clear that Spark SQL doesn't support index currently. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #14752 from cloud-fan/minor2.
*	[SPARK-17194] Use single quotes when generating SQL for string literals	Josh Rosen	2016-08-23	1	-2/+2
\| \| \| \| \| \| \| \|	When Spark emits SQL for a string literal, it should wrap the string in single quotes, not double quotes. Databases which adhere more strictly to the ANSI SQL standards, such as Postgres, allow only single-quotes to be used for denoting string literals (see http://stackoverflow.com/a/1992331/590203). Author: Josh Rosen <joshrosen@databricks.com> Closes #14763 from JoshRosen/SPARK-17194.
*	[SPARK-17199] Use CatalystConf.resolver for case-sensitivity comparison	Jacek Laskowski	2016-08-23	1	-7/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Use `CatalystConf.resolver` consistently for case-sensitivity comparison (removed dups). ## How was this patch tested? Local build. Waiting for Jenkins to ensure clean build and test. Author: Jacek Laskowski <jacek@japila.pl> Closes #14771 from jaceklaskowski/17199-catalystconf-resolver.