spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-15033][SQL] fix a flaky test in CachedTableSuite	Wenchen Fan	2016-04-30	2	-7/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This is caused by https://github.com/apache/spark/pull/12776, which removes the `synchronized` from all methods in `AccumulatorContext`. However, a test in `CachedTableSuite` synchronize on `AccumulatorContext` and expecting no one else can change it, which is not true anymore. This PR update that test to not require to lock on `AccumulatorContext`. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #12811 from cloud-fan/flaky.
*	[SPARK-14143] Options for parsing NaNs, Infinity and nulls for numeric types	Hossein	2016-04-30	6	-42/+174
\| \| \| \| \| \| \| \| \| \| \| \|	1. Adds the following options for parsing NaNs: nanValue 2. Adds the following options for parsing infinity: positiveInf, negativeInf. `TypeCast.castTo` is unit tested and an end-to-end test is added to `CSVSuite` Author: Hossein <hossein@databricks.com> Closes #11947 from falaki/SPARK-14143.
*	[SPARK-15034][SPARK-15035][SPARK-15036][SQL] Use spark.sql.warehouse.dir as ↵	Yin Huai	2016-04-30	10	-22/+236
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	the warehouse location This PR contains three changes: 1. We will use spark.sql.warehouse.dir set warehouse location. We will not use hive.metastore.warehouse.dir. 2. SessionCatalog needs to set the location to default db. Otherwise, when creating a table in SparkSession without hive support, the default db's path will be an empty string. 3. When we create a database, we need to make the path qualified. Existing tests and new tests Author: Yin Huai <yhuai@databricks.com> Closes #12812 from yhuai/warehouse.
*	[SPARK-15028][SQL] Remove HiveSessionState.setDefaultOverrideConfs	Reynold Xin	2016-04-30	6	-55/+8
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch removes some code that are no longer relevant -- mainly HiveSessionState.setDefaultOverrideConfs. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #12806 from rxin/SPARK-15028.
*	[SPARK-14850][ML] convert primitive array from/to unsafe array directly in ↵	Wenchen Fan	2016-04-29	3	-3/+107
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	VectorUDT/MatrixUDT ## What changes were proposed in this pull request? This PR adds `fromPrimitiveArray` and `toPrimitiveArray` in `UnsafeArrayData`, so that we can do the conversion much faster in VectorUDT/MatrixUDT. ## How was this patch tested? existing tests and new test suite `UnsafeArraySuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12640 from cloud-fan/ml.
*	[SPARK-13667][SQL] Support for specifying custom date format for date and ↵	hyukjinkwon	2016-04-29	8	-66/+173
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	timestamp types at CSV datasource. ## What changes were proposed in this pull request? This PR adds the support to specify custom date format for `DateType` and `TimestampType`. For `TimestampType`, this uses the given format to infer schema and also to convert the values For `DateType`, this uses the given format to convert the values. If the `dateFormat` is not given, then it works with `DateTimeUtils.stringToTime()` for backwords compatibility. When it's given, then it uses `SimpleDateFormat` for parsing data. In addition, `IntegerType`, `DoubleType` and `LongType` have a higher priority than `TimestampType` in type inference. This means even if the given format is `yyyy` or `yyyy.MM`, it will be inferred as `IntegerType` or `DoubleType`. Since it is type inference, I think it is okay to give such precedences. In addition, I renamed `csv.CSVInferSchema` to `csv.InferSchema` as JSON datasource has `json.InferSchema`. Although they have the same names, I did this because I thought the parent package name can still differentiate each. Accordingly, the suite name was also changed from `CSVInferSchemaSuite` to `InferSchemaSuite`. ## How was this patch tested? unit tests are used and `./dev/run_tests` for coding style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11550 from HyukjinKwon/SPARK-13667.
*	[SPARK-14591][SQL] Remove DataTypeParser and add more keywords to the ↵	Yin Huai	2016-04-29	9	-232/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	nonReserved list. ## What changes were proposed in this pull request? CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser. ## How was this patch tested? Existing tests Author: Yin Huai <yhuai@databricks.com> Closes #12796 from yhuai/removeDataTypeParser.
*	[SPARK-14757] [SQL] Fix nullability bug in EqualNullSafe codegen	Reynold Xin	2016-04-29	2	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch fixes a null handling bug in EqualNullSafe's code generation. ## How was this patch tested? Updated unit test so they would fail without the fix. Closes #12628. Author: Reynold Xin <rxin@databricks.com> Author: Arash Nabili <arash@levyx.com> Closes #12799 from rxin/equalnullsafe.
*	[SPARK-14917][SQL] Enable some ORC compressions tests for writing	hyukjinkwon	2016-04-29	1	-29/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14917 As it is described in the JIRA, it seems Hive 1.2.1 which Spark uses now supports snappy and none. So, this PR enables some tests for writing ORC files with compression codes, `SNAPPY` and `NONE`. ## How was this patch tested? Unittests in `OrcQuerySuite` and `sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12699 from HyukjinKwon/SPARK-14917.
*	[SPARK-15012][SQL] Simplify configuration API further	Andrew Or	2016-04-29	15	-154/+104
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? 1. Remove all the `spark.setConf` etc. Just expose `spark.conf` 2. Make `spark.conf` take in things set in the core `SparkConf` as well, otherwise users may get confused This was done for both the Python and Scala APIs. ## How was this patch tested? `SQLConfSuite`, python tests. This one fixes the failed tests in #12787 Closes #12787 Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12798 from yhuai/conf-api.
*	[SPARK-15019][SQL] Propagate all Spark Confs to HiveConf created in ↵	Yin Huai	2016-04-29	7	-40/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	HiveClientImpl ## What changes were proposed in this pull request? This PR makes two changes: 1. We will propagate Spark Confs to HiveConf created in HiveClientImpl. So, users can also use spark conf to set warehouse location and metastore url. 2. In sql/hive, HiveClientImpl will be the only place where we create a new HiveConf. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #12791 from yhuai/onlyUseHiveConfInHiveClientImpl.
*	[SPARK-14858] [SQL] Enable subquery pushdown	Herman van Hovell	2016-04-29	13	-318/+390
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The previous subquery PRs did not include support for pushing subqueries used in filters (`WHERE`/`HAVING`) down. This PR adds this support. For example : ```scala range(0, 10).registerTempTable("a") range(5, 15).registerTempTable("b") range(7, 25).registerTempTable("c") range(3, 12).registerTempTable("d") val plan = sql("select * from a join b on a.id = b.id left join c on c.id = b.id where a.id in (select id from d)") plan.explain(true) ``` Leads to the following Analyzed & Optimized plans: ``` == Parsed Logical Plan == ... == Analyzed Logical Plan == id: bigint, id: bigint, id: bigint Project [id#0L,id#4L,id#8L] +- Filter predicate-subquery#16 [(id#0L = id#12L)] : +- SubqueryAlias predicate-subquery#16 [(id#0L = id#12L)] : +- Project [id#12L] : +- SubqueryAlias d : +- Range 3, 12, 1, 8, [id#12L] +- Join LeftOuter, Some((id#8L = id#4L)) :- Join Inner, Some((id#0L = id#4L)) : :- SubqueryAlias a : : +- Range 0, 10, 1, 8, [id#0L] : +- SubqueryAlias b : +- Range 5, 15, 1, 8, [id#4L] +- SubqueryAlias c +- Range 7, 25, 1, 8, [id#8L] == Optimized Logical Plan == Join LeftOuter, Some((id#8L = id#4L)) :- Join Inner, Some((id#0L = id#4L)) : :- Join LeftSemi, Some((id#0L = id#12L)) : : :- Range 0, 10, 1, 8, [id#0L] : : +- Range 3, 12, 1, 8, [id#12L] : +- Range 5, 15, 1, 8, [id#4L] +- Range 7, 25, 1, 8, [id#8L] == Physical Plan == ... ``` I have also taken the opportunity to move quite a bit of code around: - Rewriting subqueris and pulling out correlated predicated from subqueries has been moved into the analyzer. The analyzer transforms `Exists` and `InSubQuery` into `PredicateSubquery` expressions. A PredicateSubquery exposes the 'join' expressions and the proper references. This makes things like type coercion, optimization and planning easier to do. - I have added support for `Aggregate` plans in subqueries. Any correlated expressions will be added to the grouping expressions. I have removed support for `Union` plans, since pulling in an outer reference from beneath a Union has no value (a filtered value could easily be part of another Union child). - Resolution of subqueries is now done using `OuterReference`s. These are used to wrap any outer reference; this makes the identification of these references easier, and also makes dealing with duplicate attributes in the outer and inner plans easier. The resolution of subqueries initially used a resolution loop which would alternate between calling the analyzer and trying to resolve the outer references. We now use a dedicated analyzer which uses a special rule for outer reference resolution. These changes are a stepping stone for enabling correlated scalar subqueries, enabling all Hive tests & allowing us to use predicate subqueries anywhere. Current tests and added test cases in FilterPushdownSuite. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12720 from hvanhovell/SPARK-14858.
*	[SPARK-14988][PYTHON] SparkSession API follow-ups	Andrew Or	2016-04-29	6	-4/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Addresses comments in #12765. ## How was this patch tested? Python tests. Author: Andrew Or <andrew@databricks.com> Closes #12784 from andrewor14/python-followup.
*	[SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR.	Sun Rui	2016-04-29	6	-9/+179
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame. The function signature is: dapply(df, function(localDF) {}, schema = NULL) R function input: local data.frame from the partition on local node R function output: local data.frame Schema specifies the Row format of the resulting DataFrame. It must match the R function's output. If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply(). ## How was this patch tested? SparkR unit tests. Author: Sun Rui <rui.sun@intel.com> Author: Sun Rui <sunrui2016@gmail.com> Closes #12493 from sun-rui/SPARK-12919.
*	[SPARK-15013][SQL] Remove hiveConf from HiveSessionState	Yin Huai	2016-04-29	2	-30/+1
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The hiveConf in HiveSessionState is not actually used anymore. Let's remove it. ## How was this patch tested? Existing tests Author: Yin Huai <yhuai@databricks.com> Closes #12786 from yhuai/removeHiveConf.
*	[SPARK-14981][SQL] Throws exception if DESC is specified for sorting columns	Cheng Lian	2016-04-29	3	-15/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops sorting directions. This PR fixes this by throwing an exception if `DESC` is specified as sorting direction of a sorting column. ## How was this patch tested? A test case is added to test the invalid sorting order by checking exception message. Author: Cheng Lian <lian@databricks.com> Closes #12759 from liancheng/spark-14981.
*	[SPARK-15004][SQL] Remove zookeeper service discovery code in thrift-server	Reynold Xin	2016-04-29	5	-506/+6
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We recently inlined Hive's thrift server code in SPARK-15004. This patch removes the code related to zookeeper service discovery, Tez, and Hive on Spark, since they are irrelevant. ## How was this patch tested? N/A - removing dead code Author: Reynold Xin <rxin@databricks.com> Closes #12780 from rxin/SPARK-15004.
*	[SPARK-15011][SQL][TEST] Ignore ↵	Yin Huai	2016-04-29	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	org.apache.spark.sql.hive.StatisticsSuite.analyze MetastoreRelation This test always fail with sbt's hadoop 2.3 and 2.4 tests. Let'e disable it for now and investigate the problem. Author: Yin Huai <yhuai@databricks.com> Closes #12783 from yhuai/SPARK-15011-ignore.
*	[SPARK-14988][PYTHON] SparkSession catalog and conf API	Andrew Or	2016-04-29	2	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API. ## How was this patch tested? Python tests. Author: Andrew Or <andrew@databricks.com> Closes #12765 from andrewor14/python-spark-session-more.
*	[SPARK-14987][SQL] inline hive-service (cli) into sql/hive-thriftserver	Davies Liu	2016-04-29	175	-4/+69969
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR copy the thrift-server from hive-service-1.2 (including TCLIService.thrift and generated Java source code) into sql/hive-thriftserver, so we can do further cleanup and improvements. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #12764 from davies/thrift_server.
*	[SPARK-14994][SQL] Remove execution hive from HiveSessionState	Reynold Xin	2016-04-29	20	-309/+327
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch removes executionHive from HiveSessionState and HiveSharedState. ## How was this patch tested? Updated test cases. Author: Reynold Xin <rxin@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12770 from rxin/SPARK-14994.
*	[SPARK-14996][SQL] Add TPCDS Benchmark Queries for SparkSQL	Sameer Agarwal	2016-04-29	1	-0/+1225
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL. ## How was this patch tested? N/A Author: Sameer Agarwal <sameer@databricks.com> Closes #12771 from sameeragarwal/tpcds-2.
*	[SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join	gatorsmile	2016-04-29	12	-111/+132
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#### What changes were proposed in this pull request? Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins). ```SQL SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 ``` Note: 1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL. 2. This rule has to be done after de-duplicating the attributes; otherwise, the enerated join conditions will be incorrect. This PR also corrects the existing behavior in Spark. Before this PR, the behavior is like ```SQL test("except") { val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id") val df_right = Seq(1, 3).toDF("id") checkAnswer( df_left.except(df_right), Row(2) :: Row(2) :: Row(4) :: Nil ) } ``` After this PR, the result is corrected. We strictly follow the SQL compliance of `Except Distinct`. #### How was this patch tested? Modified and added a few test cases to verify the optimization rule and the results of operators. Author: gatorsmile <gatorsmile@gmail.com> Closes #12736 from gatorsmile/exceptByAntiJoin.
*	[HOTFIX] Disable flaky test StatisticsSuite.analyze MetastoreRelations	Reynold Xin	2016-04-29	1	-1/+2
\|
*	[MINOR][DOC] Minor typo fixes	Zheng RuiFeng	2016-04-28	1	-15/+15
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Minor typo fixes ## How was this patch tested? local build Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12755 from zhengruifeng/fix_doc_dataset.
*	[SPARK-14991][SQL] Remove HiveNativeCommand	Reynold Xin	2016-04-28	15	-283/+51
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive. ## How was this patch tested? Updated tests to reflect this. Author: Reynold Xin <rxin@databricks.com> Closes #12769 from rxin/SPARK-14991.
*	[HOTFIX][CORE] fix a concurrence issue in NewAccumulator	Wenchen Fan	2016-04-28	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `AccumulatorContext` is not thread-safe, that's why all of its methods are synchronized. However, there is one exception: the `AccumulatorContext.originals`. `NewAccumulator` use it to check if it's registered, which is wrong as it's not synchronized. This PR mark `AccumulatorContext.originals` as `private` and now all access to `AccumulatorContext` is synchronized. ## How was this patch tested? I verified it locally. To be safe, we can let jenkins test it many times to make sure this problem is gone. Author: Wenchen Fan <wenchen@databricks.com> Closes #12773 from cloud-fan/debug.
*	Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵	Yin Huai	2016-04-28	4	-6/+4
\| \| \| \| \| \|	spark-mllib-local" This reverts commit dae538a4d7c36191c1feb02ba87ffc624ab960dc.
*	[SPARK-14613][ML] Add @Since into the matrix and vector classes in ↵	Pravin Gadakh	2016-04-28	4	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	spark-mllib-local ## What changes were proposed in this pull request? This PR adds `since` tag into the matrix and vector classes in spark-mllib-local. ## How was this patch tested? Scala-style checks passed. Author: Pravin Gadakh <prgadakh@in.ibm.com> Closes #12416 from pravingadakh/SPARK-14613.
*	[SPARK-14970][SQL] Prevent DataSource from enumerates all files in a ↵	Tathagata Das	2016-04-28	1	-10/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	directory if there is user specified schema ## What changes were proposed in this pull request? The FileCatalog object gets created even if the user specifies schema, which means files in the directory is enumerated even thought its not necessary. For large directories this is very slow. User would want to specify schema in such scenarios of large dirs, and this defeats the purpose quite a bit. ## How was this patch tested? Hard to test this with unit test. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12748 from tdas/SPARK-14970.
*	[SPARK-14965][SQL] Indicate an exception is thrown for a missing struct field	Gregory Hart	2016-04-28	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Fix to ScalaDoc for StructType. ## How was this patch tested? Built locally. Author: Gregory Hart <greg.hart@thinkbiganalytics.com> Closes #12758 from freastro/hotfix/SPARK-14965.
*	[SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType ↵	Liang-Chi Hsieh	2016-04-28	4	-4/+224
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	annotation ## What changes were proposed in this pull request? Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes. For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty. We should provide alternative approach to register UDTs for user classes without `SQLUserDefinedType` annotation. ## How was this patch tested? `UserDefinedTypeSuite` Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12259 from viirya/improve-sql-usertype.
*	[SPARK-14654][CORE] New accumulator API	Wenchen Fan	2016-04-28	30	-255/+154
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR introduces a new accumulator API which is much simpler than before: 1. the type hierarchy is simplified, now we only have an `Accumulator` class 2. Combine `initialValue` and `zeroValue` concepts into just one concept: `zeroValue` 3. there in only one `register` method, the accumulator registration and cleanup registration are combined. 4. the `id`,`name` and `countFailedValues` are combined into an `AccumulatorMetadata`, and is provided during registration. `SQLMetric` is a good example to show the simplicity of this new API. What we break: 1. no `setValue` anymore. In the new API, the intermedia type can be different from the result type, it's very hard to implement a general `setValue` 2. accumulator can't be serialized before registered. Problems need to be addressed in follow-ups: 1. with this new API, `AccumulatorInfo` doesn't make a lot of sense, the partial output is not partial updates, we need to expose the intermediate value. 2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics, how about we sending a heartbeat to update internal metrics after the failure event? 3. the public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` don't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` use it to retrieve sql metrics. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #12612 from cloud-fan/acc.
*	[SPARK-14961] Build HashedRelation larger than 1G	Davies Liu	2016-04-27	2	-57/+107
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Currently, LongToUnsafeRowMap use byte array as the underlying page, which can't be larger 1G. This PR improves LongToUnsafeRowMap to scale up to 8G bytes by using array of Long instead of array of byte. ## How was this patch tested? Manually ran a test to confirm that both UnsafeHashedRelation and LongHashedRelation could build a map that larger than 2G. Author: Davies Liu <davies@databricks.com> Closes #12740 from davies/larger_broadcast.
*	[SPARK-12143][SQL] Binary type support for Hive thrift server	hyukjinkwon	2016-04-27	2	-2/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-12143 This PR adds the support for conversion between `SparkRow` in Spark and `RowSet` in Hive for `BinaryType` as `Array[Byte]` (JDBC) ## How was this patch tested? Unittests in `HiveThriftBinaryServerSuite` (regression test) Closes #10139 Author: hyukjinkwon <gurwls223@gmail.com> Closes #12733 from HyukjinKwon/SPARK-12143.
*	[SPARK-14940][SQL] Move ExternalCatalog to own file	Andrew Or	2016-04-27	12	-188/+210
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `interfaces.scala` was getting big. This just moves the biggest class in there to a new file for cleanliness. ## How was this patch tested? Just moving things around. Author: Andrew Or <andrew@databricks.com> Closes #12721 from andrewor14/move-external-catalog.
*	[SPARK-14954] [SQL] Add PARTITION BY and BUCKET BY clause for data source ↵	Cheng Lian	2016-04-27	3	-3/+106
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	CTAS syntax Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL: ``` CREATE TABLE <table-name> USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)] [PARTITIONED BY (col1, col2, ...)] [CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS] AS SELECT ... ``` Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax. Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12734 from liancheng/spark-14954.
*	[SPARK-14664][SQL] Implement DecimalAggregates optimization for Window queries	Dongjoon Hyun	2016-04-27	3	-12/+161
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR aims to implement decimal aggregation optimization for window queries by improving existing `DecimalAggregates`. Historically, `DecimalAggregates` optimizer is designed to transform general `sum/avg(decimal)`, but it breaks recently added windows queries like the followings. The following queries work well without the current `DecimalAggregates` optimizer. Sum ```scala scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").head java.lang.RuntimeException: Unsupported window function: MakeDecimal((sum(UnscaledValue(a#31)),mode=Complete,isDistinct=false),12,1) scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23] : +- INPUT +- Window [MakeDecimal((sum(UnscaledValue(a#21)),mode=Complete,isDistinct=false),12,1) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#21] +- Scan OneRowRelation[] ``` Average ```scala scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").head java.lang.RuntimeException: Unsupported window function: cast(((avg(UnscaledValue(a#40)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44] : +- INPUT +- Window [cast(((avg(UnscaledValue(a#42)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#42] +- Scan OneRowRelation[] ``` After this PR, those queries work fine and new optimized physical plans look like the followings. Sum ```scala scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35] : +- INPUT +- Window [MakeDecimal((sum(UnscaledValue(a#33)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),12,1) AS sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#33] +- Scan OneRowRelation[] ``` Average ```scala scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47] : +- INPUT +- Window [cast(((avg(UnscaledValue(a#45)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) / 10.0) as decimal(6,5)) AS avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#45] +- Scan OneRowRelation[] ``` In this PR, SUM over window pattern matching is based on the code of hvanhovell ; he should be credited for the work he did. ## How was this patch tested? Pass the Jenkins tests (with newly added testcases) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12421 from dongjoon-hyun/SPARK-14664.
*	[SPARK-14874][SQL][STREAMING] Remove the obsolete Batch representation	Liwei Lin	2016-04-27	5	-30/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b) and then became useless. This patch: - removes the `Batch` class - ~~does some related renaming~~ (update: this has been reverted) - fixes some related comments ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #12638 from lw-lin/remove-batch.
*	[SPARK-14950][SQL] Fix BroadcastHashJoin's unique key Anti-Joins	Herman van Hovell	2016-04-27	2	-14/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	### What changes were proposed in this pull request? Anti-Joins using BroadcastHashJoin's unique key code path are broken; it currently returns Semi Join results . This PR fixes this bug. ### How was this patch tested? Added tests cases to `ExistenceJoinSuite`. cc davies gatorsmile Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12730 from hvanhovell/SPARK-14950.
*	[SPARK-14949][SQL] Remove HiveConf dependency from InsertIntoHiveTable	Reynold Xin	2016-04-27	1	-16/+14
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch removes the use of HiveConf from InsertIntoHiveTable. I think this is the last major use of HiveConf and after this we can try to remove the execution HiveConf. ## How was this patch tested? Internal refactoring and should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12728 from rxin/SPARK-14949.
*	[SPARK-14130][SQL] Throw exceptions for ALTER TABLE ADD/REPLACE/CHANGE ↵	Yin Huai	2016-04-27	184	-819/+194
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands ## What changes were proposed in this pull request? This PR will make Spark SQL not allow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands. ## How was this patch tested? Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite. Author: Yin Huai <yhuai@databricks.com> Closes #12714 from yhuai/banNativeCommand.
*	[SPARK-14944][SPARK-14943][SQL] Remove HiveConf from HiveTableScanExec, ↵	Reynold Xin	2016-04-26	6	-46/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	HiveTableReader, and ScriptTransformation ## What changes were proposed in this pull request? This patch removes HiveConf from HiveTableScanExec and HiveTableReader and instead just uses our own configuration system. I'm splitting the large change of removing HiveConf into multiple independent pull requests because it is very difficult to debug test failures when they are all combined in one giant one. ## How was this patch tested? Should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12727 from rxin/SPARK-14944.
*	[SPARK-14913][SQL] Simplify configuration API	Reynold Xin	2016-04-26	42	-671/+368
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We currently expose both Hadoop configuration and Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part, and simply generate Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users. As part of this patch, I also removed some config options deprecated in Spark 1.x. ## How was this patch tested? Updated relevant tests. Author: Reynold Xin <rxin@databricks.com> Closes #12689 from rxin/SPARK-14913.
*	[SPARK-13477][SQL] Expose new user-facing Catalog interface	Andrew Or	2016-04-26	31	-325/+1090
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? #12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface. ## How was this patch tested? See `CatalogSuite`. Author: Andrew Or <andrew@databricks.com> Closes #12713 from andrewor14/user-facing-catalog.
*	[SPARK-14445][SQL] Support native execution of SHOW COLUMNS and SHOW PARTITIONS	Dilip Biswal	2016-04-27	16	-31/+401
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR adds Native execution of SHOW COLUMNS and SHOW PARTITION commands. Command Syntax: ``` SQL SHOW COLUMNS (FROM \| IN) table_identifier [(FROM \| IN) database] ``` ``` SQL SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)] ``` ## How was this patch tested? Added test cases in HiveCommandSuite to verify execution and DDLCommandSuite to verify plans. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #12222 from dilipbiswal/dkb_show_columns.
*	[SPARK-14929] [SQL] Disable vectorized map for wide schemas & high-precision ↵	Sameer Agarwal	2016-04-26	4	-31/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	decimals ## What changes were proposed in this pull request? While the vectorized hash map in `TungstenAggregate` is currently supported for all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that've been verified to show performance improvements on our benchmarks subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use-cases should be expanded over time. ## How was this patch tested? This is no new change in functionality so existing tests should suffice. Performance tests were done on TPCDS benchmarks. Author: Sameer Agarwal <sameer@databricks.com> Closes #12710 from sameeragarwal/vectorized-enable.
*	[SPARK-14853] [SQL] Support LeftSemi/LeftAnti in SortMergeJoinExec	Davies Liu	2016-04-26	10	-175/+194
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR update SortMergeJoinExec to support LeftSemi/LeftAnti, so it could support all the join types, same as other three join implementations: BroadcastHashJoinExec, ShuffledHashJoinExec,and BroadcastNestedLoopJoinExec. This PR also simplify the join selection in SparkStrategy. ## How was this patch tested? Added new tests. Author: Davies Liu <davies@databricks.com> Closes #12668 from davies/smj_semi.
*	[MINOR] Follow-up to #12625	Andrew Or	2016-04-26	2	-2/+2
\| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes. Author: Andrew Or <andrew@databricks.com> Closes #12686 from andrewor14/visibility.
*	[SPARK-14912][SQL] Propagate data source options to Hadoop configuration	Reynold Xin	2016-04-26	11	-52/+94
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We currently have no way for users to propagate options to the underlying library that rely in Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set it. This patch propagates the user-specified options also into Hadoop Configuration. ## How was this patch tested? Used a mock data source implementation to test both the read path and the write path. Author: Reynold Xin <rxin@databricks.com> Closes #12688 from rxin/SPARK-14912.