spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-13643][SQL] Implement SparkSession	Andrew Or	2016-04-21	5	-197/+961
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? After removing most of `HiveContext` in 8fc267ab3322e46db81e725a5cb1adb5a71b2b4d we can now move existing functionality in `SQLContext` to `SparkSession`. As of this PR `SQLContext` becomes a simple wrapper that has a `SparkSession` and delegates all functionality to it. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #12553 from andrewor14/implement-spark-session.
*	[SPARK-14801][SQL] Move MetastoreRelation to its own file	Reynold Xin	2016-04-21	2	-205/+232
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This class is currently in HiveMetastoreCatalog.scala, which is a large file that makes refactoring and searching of usage difficult. Moving it out so I can then do SPARK-14799 and make the review of that simpler. ## How was this patch tested? N/A - this is a straightforward move and should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12567 from rxin/SPARK-14801.
*	[SPARK-14795][SQL] Remove the use of Hive's variable substitution	Reynold Xin	2016-04-21	3	-11/+8
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch builds on #12556 and completely removes the use of Hive's variable substitution. ## How was this patch tested? Covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12561 from rxin/SPARK-14795.
*	[SPARK-14799][SQL] Remove MetastoreRelation dependency from AnalyzeTable - ↵	Reynold Xin	2016-04-21	1	-26/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	part 1 ## What changes were proposed in this pull request? This patch isolates AnalyzeTable's dependency on MetastoreRelation into a single line. After this we can work on converging MetastoreRelation and CatalogTable. ## How was this patch tested? Covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12566 from rxin/SPARK-14799.
*	[SPARK-14783] Preserve full exception stacktrace in IsolatedClientLoader	Josh Rosen	2016-04-21	1	-1/+1
\| \| \| \| \| \| \| \|	In IsolatedClientLoader, we have a`catch` block which throws an exception without wrapping the original exception, causing the full exception stacktrace and any nested exceptions to be lost. This patch fixes this, improving the usefulness of classloading error messages. Author: Josh Rosen <joshrosen@databricks.com> Closes #12548 from JoshRosen/improve-logging-for-hive-classloader-issues.
*	[SPARK-14797][BUILD] Spark SQL POM should not hardcode spark-sketch_2.11 dep.	Josh Rosen	2016-04-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Spark SQL's POM hardcodes a dependency on `spark-sketch_2.11`, which causes Scala 2.10 builds to include the `_2.11` dependency. This is harmless since `spark-sketch` is a pure-Java module (see #12334 for a discussion of dropping the Scala version suffixes from these modules' artifactIds), but it's confusing to people looking at the published POMs. This patch fixes this by using `${scala.binary.version}` to substitute the correct suffix, and also adds a set of Maven Enforcer rules to ensure that `_2.11` artifacts are not used in 2.10 builds (and vice-versa). /cc ahirreddy, who spotted this issue. Author: Josh Rosen <joshrosen@databricks.com> Closes #12563 from JoshRosen/fix-sketch-scala-version.
*	[HOTFIX] Remove wrong DDL tests	Liang-Chi Hsieh	2016-04-21	1	-13/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? As we moved most parsing rules to `SparkSqlParser`, some tests expected to throw exception are not correct anymore. ## How was this patch tested? `DDLCommandSuite` Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12572 from viirya/hotfix-ddl.
*	[SPARK-14753][CORE] remove internal flag in Accumulable	Wenchen Fan	2016-04-21	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? the `Accumulable.internal` flag is only used to avoid registering internal accumulators for 2 certain cases: 1. `TaskMetrics.createTempShuffleReadMetrics`: the accumulators in the temp shuffle read metrics should not be registered. 2. `TaskMetrics.fromAccumulatorUpdates`: the created task metrics is only used to post event, accumulators inside it should not be registered. For 1, we can create a `TempShuffleReadMetrics` that don't create accumulators, just keep the data and merge it at last. For 2, we can un-register these accumulators immediately. TODO: remove `internal` flag in `AccumulableInfo` with followup PR ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12525 from cloud-fan/acc.
*	[SPARK-14794][SQL] Don't pass analyze command into Hive	Reynold Xin	2016-04-21	2	-6/+8
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We shouldn't pass analyze command to Hive because some of those would require running MapReduce jobs. For now, let's just always run the no scan analyze. ## How was this patch tested? Updated test case to reflect this change. Author: Reynold Xin <rxin@databricks.com> Closes #12558 from rxin/parser-analyze.
*	[HOTFIX] Disable flaky tests	Reynold Xin	2016-04-21	1	-2/+2
\|
*	[SPARK-14792][SQL] Move as many parsing rules as possible into SQL parser	Reynold Xin	2016-04-21	14	-489/+568
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch moves as many parsing rules as possible into SQL parser. There are only three more left after this patch: (1) run native command, (2) analyze, and (3) script IO. These 3 will be dealt with in a follow-up PR. ## How was this patch tested? No test change. This simply moves code around. Author: Reynold Xin <rxin@databricks.com> Closes #12556 from rxin/SPARK-14792.
*	[SPARK-14786] Remove hive-cli dependency from hive subproject	Josh Rosen	2016-04-20	3	-7/+33
\| \| \| \| \| \| \| \| \| \| \| \|	The `hive` subproject currently depends on `hive-cli` in order to perform a check to see whether a `SessionState` is an instance of `org.apache.hadoop.hive.cli.CliSessionState` (see #9589). The introduction of this `hive-cli` dependency has caused problems for users whose Hive metastore JAR classpaths don't include the `hive-cli` classes (such as in #11495). This patch removes this dependency on `hive-cli` and replaces the `isInstanceOf` check by reflection. I added a Maven Enforcer rule to ban `hive-cli` from the `hive` subproject in order to make sure that this dependency is not accidentally reintroduced. /cc rxin yhuai adrian-wang preecet Author: Josh Rosen <joshrosen@databricks.com> Closes #12551 from JoshRosen/remove-hive-cli-dep-from-hive-subproject.
*	[SPARK-14782][SPARK-14778][SQL] Remove HiveConf dependency from ↵	Reynold Xin	2016-04-20	4	-40/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	HiveSqlAstBuilder ## What changes were proposed in this pull request? The patch removes HiveConf dependency from HiveSqlAstBuilder. This is required in order to merge HiveSqlParser and SparkSqlAstBuilder, which would require getting rid of the Hive specific dependencies in HiveSqlParser. This patch also accomplishes [SPARK-14778] Remove HiveSessionState.substitutor. ## How was this patch tested? This should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12550 from rxin/SPARK-14782.
*	[SPARK-14775][SQL] Remove TestHiveSparkSession.rewritePaths	Reynold Xin	2016-04-20	4	-22/+17
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The path rewrite in TestHiveSparkSession is pretty hacky. I think we can remove those complexity and just do a string replacement when we read the query files in. This would remove the overloading of runNativeSql in TestHive, which will simplify the removal of Hive specific variable substitution. ## How was this patch tested? This is a small test refactoring to simplify test infrastructure. Author: Reynold Xin <rxin@databricks.com> Closes #12543 from rxin/SPARK-14775.
*	[SPARK-14769][SQL] Create built-in functionality for variable substitution	Reynold Xin	2016-04-20	3	-0/+215
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? In order to fully merge the Hive parser and the SQL parser, we'd need to support variable substitution in Spark. The implementation of the substitute algorithm is mostly copied from Hive, but I simplified the overall structure quite a bit and added more comprehensive test coverage. Note that this pull request does not yet use this functionality anywhere. ## How was this patch tested? Added VariableSubstitutionSuite for unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12538 from rxin/SPARK-14769.
*	[SPARK-14770][SQL] Remove unused queries in hive module test resources	Reynold Xin	2016-04-20	690	-5352/+0
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We currently have five folders in queries: clientcompare, clientnegative, clientpositive, negative, and positive. Only clientpositive is used. We can remove the rest. ## How was this patch tested? N/A - removing unused test resources. Author: Reynold Xin <rxin@databricks.com> Closes #12540 from rxin/SPARK-14770.
*	[SPARK-14749][SQL, TESTS] PlannerSuite failed when it run individually	Subhobrata Dey	2016-04-20	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? 3 testcases namely, ``` "count is partially aggregated" "count distinct is partially aggregated" "mixed aggregates are partially aggregated" ``` were failing when running PlannerSuite individually. The PR provides a fix for this. ## How was this patch tested? unit tests (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Subhobrata Dey <sbcd90@gmail.com> Closes #12532 from sbcd90/plannersuitetestsfix.
*	[SPARK-14678][SQL] Add a file sink log to support versioning and compaction	Shixiong Zhu	2016-04-20	6	-27/+616
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR adds a special log for FileStreamSink for two purposes: - Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink. - Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files. FileStreamSinkLog has a new log format instead of Java serialization format. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON format of FileLog. FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compact, it will read all history logs and merge them with the new batch. During the compaction, it will also delete the files that are deleted (marked by FileLog.action). When the reader uses allLogs to list all files, this method only returns the visible files (drops the deleted files). ## How was this patch tested? FileStreamSinkLogSuite Author: Shixiong Zhu <shixiong@databricks.com> Closes #12435 from zsxwing/sink-log.
*	[SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState ↵	Andrew Or	2016-04-20	42	-547/+790
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	and Create a SparkSession class ## What changes were proposed in this pull request? This PR has two main changes. 1. Move Hive-specific methods from HiveContext to HiveSessionState, which help the work of removing HiveContext. 2. Create a SparkSession Class, which will later be the entry point of Spark SQL users. ## How was this patch tested? Existing tests This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485. Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12522 from yhuai/spark-session.
*	[SPARK-14741][SQL] Fixed error in reading json file stream inside a ↵	Tathagata Das	2016-04-20	2	-1/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	partitioned directory ## What changes were proposed in this pull request? Consider the following directory structure dir/col=X/some-files If we create a text format streaming dataframe on `dir/col=X/` then it should not consider as partitioning in columns. Even though the streaming dataframe does not do so, the generated batch dataframes pick up col as a partitioning columns, causing mismatch streaming source schema and generated df schema. This leads to runtime failure: ``` 18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8 ``` The reason is that the partition inferring code has no idea of a base path, above which it should not search of partitions. This PR makes sure that the batch DF is generated with the basePath set as the original path on which the file stream source is defined. ## How was this patch tested? New unit test Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12517 from tdas/SPARK-14741.
*	[SPARK-14555] First cut of Python API for Structured Streaming	Burak Yavuz	2016-04-20	3	-19/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes: - ContinuousQuery - Trigger - ProcessingTime in pyspark under `pyspark.sql.streaming`. In addition, it contains the new methods added under: - `DataFrameWriter` a) `startStream` b) `trigger` c) `queryName` - `DataFrameReader` a) `stream` - `DataFrame` a) `isStreaming` This PR doesn't contain all methods exposed for `ContinuousQuery`, for example: - `exception` - `sourceStatuses` - `sinkStatus` They may be added in a follow up. This PR also contains some very minor doc fixes in the Scala side. ## How was this patch tested? Python doc tests TODO: - [ ] verify Python docs look good Author: Burak Yavuz <brkyvz@gmail.com> Author: Burak Yavuz <burak@databricks.com> Closes #12320 from brkyvz/stream-python.
*	[SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of ↵	Liwei Lin	2016-04-20	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	call FileSystem.get(conf) ## What changes were proposed in this pull request? - replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)` ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #12450 from lw-lin/fix-fs-get.
*	[SPARK-9013][SQL] generate MutableProjection directly instead of return a ↵	Wenchen Fan	2016-04-20	14	-39/+35
\| \| \| \| \| \| \| \| \| \| \| \|	function `MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the costs of generating code when we need same but individual mutable projections. However, I only found one place that use this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and comparing to the troubles it brings, I think we should generate `MutableProjection` directly instead of return a function. Author: Wenchen Fan <wenchen@databricks.com> Closes #7373 from cloud-fan/project.
*	[MINOR] [SQL] Re-enable `explode()` and `json_tuple()` testcases in ↵	Dongjoon Hyun	2016-04-19	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ExpressionToSQLSuite ## What changes were proposed in this pull request? Since [SPARK-12719: SQL Generation supports for generators](https://issues.apache.org/jira/browse/SPARK-12719) was resolved, this PR enables the related testcases: `explode()` and `json_tuple()`. ## How was this patch tested? Pass the Jenkins tests (with re-enabled test cases). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12329 from dongjoon-hyun/minor_enable_testcases.
*	[SPARK-14600] [SQL] Push predicates through Expand	Wenchen Fan	2016-04-19	5	-14/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14600 This PR makes `Expand.output` have different attributes from the grouping attributes produced by the underlying `Project`, as they have different meaning, so that we can safely push down filter through `Expand` ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12496 from cloud-fan/expand.
*	[SPARK-14704][CORE] create accumulators in TaskMetrics	Wenchen Fan	2016-04-19	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Before this PR, we create accumulators at driver side(and register them) and send them to executor side, then we create `TaskMetrics` with these accumulators at executor side. After this PR, we will create `TaskMetrics` at driver side and send it to executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12472 from cloud-fan/acc.
*	[SPARK-13419] [SQL] Update SubquerySuite to use checkAnswer for validation	Luciano Resende	2016-04-19	1	-28/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Change SubquerySuite to validate test results utilizing checkAnswer helper method ## How was this patch tested? Existing tests Author: Luciano Resende <lresende@apache.org> Closes #12269 from lresende/SPARK-13419.
*	[SPARK-13929] Use Scala reflection for UDTs	Joan	2016-04-19	7	-114/+157
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Enable ScalaReflection and User Defined Types for plain Scala classes. This involves the move of `schemaFor` from `ScalaReflection` trait (which is Runtime and Compile time (macros) reflection) to the `ScalaReflection` object (runtime reflection only) as I believe this code wouldn't work at compile time anyway as it manipulates `Class`'s that are not compiled yet. ## How was this patch tested? Unit test Author: Joan <joan@goyeau.com> Closes #12149 from joan38/SPARK-13929-Scala-reflection.
*	[SPARK-14407][SQL] Hides HadoopFsRelation related data source API into ↵	Cheng Lian	2016-04-19	23	-548/+561
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	execution/datasources package #12178 ## What changes were proposed in this pull request? This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package. Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.
*	[SPARK-4226] [SQL] Support IN/EXISTS Subqueries	Herman van Hovell	2016-04-19	12	-42/+476
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	### What changes were proposed in this pull request? This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms: - `[NOT] EXISTS(subquery)` - `[NOT] IN (subquery)` This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did. ### How was this patch tested? Modified parsing unit tests. Added tests to `org.apache.spark.sql.SQLQuerySuite` cc rxin, davies & chenghao-intel Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12306 from hvanhovell/SPARK-4226.
*	[SPARK-14675][SQL] ClassFormatError when use Seq as Aggregator buffer type	Wenchen Fan	2016-04-19	3	-4/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? After https://github.com/apache/spark/pull/12067, we now use expressions to do the aggregation in `TypedAggregateExpression`. To implement buffer merge, we produce a new buffer deserializer expression by replacing `AttributeReference` with right-side buffer attribute, like other `DeclarativeAggregate`s do, and finally combine the left and right buffer deserializer with `Invoke`. However, after https://github.com/apache/spark/pull/12338, we will add loop variable to class members when codegen `MapObjects`. If the `Aggregator` buffer type is `Seq`, which is implemented by `MapObjects` expression, we will add the same loop variable to class members twice(by left and right buffer deserializer), which cause the `ClassFormatError`. This PR fixes this issue by calling `distinct` before declare the class menbers. ## How was this patch tested? new regression test in `DatasetAggregatorSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12468 from cloud-fan/bug.
*	[SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture ↵	Josh Rosen	2016-04-19	5	-15/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	full stacktrace When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread. This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`. I tested this manually using https://github.com/JoshRosen/spark/commit/16b31c825197ee31a50214c6ba3c1df08148f403, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR. /cc rxin nongli yhuai anabranch Author: Josh Rosen <joshrosen@databricks.com> Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.
*	[SPARK-12457] Fixed the Wrong Description and Missing Example in Collection ↵	gatorsmile	2016-04-19	2	-6/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Functions #### What changes were proposed in this pull request? https://github.com/apache/spark/pull/12185 contains the original PR I submitted in https://github.com/apache/spark/pull/10418 However, it misses one of the extended example, a wrong description and a few typos for collection functions. This PR is fix all these issues. #### How was this patch tested? The existing test cases already cover it. Author: gatorsmile <gatorsmile@gmail.com> Closes #12492 from gatorsmile/expressionUpdate.
*	[SPARK-14491] [SQL] refactor object operator framework to make it easy to ↵	Wenchen Fan	2016-04-19	11	-217/+254
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	eliminate serializations ## What changes were proposed in this pull request? This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer. Typed aggregate related operators are special, they will deserialize the input row to multiple objects and it's difficult to simply use a deserializer operator to abstract it, so we still mix the deserialization logic there. ## How was this patch tested? existing tests and new test in `EliminateSerializationSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12260 from cloud-fan/encoder.
*	[SPARK-13681][SPARK-14458][SPARK-14566][SQL] Add back once removed ↵	Cheng Lian	2016-04-19	7	-8/+403
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	CommitFailureTestRelationSuite and SimpleTextHadoopFsRelationSuite ## What changes were proposed in this pull request? These test suites were removed while refactoring `HadoopFsRelation` related API. This PR brings them back. This PR also fixes two regressions: - SPARK-14458, which causes runtime error when saving partitioned tables using `FileFormat` data sources that are not able to infer their own schemata. This bug wasn't detected by any built-in data sources because all of them happen to have schema inference feature. - SPARK-14566, which happens to be covered by SPARK-14458 and causes wrong query result or runtime error when - appending a Dataset `ds` to a persisted partitioned data source relation `t`, and - partition columns in `ds` don't all appear after data columns ## How was this patch tested? `CommitFailureTestRelationSuite` uses a testing relation that always fails when committing write tasks to test write job cleanup. `SimpleTextHadoopFsRelationSuite` uses a testing relation to test general `HadoopFsRelation` and `FileFormat` interfaces. The two regressions are both covered by existing test cases. Author: Cheng Lian <lian@databricks.com> Closes #12179 from liancheng/spark-13681-commit-failure-test.
*	[SPARK-14577][SQL] Add spark.sql.codegen.maxCaseBranches config option	Dongjoon Hyun	2016-04-19	6	-35/+180
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We currently disable codegen for `CaseWhen` if the number of branches is greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better if this value is a non-public config defined in SQLConf. ## How was this patch tested? Pass the Jenkins tests (including a new testcase `Support spark.sql.codegen.maxCaseBranches option`) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12353 from dongjoon-hyun/SPARK-14577.
*	[SPARK-14398][SQL] Audit non-reserved keyword list in ANTLR4 parser	bomeng	2016-04-19	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? I have compared non-reserved list in Antlr3 and Antlr4 one by one as well as all the existing keywords defined in Antlr4, added the missing keywords to the non-reserved keywords list. If we need to support more syntax, we can add more keywords by then. Any recommendation for the above is welcome. ## How was this patch tested? I manually checked the keywords one by one. Please let me know if there is a better way to test. Another thought: I suggest to put all the keywords definition and non-reserved list in order, that will be much easier to check in the future. Author: bomeng <bmeng@us.ibm.com> Closes #12191 from bomeng/SPARK-14398.
*	[SPARK-14595][SQL] add input metrics for FileScanRDD	Wenchen Fan	2016-04-18	1	-7/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This is roughly based on the input metrics logic in `SqlNewHadoopRDD` ## How was this patch tested? Not sure how to write a test, I manually verified it in Spark UI. Author: Wenchen Fan <wenchen@databricks.com> Closes #12352 from cloud-fan/metrics.
*	[SPARK-14722][SQL] Rename upstreams() -> inputRDDs() in WholeStageCodegen	Sameer Agarwal	2016-04-18	11	-37/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Per rxin's suggestions, this patch renames `upstreams()` to `inputRDDs()` in `WholeStageCodegen` for better implied semantics ## How was this patch tested? N/A Author: Sameer Agarwal <sameer@databricks.com> Closes #12486 from sameeragarwal/codegen-cleanup.
*	[SPARK-14718][SQL] Avoid mutating ExprCode in doGenCode	Sameer Agarwal	2016-04-18	32	-440/+362
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The `doGenCode` method currently takes in an `ExprCode`, mutates it and returns the java code to evaluate the given expression. It should instead just return a new `ExprCode` to avoid passing around mutable objects during code generation. ## How was this patch tested? Existing Tests Author: Sameer Agarwal <sameer@databricks.com> Closes #12483 from sameeragarwal/new-exprcode-2.
*	[SPARK-14667] Remove HashShuffleManager	Reynold Xin	2016-04-18	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager. ## How was this patch tested? Removed some tests related to the old manager. Author: Reynold Xin <rxin@databricks.com> Closes #12423 from rxin/SPARK-14667.
*	[SPARK-14674][SQL] Move HiveContext.hiveconf to HiveSessionState	Andrew Or	2016-04-18	18	-71/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This is just cleanup. This allows us to remove HiveContext later without inflating the diff too much. This PR fixes the conflicts of https://github.com/apache/spark/pull/12431. It also removes the `def hiveConf` from `HiveSqlParser`. So, we will pass the HiveConf associated with a session explicitly instead of relying on Hive's `SessionState` to pass `HiveConf`. ## How was this patch tested? Existing tests. Closes #12431 Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12449 from yhuai/hiveconf.
*	[SPARK-14710][SQL] Rename gen/genCode to genCode/doGenCode to better reflect ↵	Sameer Agarwal	2016-04-18	44	-298/+305
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	the semantics ## What changes were proposed in this pull request? Per rxin's suggestions, this patch renames `s/gen/genCode` and `s/genCode/doGenCode` to better reflect the semantics of these 2 function calls. ## How was this patch tested? N/A (refactoring only) Author: Sameer Agarwal <sameer@databricks.com> Closes #12475 from sameeragarwal/gencode.
*	[MINOR] Revert removing explicit typing (changed in some examples and ↵	hyukjinkwon	2016-04-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	StatFunctions) ## What changes were proposed in this pull request? This PR reverts some changes in https://github.com/apache/spark/pull/12413. (please see the discussion in that PR). from ```scala words.foreachRDD { (rdd, time) => ... ``` to ```scala words.foreachRDD { (rdd: RDD[String], time: Time) => ... ``` Also, this was discussed in dev-mailing list, [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-Scala-style-explicit-typing-within-transformation-functions-and-anonymous-val-td17173.html) ## How was this patch tested? This was tested with `sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12452 from HyukjinKwon/revert-explicit-typing.
*	[SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState	Andrew Or	2016-04-18	8	-122/+175
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #12463 from yhuai/sharedState.
*	[HOTFIX] Fix Scala 2.10 compilation break.	Reynold Xin	2016-04-18	1	-2/+2
\|
*	[SPARK-14580][SPARK-14655][SQL] Hive IfCoercion should preserve predicate.	Dongjoon Hyun	2016-04-18	5	-7/+70
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose return-type are null. However, some UDFs need evaluations because they are designed to throw exceptions. This PR fixes that to preserve the predicates. Also, `assert_true` is implemented as Spark SQL function. Before ``` scala> sql("select if(assert_true(false),2,3)").head res2: org.apache.spark.sql.Row = [3] ``` After ``` scala> sql("select if(assert_true(false),2,3)").head ... ASSERT_TRUE ... ``` Hive ``` hive> select if(assert_true(false),2,3); OK Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed. ``` ## How was this patch tested? Pass the Jenkins tests (including a new testcase in `HivePlanTest`) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12340 from dongjoon-hyun/SPARK-14580.
*	[SPARK-14473][SQL] Define analysis rules to catch operations not supported ↵	Tathagata Das	2016-04-18	16	-12/+685
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	in streaming ## What changes were proposed in this pull request? There are many operations that are currently not supported in the streaming execution. For example: - joining two streams - unioning a stream and a batch source - sorting - window functions (not time windows) - distinct aggregates Furthermore, executing a query with a stream source as a batch query should also fail. This patch add an additional step after analysis in the QueryExecution which will check that all the operations in the analyzed logical plan is supported or not. ## How was this patch tested? unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12246 from tdas/SPARK-14473.
*	[SPARK-14614] [SQL] Add `bround` function	Dongjoon Hyun	2016-04-18	7	-27/+113
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR aims to add `bound` function (aka Banker's round) by extending current `round` implementation. [Hive supports `bround` since 1.3.0.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) Hive (1.3 ~ 2.0) ``` hive> select round(2.5), bround(2.5); OK 3.0 2.0 ``` After this PR ```scala scala> sql("select round(2.5), bround(2.5)").head res0: org.apache.spark.sql.Row = [3,2] ``` ## How was this patch tested? Pass the Jenkins tests (with extended tests). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12376 from dongjoon-hyun/SPARK-14614.
*	[SPARK-14696][SQL] Add implicit encoders for boxed primitive types	Reynold Xin	2016-04-18	2	-0/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We currently only have implicit encoders for scala primitive types. We should also add implicit encoders for boxed primitives. Otherwise, the following code would not have an encoder: ```scala sqlContext.range(1000).map { i => i } ``` ## How was this patch tested? Added a unit test case for this. Author: Reynold Xin <rxin@databricks.com> Closes #12466 from rxin/SPARK-14696.