path: root/sql
Commit message | Author | Age | Files | Lines
...
* [SPARK-12188][SQL][FOLLOW-UP] Code refactoring and comment correction in Dataset APIs | gatorsmile | 2015-12-14 | 1 | -1/+1
  marmbrus This PR is to address your comment. Thanks for your review! Author: gatorsmile <gatorsmile@gmail.com> Closes #10214 from gatorsmile/followup12188.
* [SPARK-12274][SQL] WrapOption should not have type constraint for child | Wenchen Fan | 2015-12-14 | 1 | -4/+1
  I think it was a mistake, and we had not caught it until https://github.com/apache/spark/pull/10260, which begins to check whether the `fromRowExpression` is resolved. Author: Wenchen Fan <wenchen@databricks.com> Closes #10263 from cloud-fan/encoder.
* [SPARK-12275][SQL] No plan for BroadcastHint in some condition | yucai | 2015-12-13 | 2 | -1/+8
  When SparkStrategies.BasicOperators's "case BroadcastHint(child) => apply(child)" is hit, it only recursively invokes BasicOperators.apply with this "child". This gives many strategies no chance to process the plan, which probably leads to the "No plan" issue, so we use planLater to go through all strategies. https://issues.apache.org/jira/browse/SPARK-12275 Author: yucai <yucai.yu@intel.com> Closes #10265 from yucai/broadcast_hint.
* [SPARK-12213][SQL] use multiple partitions for single distinct query | Davies Liu | 2015-12-13 | 10 | -990/+422
  Currently, we can generate different plans for a query with a single distinct (depending on spark.sql.specializeSingleDistinctAggPlanning): one works better on low-cardinality columns, the other works better on high-cardinality columns (the default one). This PR changes to generate a single plan (three aggregations and two exchanges), which works well in both cases, so we can safely remove the flag `spark.sql.specializeSingleDistinctAggPlanning` (introduced in 1.6). For a query like `SELECT COUNT(DISTINCT a) FROM table`, the plan will be
  ```
  AGG-4 (count distinct)
    Shuffle to a single reducer
      Partial-AGG-3 (count distinct, no grouping)
        Partial-AGG-2 (grouping on a)
          Shuffle by a
            Partial-AGG-1 (grouping on a)
  ```
  This PR also includes a large refactor of aggregation (removing 500+ lines of code). cc yhuai nongli marmbrus Author: Davies Liu <davies@databricks.com> Closes #10228 from davies/single_distinct.
* [SPARK-12298][SQL] Fix infinite loop in DataFrame.sortWithinPartitions | Ankur Dave | 2015-12-11 | 2 | -3/+3
  Modifies the String overload to call the Column overload and ensures this is called in a test. Author: Ankur Dave <ankurdave@gmail.com> Closes #10271 from ankurdave/SPARK-12298.
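  To make the failure mode concrete, here is a self-contained sketch (illustrative only, not the Spark source) of the bug class being fixed: a varargs String overload that re-invokes itself instead of delegating to its Column-typed sibling recurses forever.
  ```scala
  object OverloadDelegation {
    case class Column(name: String)

    // Buggy shape: the String overload calls itself and never terminates.
    // def sort(col: String, cols: String*): Unit = sort(col, cols: _*)

    // Fixed shape: the String overload converts its arguments and delegates.
    def sort(cols: Column*): Unit =
      println("sorting by " + cols.map(_.name).mkString(", "))
    def sort(col: String, cols: String*): Unit =
      sort((col +: cols).map(Column(_)): _*)

    def main(args: Array[String]): Unit = sort("a", "b")
  }
  ```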
* [SPARK-12258] [SQL] passing null into ScalaUDF (follow-up) | Davies Liu | 2015-12-11 | 2 | -16/+23
  This is a follow-up PR for #10259. Author: Davies Liu <davies@databricks.com> Closes #10266 from davies/null_udf2.
* [SPARK-12258][SQL] passing null into ScalaUDF | Davies Liu | 2015-12-10 | 2 | -6/+10
  Check nullability and pass nulls into ScalaUDF. Closes #10249 Author: Davies Liu <davies@databricks.com> Closes #10259 from davies/udf_null.
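  As background for why null handling matters here, a minimal standalone sketch (illustrative, not the patch itself): a function over boxed integers can check for and propagate SQL NULL, whereas a primitive-typed function would fail on unboxing.
  ```scala
  object NullThrough {
    // Boxed parameter type: null can be checked and propagated.
    val inc: Integer => Integer = x => if (x == null) null else x + 1

    def main(args: Array[String]): Unit = {
      println(inc(null)) // null flows through instead of throwing
      println(inc(41))   // 42
    }
  }
  ```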
* [SPARK-12251] Document and improve off-heap memory configurations | Josh Rosen | 2015-12-10 | 3 | -3/+7
  This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs.
  - Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6).
  - Deprecated `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix.
  - Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion.
  - Document these configurations on the configuration page.
  Author: Josh Rosen <joshrosen@databricks.com> Closes #10237 from JoshRosen/SPARK-12251.
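  A usage sketch with the renamed keys from the commit message (the size value and its format are illustrative assumptions):
  ```scala
  import org.apache.spark.SparkConf

  object OffHeapConfig {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .set("spark.memory.offHeap.enabled", "true")    // replaces deprecated spark.unsafe.offHeap
        .set("spark.memory.offHeap.size", "2000000000") // must be non-zero when enabled is true
      println(conf.get("spark.memory.offHeap.size"))
    }
  }
  ```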
* [SPARK-12228][SQL] Try to run execution hive's derby in memory. | Yin Huai | 2015-12-10 | 4 | -5/+9
  This PR tries to make execution hive's derby run in memory, since it is a fake metastore and every time we create a HiveContext we switch to a new one. This may reduce the flakiness of our tests that need to create a HiveContext (e.g. HiveSparkSubmitSuite). I will test it more. https://issues.apache.org/jira/browse/SPARK-12228 Author: Yin Huai <yhuai@databricks.com> Closes #10204 from yhuai/derbyInMemory.
* [SPARK-12250][SQL] Allow users to define a UDAF without providing details of its inputSchema | Yin Huai | 2015-12-10 | 2 | -5/+64
  https://issues.apache.org/jira/browse/SPARK-12250 Author: Yin Huai <yhuai@databricks.com> Closes #10236 from yhuai/SPARK-12250.
* [SPARK-12242][SQL] Add DataFrame.transform method | Reynold Xin | 2015-12-10 | 2 | -1/+14
  Author: Reynold Xin <rxin@databricks.com> Closes #10226 from rxin/df-transform.
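  A usage sketch of the chaining style such a `transform` method enables (the helper functions and the `value` column are hypothetical; `df` is any existing DataFrame):
  ```scala
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.lit

  // Hypothetical single-purpose transformations.
  def withDoubled(df: DataFrame): DataFrame = df.withColumn("doubled", df("value") * 2)
  def withLabel(df: DataFrame): DataFrame   = df.withColumn("label", lit("x"))

  // transform lets pipelines read left to right instead of inside out:
  val result = df.transform(withDoubled).transform(withLabel)
  // equivalent to: withLabel(withDoubled(df))
  ```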
* [SPARK-12252][SPARK-12131][SQL] refactor MapObjects to make it less hacky | Wenchen Fan | 2015-12-10 | 4 | -47/+35
  In https://github.com/apache/spark/pull/10133 we found that we should ensure the children of `TreeNode` are all accessible in the `productIterator`, or the behavior will be very confusing. In this PR, I try to fix this problem by exposing the `loopVar`. This also fixes SPARK-12131, which is caused by the hacky `MapObjects`. Author: Wenchen Fan <wenchen@databricks.com> Closes #10239 from cloud-fan/map-objects.
* [SPARK-11796] Fix httpclient and httpcore dependency issues related to docker-client | Mark Grover | 2015-12-09 | 1 | -2/+0
  This commit fixes dependency issues which prevented the Docker-based JDBC integration tests from running in the Maven build. Author: Mark Grover <mgrover@cloudera.com> Closes #9876 from markgrover/master_docker.
* [SPARK-12012][SQL] Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan | Cheng Lian | 2015-12-09 | 10 | -31/+86
  This PR adds a `private[sql]` method `metadata` to `SparkPlan`, which can be used to describe detailed information about a physical plan during visualization. Specifically, this PR uses this method to provide details of `PhysicalRDD`s translated from a data source relation. For example, a `ParquetRelation` converted from Hive metastore table `default.psrc` is now shown as the following screenshot: ![image](https://cloud.githubusercontent.com/assets/230655/11526657/e10cb7e6-9916-11e5-9afa-f108932ec890.png) And here is the screenshot for a regular `ParquetRelation` (not converted from a Hive metastore table) loaded from a really long path: ![output](https://cloud.githubusercontent.com/assets/230655/11680582/37c66460-9e94-11e5-8f50-842db5309d5a.png) Author: Cheng Lian <lian@databricks.com> Closes #10004 from liancheng/spark-12012.physical-rdd-metadata.
* [SPARK-11676][SQL] Parquet filter tests all pass if filters are not really pushed down | hyukjinkwon | 2015-12-09 | 1 | -28/+41
  Currently the Parquet predicate tests all pass even if filters are not pushed down or pushdown is disabled. In this PR, to check filter evaluation, it builds the expression from `expression.Filter` and then tries to create filters just like Spark does. To check the results, it manually accesses the child RDD (of `expression.Filter`) and produces the results which should be filtered properly, then compares them to the expected values. Now, if filters are not pushed down or pushdown is disabled, this throws exceptions. Author: hyukjinkwon <gurwls223@gmail.com> Closes #9659 from HyukjinKwon/SPARK-11676.
* [SPARK-12069][SQL] Update documentation with Datasets | Michael Armbrust | 2015-12-08 | 2 | -4/+65
  Author: Michael Armbrust <michael@databricks.com> Closes #10060 from marmbrus/docs.
* [SPARK-12205][SQL] Pivot fails Analysis when aggregate is UnresolvedFunction | Andrew Ray | 2015-12-08 | 2 | -1/+9
  Delays application of ResolvePivot until all aggregates are resolved, to prevent problems with UnresolvedFunction, and adds a unit test. Author: Andrew Ray <ray.andrew@gmail.com> Closes #10202 from aray/sql-pivot-unresolved-function.
* [SPARK-12188][SQL] Code refactoring and comment correction in Dataset APIs | gatorsmile | 2015-12-08 | 1 | -40/+40
  This PR contains the following updates:
  - Created a new private variable `boundTEncoder` that can be shared by multiple functions, `RDD`, `select` and `collect`.
  - Replaced all the `queryExecution.analyzed` with the function call `logicalPlan`.
  - A few API comments are using wrong class names (e.g., `DataFrame`) or parameter names (e.g., `n`).
  - A few API descriptions are wrong (e.g., `mapPartitions`).
  marmbrus rxin cloud-fan Could you take a look and check if they are appropriate? Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes #10184 from gatorsmile/datasetClean.
* [SPARK-12195][SQL] Adding BigDecimal, Date and Timestamp into Encoder | gatorsmile | 2015-12-08 | 2 | -0/+35
  This PR is to add three more data types into Encoder: `BigDecimal`, `Date` and `Timestamp`. marmbrus cloud-fan rxin Could you take a quick look at these three types? Not sure if it can be merged to 1.6. Thank you very much! Author: gatorsmile <gatorsmile@gmail.com> Closes #10188 from gatorsmile/dataTypesinEncoder.
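  A sketch of what these encoders unlock; the case class is hypothetical, and the commented usage assumes Spark 1.6-style implicits are in scope:
  ```scala
  import java.sql.{Date, Timestamp}

  // A class whose fields exercise the three newly supported encoder types.
  case class Payment(amount: BigDecimal, day: Date, at: Timestamp)

  // With encoders for these types available, a typed Dataset can be built, e.g.:
  // import sqlContext.implicits._
  // val ds = Seq(Payment(BigDecimal("9.99"), Date.valueOf("2015-12-08"), new Timestamp(0L))).toDS()
  ```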
* [SPARK-12201][SQL] add type coercion rule for greatest/least | Wenchen Fan | 2015-12-08 | 3 | -0/+47
  Checked with Hive: greatest/least should cast their children to the tightest common type, i.e. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5, 10)) => error`. Author: Wenchen Fan <wenchen@databricks.com> Closes #10196 from cloud-fan/type-coercion.
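  To illustrate the rule, a sketch assuming a Spark 1.6-era `sqlContext` is in scope (the queries are hypothetical):
  ```scala
  // int and long share the tightest common type long, so this resolves to BIGINT:
  sqlContext.sql("SELECT greatest(CAST(1 AS INT), CAST(2 AS BIGINT))").show()

  // int and string have no tightest common type, so this fails analysis:
  // sqlContext.sql("SELECT greatest(1, 'a')").show()
  ```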
* [SPARK-11884] Drop multiple columns in the DataFrame API | tedyu | 2015-12-07 | 2 | -8/+23
  See the thread Ben started: http://search-hadoop.com/m/q3RTtveEuhjsr7g/ This PR adds a drop() method to DataFrame which accepts multiple column names. Author: tedyu <yuzhihong@gmail.com> Closes #9862 from ted-yu/master.
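  A usage sketch; the column names are illustrative, `df` is any existing DataFrame, and the exact signature is as the commit describes (accepts multiple column names), sketched here as varargs:
  ```scala
  // Drop two columns in one call instead of chaining single-column drops:
  val trimmed = df.drop("a", "b") // previously: df.drop("a").drop("b")
  ```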
* [SPARK-12032] [SQL] Re-order inner joins to do join with conditions first | Davies Liu | 2015-12-07 | 3 | -6/+185
  Currently, the order of joins is exactly the same as in the SQL query; some conditions may not be pushed down to the correct join, and those joins then become cross products, which are extremely slow. This patch tries to re-order the inner joins (which are common in SQL queries), picking the joins that have self-contained conditions first and delaying those that do not have conditions. After this patch, the TPCDS queries Q64/65 can run hundreds of times faster. cc marmbrus nongli Author: Davies Liu <davies@databricks.com> Closes #10073 from davies/reorder_joins.
* [SPARK-12138][SQL] Escape \u in the generated comments of codegen | gatorsmile | 2015-12-06 | 2 | -1/+12
  When \u appears in a comment block (i.e. in /* */), codegen will break. So, in Expression and CodegenFallback, we escape \u to \\u. yhuai Please review it. I did reproduce it and it works after the fix. Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #10155 from gatorsmile/escapeU.
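  Why this breaks: the Java compiler translates unicode escapes before tokenizing, even inside comments, so a stray `\u` in a generated `/* ... */` block is rejected as a malformed escape. A self-contained sketch of the escaping (illustrative, not the Spark source):
  ```scala
  object EscapeU {
    // Double the backslash so javac sees literal characters, not an escape.
    def escapeForComment(text: String): String = text.replace("\\u", "\\\\u")

    def main(args: Array[String]): Unit = {
      val comment = "see \\u1 in the spec" // contains the raw characters \ and u
      println(escapeForComment(comment))  // prints: see \\u1 in the spec
    }
  }
  ```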
* [SPARK-12048][SQL] Prevent closing JDBC resources twice | gcc | 2015-12-06 | 1 | -0/+1
  Author: gcc <spark-src@condor.rhaag.ip> Closes #10101 from rh99/master.
* [SPARK-12084][CORE] Fix code that uses ByteBuffer.array incorrectly | Shixiong Zhu | 2015-12-04 | 4 | -9/+14
  `ByteBuffer` doesn't guarantee all contents in `ByteBuffer.array` are valid, e.g., a ByteBuffer returned by `ByteBuffer.slice`. We should not use the whole content of a `ByteBuffer` unless we know that's correct. This patch fixes all places that use `ByteBuffer.array` incorrectly. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10083 from zsxwing/bytebuffer-array.
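  A self-contained sketch of the pitfall: `array()` exposes the whole backing array, while only the range [arrayOffset + position, arrayOffset + limit) holds this buffer's bytes.
  ```scala
  import java.nio.ByteBuffer

  object ByteBufferArray {
    def main(args: Array[String]): Unit = {
      val buf = ByteBuffer.wrap("0123456789".getBytes("UTF-8"))
      buf.position(2); buf.limit(5)
      val slice = buf.slice() // shares the backing array, but with arrayOffset = 2

      // Wrong: reads all ten bytes, not just the slice's three.
      println(new String(slice.array(), "UTF-8"))

      // Right: respect arrayOffset and remaining.
      println(new String(slice.array(), slice.arrayOffset(), slice.remaining(), "UTF-8"))
    }
  }
  ```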
* [SPARK-12112][BUILD] Upgrade to SBT 0.13.9 | Josh Rosen | 2015-12-05 | 5 | -9/+9
  We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin). I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations. Author: Josh Rosen <joshrosen@databricks.com> Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
* [SPARK-6990][BUILD] Add Java linting script; fix minor warnings | Dmitry Erastov | 2015-12-04 | 4 | -29/+63
  This replaces https://github.com/apache/spark/pull/9696 Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide Some important checks are disabled (see TODOs in `checkstyle.xml`) due to multiple violations being present in the codebase. Suggest fixing those TODOs in separate PR(s). More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/). Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I ran the build twice with different profiles):
  > Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
  > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
  > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
  > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
  > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
  Also fix some of the minor violations that didn't require sweeping changes. Apologies for the previous botched PRs - I finally figured out the issue. cr: JoshRosen, pwendell
  > I state that the contribution is my original work, and I license the work to the project under the project's open source license.
  Author: Dmitry Erastov <derastov@gmail.com> Closes #9867 from dskrvk/master.
* [SPARK-11206] Support SQL UI on the history server (resubmit) | Carson Wang | 2015-12-03 | 13 | -129/+271
  Resubmit #9297 and #9991. On the live web UI, there is a SQL tab which provides valuable information for the SQL query. But once the workload is finished, we won't see the SQL tab on the history server. It will be helpful if we support SQL UI on the history server so we can analyze it even after its execution. To support SQL UI on the history server:
  1. I added an onOtherEvent method to the SparkListener trait and post all SQL related events to the same event bus.
  2. Two SQL events SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd are defined in the sql module.
  3. The new SQL events are written to the event log using Jackson.
  4. A new trait SparkHistoryListenerFactory is added to allow the history server to feed events to the SQL history listener. The SQL implementation is loaded at runtime using java.util.ServiceLoader.
  Author: Carson Wang <carson.wang@intel.com> Closes #10061 from carsonwang/SqlHistoryUI.
* [SPARK-12088][SQL] check connection.isClosed before calling connection… | Huaxin Gao | 2015-12-03 | 1 | -1/+1
  In the Java spec, java.sql.Connection has `boolean getAutoCommit() throws SQLException`, where the SQLException is thrown if a database access error occurs or the method is called on a closed connection. So if conn.getAutoCommit is called on a closed connection, a SQLException will be thrown. Even though the code catches the SQLException and the program can continue, I think we should check conn.isClosed before calling conn.getAutoCommit to avoid the unnecessary SQLException. Author: Huaxin Gao <huaxing@oc0558782468.ibm.com> Closes #10095 from huaxingao/spark-12088.
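  The guard described above, sketched as a hypothetical helper (not the Spark source):
  ```scala
  import java.sql.Connection

  object ConnectionGuard {
    // Check isClosed first so a closed connection short-circuits instead of
    // forcing getAutoCommit to throw (and us to catch) a SQLException.
    def autoCommitOrFalse(conn: Connection): Boolean =
      !conn.isClosed && conn.getAutoCommit
  }
  ```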
* [SPARK-12109][SQL] Expression's simpleString should delegate to its toString. | Yin Huai | 2015-12-03 | 3 | -5/+3
  https://issues.apache.org/jira/browse/SPARK-12109 The change of https://issues.apache.org/jira/browse/SPARK-11596 exposed the problem. In the SQL plan viz, the filter shows ![image](https://cloud.githubusercontent.com/assets/2072857/11547075/1a285230-9906-11e5-8481-2bb451e35ef1.png) After the changes in this PR, the viz is back to normal: ![image](https://cloud.githubusercontent.com/assets/2072857/11547080/2bc570f4-9906-11e5-8897-3b3bff173276.png) Author: Yin Huai <yhuai@databricks.com> Closes #10111 from yhuai/SPARK-12109.
* [SPARK-12093][SQL] Fix the comment error in DDLParser | Yadong Qi | 2015-12-03 | 1 | -3/+3
  Author: Yadong Qi <qiyadong2010@gmail.com> Closes #10096 from watermen/patch-1.
* [SPARK-12094][SQL] Prettier tree string for TreeNode | Cheng Lian | 2015-12-02 | 1 | -5/+26
  When examining plans of complex queries with multiple joins, a pain point of mine is that it's hard to immediately see the sibling node of a specific query plan node. This PR adds tree lines to the tree string of a `TreeNode`, so that the result is visually more intuitive. Author: Cheng Lian <lian@databricks.com> Closes #10099 from liancheng/prettier-tree-string.
* [SPARK-11949][SQL] Check bitmasks to set nullable property | Liang-Chi Hsieh | 2015-12-01 | 1 | -4/+9
  Following up #10038. We can use bitmasks to determine which grouping expressions need to be set as nullable. cc yhuai Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10067 from viirya/fix-cube-following.
* [SPARK-12077][SQL] change the default plan for single distinct | Davies Liu | 2015-12-01 | 2 | -3/+3
  Try to match the behavior of Spark 1.5 for single distinct aggregation, but that's not scalable; we should be robust by default and keep a flag to address the performance regression for low-cardinality aggregation. cc yhuai nongli Author: Davies Liu <davies@databricks.com> Closes #10075 from davies/agg_15.
* [SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we should only return the simpleString. | Yin Huai | 2015-12-01 | 1 | -1/+1
  In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we will only return the simpleString. I tested the [following case provided by Cristian](https://issues.apache.org/jira/browse/SPARK-11596?focusedCommentId=15019241&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15019241).
  ```
  val c = (1 to 20).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
    println(s"PROCESSING >>>>>>>>>>> $idx")
    val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
    val union = curr.map(_.unionAll(df)).getOrElse(df)
    union.cache()
    Some(union)
  }
  c.get.explain(true)
  ```
  Without the change, `c.get.explain(true)` took 100s. With the change, `c.get.explain(true)` took 26ms. https://issues.apache.org/jira/browse/SPARK-11596 Author: Yin Huai <yhuai@databricks.com> Closes #10079 from yhuai/SPARK-11596.
* [SPARK-11352][SQL] Escape */ in the generated comments. | Yin Huai | 2015-12-01 | 3 | -3/+18
  https://issues.apache.org/jira/browse/SPARK-11352 Author: Yin Huai <yhuai@databricks.com> Closes #10072 from yhuai/SPARK-11352.
* [SPARK-11788][SQL] surround timestamp/date value with quotes in JDBC data source | Huaxin Gao | 2015-12-01 | 2 | -1/+14
  When querying a Timestamp or Date column like the following
  ```
  val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" < end)
  ```
  the generated SQL query is `TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0`. It should have quotes around the Timestamp/Date value, such as `TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0'`. Author: Huaxin Gao <huaxing@oc0558782468.ibm.com> Closes #9872 from huaxingao/spark-11788.
* [SPARK-11328][SQL] Improve error message when hitting this issue | Nong Li | 2015-12-01 | 2 | -3/+22
  The issue is that the output committer is not idempotent and retry attempts will fail because the output file already exists. It is not safe to clean up the file, as this output committer is by design not retryable. Currently, the job fails with a confusing file-exists error. This patch is a stopgap to tell the user to look at the top of the error log for the proper message. This is difficult to test locally, as Spark is hardcoded not to retry. Manually verified by upping the retry attempts. Author: Nong Li <nong@databricks.com> Author: Nong Li <nongli@gmail.com> Closes #10080 from nongli/spark-11328.
* [SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up TestHive.reset() | Josh Rosen | 2015-12-02 | 3 | -14/+62
  When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites based on HiveComparisionTest, such as HiveCompatibilitySuite, with the following changes:
  - Avoid `TestHive.reset()` whenever possible: use a simple set of heuristics to guess whether we need to call `reset()` in between tests, and, as a safety net, automatically re-run failed tests by calling `reset()` before the re-attempt.
  - Speed up the expensive parts of `TestHive.reset()`: loading the `src` and `srcpart` tables took roughly 600ms per test, so we now avoid this by using a simple heuristic which only loads those tables for tests that reference them. This is based on simple string matching over the test queries, which errs on the side of loading in more situations than might be strictly necessary.
  After these changes, HiveCompatibilitySuite seems to run in about 10 minutes. This PR is a revival of #6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite. Author: Josh Rosen <joshrosen@databricks.com> Closes #10055 from JoshRosen/speculative-testhive-reset.
* [SPARK-11905][SQL] Support Persist/Cache and Unpersist in Dataset APIs | gatorsmile | 2015-12-01 | 6 | -18/+162
  Persist and Unpersist exist in both the RDD and DataFrame APIs. I think they are still very critical in the Dataset APIs. Not sure if my understanding is correct? If so, could you help me check if the implementation is acceptable? Please provide your opinions. marmbrus rxin cloud-fan Thank you very much! Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #9889 from gatorsmile/persistDS.
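  A usage sketch of the API surface this adds (method names per the commit subject; `ds` is a hypothetical existing Dataset):
  ```scala
  val cached = ds.persist() // or ds.cache(); marks the Dataset for in-memory reuse
  // ... run several actions against `cached` ...
  cached.unpersist()        // releases the cached data when no longer needed
  ```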
* [SPARK-11954][SQL] Encoder for JavaBeans | Wenchen Fan | 2015-12-01 | 9 | -20/+608
  Create a Java version of `constructorFor` and `extractorFor` in `JavaTypeInference`. Author: Wenchen Fan <wenchen@databricks.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #9937 from cloud-fan/pojo.
* [SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema | Wenchen Fan | 2015-12-01 | 10 | -32/+335
  When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lose the required data type, which may lead to a runtime error if the real type doesn't match the encoder's schema. For example, we build an encoder for `case class Data(a: Int, b: String)` and the real type is `[a: int, b: long]`; then we will hit a runtime error saying that we can't construct class `Data` with int and long, because we lost the information that `b` should be a string. Author: Wenchen Fan <wenchen@databricks.com> Closes #9840 from cloud-fan/err-msg.
* [SPARK-12068][SQL] use a single column in Dataset.groupBy and count will fail | Wenchen Fan | 2015-12-01 | 4 | -7/+27
  The reason is that, for a single-column `RowEncoder` (or a single-field product encoder), when we use it as the encoder for the grouping key, we should also combine the grouping attributes, although there is only one grouping attribute. Author: Wenchen Fan <wenchen@databricks.com> Closes #10059 from cloud-fan/bug.
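  The failing shape, sketched as a small standalone program (assuming Spark 1.6-era APIs; the data and app name are illustrative):
  ```scala
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object SingleColumnGroupBy {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val ds = Seq("a", "bb", "a").toDS()
      // Grouping on a single column previously failed; now it counts per key.
      ds.groupBy(s => s).count().collect().foreach(println)
      sc.stop()
    }
  }
  ```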
* [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues | Cheng Lian | 2015-12-01 | 1 | -5/+6
  This PR backports PR #10039 to master. Author: Cheng Lian <lian@databricks.com> Closes #10063 from liancheng/spark-12046.doc-fix.master.
* [SPARK-11949][SQL] Set field nullable property for GroupingSets to get correct results for null values | Liang-Chi Hsieh | 2015-12-01 | 2 | -2/+18
  JIRA: https://issues.apache.org/jira/browse/SPARK-11949 The result of a cube plan uses an incorrect schema. The schema of the cube result should set the nullable property to true, because the grouping expressions will produce null values. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10038 from viirya/fix-cube.
* [SPARK-12018][SQL] Refactor common subexpression elimination code | Liang-Chi Hsieh | 2015-11-30 | 3 | -34/+14
  JIRA: https://issues.apache.org/jira/browse/SPARK-12018 The code for common subexpression elimination can be factored and simplified, and some unnecessary variables can be removed. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10009 from viirya/refactor-subexpr-eliminate.
* fix Maven build | Davies Liu | 2015-11-30 | 1 | -1/+1
* Revert "[SPARK-11206] Support SQL UI on the history server"Josh Rosen2015-11-3013-269/+129
| | | | | | | | This reverts commit cc243a079b1c039d6e7f0b410d1654d94a090e14 / PR #9297 I'm reverting this because it broke SQLListenerMemoryLeakSuite in the master Maven builds. See #9991 for a discussion of why this broke the tests.
* [SPARK-11982] [SQL] improve performance of cartesian product | Davies Liu | 2015-11-30 | 2 | -9/+69
  This PR improves the performance of CartesianProduct by caching the result of the right plan. After this patch, the query time of TPC-DS Q65 goes down to 4 seconds from 28 minutes (420x faster). cc nongli Author: Davies Liu <davies@databricks.com> Closes #9969 from davies/improve_cartesian.
* [SPARK-11700] [SQL] Remove thread local SQLContext in SparkPlan | Davies Liu | 2015-11-30 | 6 | -21/+14
  In 1.6, we introduced a public API to get a SQLContext for the current thread; SparkPlan should use that. Author: Davies Liu <davies@databricks.com> Closes #9990 from davies/leak_context.