path: root/sql
* [SPARK-11884] Drop multiple columns in the DataFrame API (tedyu, 2015-12-07, 2 files changed, -8/+23)

  See the thread Ben started: http://search-hadoop.com/m/q3RTtveEuhjsr7g/
  This PR adds a drop() method to DataFrame that accepts multiple column names.

  Author: tedyu <yuzhihong@gmail.com>
  Closes #9862 from ted-yu/master.
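  A minimal usage sketch of the new variadic signature; the column names are placeholders, not from the patch:

  ```scala
  import org.apache.spark.sql.DataFrame

  // One call removes several columns; previously this took chained
  // df.drop("age").drop("salary") invocations.
  def dropExample(df: DataFrame): DataFrame =
    df.drop("age", "salary")
  ```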
* [SPARK-12032][SQL] Re-order inner joins to do joins with conditions first (Davies Liu, 2015-12-07, 3 files changed, -6/+185)

  Currently the order of joins is exactly the same as in the SQL query, so some
  conditions may not be pushed down to the correct join; those joins then become
  cross products and are extremely slow. This patch tries to re-order the inner
  joins (which are common in SQL queries), picking the joins that have
  self-contained conditions first and delaying those that have no conditions.
  After this patch, the TPC-DS queries Q64/Q65 run hundreds of times faster.

  cc marmbrus nongli

  Author: Davies Liu <davies@databricks.com>
  Closes #10073 from davies/reorder_joins.
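  A hypothetical query shape illustrating the problem (table and column names invented):

  ```scala
  // Written in this order, a JOIN b has no usable predicate (both conditions
  // reference c), so it degenerates into a cross product; reordering to
  // (a JOIN c) JOIN b gives every join a condition.
  val query = """
    SELECT *
    FROM a, b, c
    WHERE a.key = c.key AND b.key = c.key
  """
  ```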
* [SPARK-12138][SQL] Escape \u in the generated comments of codegen (gatorsmile, 2015-12-06, 2 files changed, -1/+12)

  When \u appears in a comment block (i.e. inside /* ... */), codegen will
  break. So, in Expression and CodegenFallback, we escape \u to \\u.

  yhuai Please review it. I did reproduce it and it works after the fix. Thanks!

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #10155 from gatorsmile/escapeU.
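  A sketch of the idea, not the exact Spark code: the Java compiler expands \uXXXX unicode escapes before lexing, even inside comments, so a stray \u in embedded expression text breaks compilation of the generated source.

  ```scala
  // Escape "\u" so the emitted /* ... */ comment never contains a sequence
  // that javac would try to parse as a unicode escape.
  def escapeForComment(text: String): String =
    text.replace("\\u", "\\\\u")
  ```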
* [SPARK-12048][SQL] Prevent closing JDBC resources twice (gcc, 2015-12-06, 1 file changed, -0/+1)

  Author: gcc <spark-src@condor.rhaag.ip>
  Closes #10101 from rh99/master.
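  A sketch of the defensive pattern such a fix typically adds; this helper is hypothetical, not the one-line patch itself:

  ```scala
  import java.sql.ResultSet

  // Safe even if cleanup runs twice: a null or already-closed ResultSet
  // is simply skipped.
  def closeOnce(rs: ResultSet): Unit =
    if (rs != null && !rs.isClosed) rs.close()
  ```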
* [SPARK-12084][CORE] Fix code that uses ByteBuffer.array incorrectly (Shixiong Zhu, 2015-12-04, 4 files changed, -9/+14)

  `ByteBuffer` doesn't guarantee that all contents in `ByteBuffer.array` are
  valid, e.g. for a ByteBuffer returned by `ByteBuffer.slice`. We should not use
  the whole content of `ByteBuffer.array` unless we know that's correct. This
  patch fixes all places that use `ByteBuffer.array` incorrectly.

  Author: Shixiong Zhu <shixiong@databricks.com>
  Closes #10083 from zsxwing/bytebuffer-array.
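  A sketch of the safe pattern (an assumed helper, not a specific call site from the patch): read through position()/remaining() rather than assuming the backing array holds exactly the buffer's contents.

  ```scala
  import java.nio.ByteBuffer

  // Correct for sliced, wrapped, or partially consumed buffers; duplicate()
  // keeps the caller's position untouched.
  def toBytes(buffer: ByteBuffer): Array[Byte] = {
    val bytes = new Array[Byte](buffer.remaining())
    buffer.duplicate().get(bytes)
    bytes
  }
  ```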
* [SPARK-12112][BUILD] Upgrade to SBT 0.13.9 (Josh Rosen, 2015-12-05, 5 files changed, -9/+9)

  We should upgrade to SBT 0.13.9, since this is a requirement in order to use
  SBT's new Maven-style resolution features (which will be done in a separate
  patch, because it's blocked by some binary compatibility issues in the POM
  reader plugin). I also upgraded Scalastyle to version 0.8.0, which was
  necessary in order to fix a Scala 2.10.5 compatibility issue (see
  https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is
  slightly stricter about whitespace surrounding tokens, so I fixed the new
  style violations.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
* [SPARK-6990][BUILD] Add Java linting script; fix minor warnings (Dmitry Erastov, 2015-12-04, 4 files changed, -29/+63)

  This replaces https://github.com/apache/spark/pull/9696

  Invoke Checkstyle and print any errors to the console, failing the step. Use
  Google's style rules modified according to
  https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
  Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
  multiple violations being present in the codebase. I suggest fixing those
  TODOs in separate PR(s).

  More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).

  Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull))
  (duplicated because I ran the build twice with different profiles):

  > Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
  > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
  > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
  > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
  > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1

  Also fix some of the minor violations that didn't require sweeping changes.
  Apologies for the previous botched PRs - I finally figured out the issue.

  cr: JoshRosen, pwendell

  > I state that the contribution is my original work, and I license the work to
  > the project under the project's open source license.

  Author: Dmitry Erastov <derastov@gmail.com>
  Closes #9867 from dskrvk/master.
* [SPARK-11206] Support SQL UI on the history server (resubmit) (Carson Wang, 2015-12-03, 13 files changed, -129/+271)

  Resubmit #9297 and #9991.

  On the live web UI there is a SQL tab which provides valuable information for
  the SQL query, but once the workload is finished we won't see the SQL tab on
  the history server. It will be helpful to support the SQL UI on the history
  server so we can analyze a query even after its execution.

  To support the SQL UI on the history server:
  1. I added an onOtherEvent method to the SparkListener trait and post all SQL
     related events to the same event bus.
  2. Two SQL events, SparkListenerSQLExecutionStart and
     SparkListenerSQLExecutionEnd, are defined in the sql module.
  3. The new SQL events are written to the event log using Jackson.
  4. A new trait SparkHistoryListenerFactory is added to allow the history
     server to feed events to the SQL history listener. The SQL implementation
     is loaded at runtime using java.util.ServiceLoader.

  Author: Carson Wang <carson.wang@intel.com>
  Closes #10061 from carsonwang/SqlHistoryUI.
* [SPARK-12088][SQL] Check connection.isClosed before calling connection.getAutoCommit (Huaxin Gao, 2015-12-03, 1 file changed, -1/+1)

  The Java spec for java.sql.Connection declares:

    boolean getAutoCommit() throws SQLException
    Throws: SQLException - if a database access error occurs or this method is
    called on a closed connection

  So if conn.getAutoCommit is called on a closed connection, a SQLException will
  be thrown. Even though the code catches the SQLException and the program can
  continue, we should check conn.isClosed before calling conn.getAutoCommit to
  avoid the unnecessary SQLException.

  Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
  Closes #10095 from huaxingao/spark-12088.
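  A sketch of the guard (a hypothetical helper): consult isClosed first so a closed connection never triggers the documented SQLException.

  ```scala
  import java.sql.Connection

  // Returns None rather than throwing when the connection is gone.
  def safeAutoCommit(conn: Connection): Option[Boolean] =
    if (conn != null && !conn.isClosed) Some(conn.getAutoCommit) else None
  ```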
* [SPARK-12109][SQL] Expression's simpleString should delegate to its toString (Yin Huai, 2015-12-03, 3 files changed, -5/+3)

  https://issues.apache.org/jira/browse/SPARK-12109

  The change for https://issues.apache.org/jira/browse/SPARK-11596 exposed the
  problem. In the SQL plan visualization, the filter showed
  ![image](https://cloud.githubusercontent.com/assets/2072857/11547075/1a285230-9906-11e5-8481-2bb451e35ef1.png)
  After the changes in this PR, the visualization is back to normal.
  ![image](https://cloud.githubusercontent.com/assets/2072857/11547080/2bc570f4-9906-11e5-8897-3b3bff173276.png)

  Author: Yin Huai <yhuai@databricks.com>
  Closes #10111 from yhuai/SPARK-12109.
* [SPARK-12093][SQL] Fix an incorrect comment in DDLParser (Yadong Qi, 2015-12-03, 1 file changed, -3/+3)

  Author: Yadong Qi <qiyadong2010@gmail.com>
  Closes #10096 from watermen/patch-1.
* [SPARK-12094][SQL] Prettier tree string for TreeNode (Cheng Lian, 2015-12-02, 1 file changed, -5/+26)

  When examining plans of complex queries with multiple joins, a pain point of
  mine is that it's hard to immediately see the sibling node of a specific query
  plan node. This PR adds tree lines to the tree string of a `TreeNode`, so that
  the result is visually more intuitive.

  Author: Cheng Lian <lian@databricks.com>
  Closes #10099 from liancheng/prettier-tree-string.
* [SPARK-11949][SQL] Check bitmasks to set nullable property (Liang-Chi Hsieh, 2015-12-01, 1 file changed, -4/+9)

  Following up #10038: we can use bitmasks to determine which grouping
  expressions need to be set as nullable.

  cc yhuai

  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #10067 from viirya/fix-cube-following.
* [SPARK-12077][SQL] Change the default plan for single distinct (Davies Liu, 2015-12-01, 2 files changed, -3/+3)

  Try to match the behavior of single distinct aggregation in Spark 1.5. That
  approach is not scalable, though; we should be robust by default and keep a
  flag for addressing performance regressions in low-cardinality aggregation.

  cc yhuai nongli

  Author: Davies Liu <davies@databricks.com>
  Closes #10075 from davies/agg_15.
* [SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we should only return the simpleString (Yin Huai, 2015-12-01, 1 file changed, -1/+1)

  In TreeNode's argString, if a TreeNode is not a child of the current TreeNode,
  we will only return the simpleString. I tested the [following case provided by
  Cristian](https://issues.apache.org/jira/browse/SPARK-11596?focusedCommentId=15019241&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15019241).

  ```
  val c = (1 to 20).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
    println(s"PROCESSING >>>>>>>>>>> $idx")
    val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
    val union = curr.map(_.unionAll(df)).getOrElse(df)
    union.cache()
    Some(union)
  }

  c.get.explain(true)
  ```

  Without the change, `c.get.explain(true)` took 100s. With the change,
  `c.get.explain(true)` took 26ms.

  https://issues.apache.org/jira/browse/SPARK-11596

  Author: Yin Huai <yhuai@databricks.com>
  Closes #10079 from yhuai/SPARK-11596.
* [SPARK-11352][SQL] Escape */ in the generated comments (Yin Huai, 2015-12-01, 3 files changed, -3/+18)

  https://issues.apache.org/jira/browse/SPARK-11352

  Author: Yin Huai <yhuai@databricks.com>
  Closes #10072 from yhuai/SPARK-11352.
* [SPARK-11788][SQL] Surround timestamp/date values with quotes in the JDBC data source (Huaxin Gao, 2015-12-01, 2 files changed, -1/+14)

  When querying a Timestamp or Date column like

    val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" < end)

  the generated SQL query is "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0". It
  should have quotes around the Timestamp/Date value, such as
  "TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0'".

  Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
  Closes #9872 from huaxingao/spark-11788.
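  A sketch of the value-compilation rule the fix implies; this helper is hypothetical, not Spark's exact method:

  ```scala
  import java.sql.{Date, Timestamp}

  // Timestamps and dates get single quotes; strings also escape embedded
  // quotes; everything else falls back to toString.
  def compileValue(value: Any): String = value match {
    case t: Timestamp => s"'$t'"
    case d: Date      => s"'$d'"
    case s: String    => s"'${s.replace("'", "''")}'"
    case other        => other.toString
  }
  ```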
* [SPARK-11328][SQL] Improve error message when hitting this issue (Nong Li, 2015-12-01, 2 files changed, -3/+22)

  The issue is that the output committer is not idempotent, so retry attempts
  fail because the output file already exists. It is not safe to clean up the
  file, as this output committer is by design not retryable. Currently the job
  fails with a confusing file-exists error. This patch is a stopgap to tell the
  user to look at the top of the error log for the proper message.

  This is difficult to test locally as Spark is hardcoded not to retry.
  Manually verified by upping the retry attempts.

  Author: Nong Li <nong@databricks.com>
  Author: Nong Li <nongli@gmail.com>
  Closes #10080 from nongli/spark-11328.
* [SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up TestHive.reset() (Josh Rosen, 2015-12-02, 3 files changed, -14/+62)

  When profiling HiveCompatibilitySuite, I noticed that most of the time seems
  to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites
  based on HiveComparisionTest, such as HiveCompatibilitySuite, with the
  following changes:

  - Avoid `TestHive.reset()` whenever possible:
    - Use a simple set of heuristics to guess whether we need to call `reset()`
      in between tests.
    - As a safety net, automatically re-run failed tests by calling `reset()`
      before the re-attempt.
  - Speed up the expensive parts of `TestHive.reset()`: loading the `src` and
    `srcpart` tables took roughly 600ms per test, so we now avoid this by using
    a simple heuristic which only loads those tables for tests that reference
    them. This is based on simple string matching over the test queries, which
    errs on the side of loading in more situations than might be strictly
    necessary.

  After these changes, HiveCompatibilitySuite seems to run in about 10 minutes.

  This PR is a revival of #6663, an earlier experimental PR from June, where I
  played around with several possible speedups for this suite.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10055 from JoshRosen/speculative-testhive-reset.
* [SPARK-11905][SQL] Support Persist/Cache and Unpersist in Dataset APIs (gatorsmile, 2015-12-01, 6 files changed, -18/+162)

  Persist and unpersist exist in both the RDD and DataFrame APIs, and I think
  they are still very critical in the Dataset API. Not sure if my understanding
  is correct? If so, could you help me check if the implementation is
  acceptable? Please provide your opinions.

  marmbrus rxin cloud-fan Thank you very much!

  Author: gatorsmile <gatorsmile@gmail.com>
  Author: xiaoli <lixiao1983@gmail.com>
  Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
  Closes #9889 from gatorsmile/persistDS.
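  A usage sketch, assuming the Dataset methods mirror the DataFrame signatures; the storage level choice is illustrative:

  ```scala
  import org.apache.spark.sql.Dataset
  import org.apache.spark.storage.StorageLevel

  // Cache the Dataset across repeated actions, then release the storage.
  def cachedCount[T](ds: Dataset[T]): Long = {
    ds.persist(StorageLevel.MEMORY_AND_DISK)
    try ds.count()
    finally ds.unpersist()
  }
  ```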
* [SPARK-11954][SQL] Encoder for JavaBeans (Wenchen Fan, 2015-12-01, 9 files changed, -20/+608)

  Create Java versions of `constructorFor` and `extractorFor` in
  `JavaTypeInference`.

  Author: Wenchen Fan <wenchen@databricks.com>

  This patch had conflicts when merged, resolved by
  Committer: Michael Armbrust <michael@databricks.com>

  Closes #9937 from cloud-fan/pojo.
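  A usage sketch of the resulting capability: a JavaBean-style class can back a Dataset via Encoders.bean. The `Person` bean below is invented for illustration.

  ```scala
  import org.apache.spark.sql.{Encoder, Encoders}

  // A minimal bean: no-arg constructor plus getter/setter pairs.
  class Person {
    private var name: String = _
    private var age: Int = _
    def getName(): String = name
    def setName(v: String): Unit = { name = v }
    def getAge(): Int = age
    def setAge(v: Int): Unit = { age = v }
  }

  val personEncoder: Encoder[Person] = Encoders.bean(classOf[Person])
  // e.g. sqlContext.createDataset(people)(personEncoder)
  ```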
* [SPARK-11856][SQL] Add type cast if the real type is different but compatible with the encoder schema (Wenchen Fan, 2015-12-01, 10 files changed, -32/+335)

  When we build the `fromRowExpression` for an encoder, we set up a lot of
  "unresolved" stuff and lose the required data type, which may lead to a
  runtime error if the real type doesn't match the encoder's schema. For
  example, if we build an encoder for `case class Data(a: Int, b: String)` and
  the real type is `[a: int, b: long]`, then we will hit a runtime error saying
  that we can't construct class `Data` with int and long, because we lost the
  information that `b` should be a string.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9840 from cloud-fan/err-msg.
* [SPARK-12068][SQL] Using a single column in Dataset.groupBy and count will fail (Wenchen Fan, 2015-12-01, 4 files changed, -7/+27)

  The reason is that, for a single-column `RowEncoder` (or a single-field
  product encoder), when we use it as the encoder for the grouping key, we
  should also combine the grouping attributes, even though there is only one
  grouping attribute.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10059 from cloud-fan/bug.
* [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues (Cheng Lian, 2015-12-01, 1 file changed, -5/+6)

  This PR backports PR #10039 to master.

  Author: Cheng Lian <lian@databricks.com>
  Closes #10063 from liancheng/spark-12046.doc-fix.master.
* [SPARK-11949][SQL] Set field nullable property for GroupingSets to get correct results for null values (Liang-Chi Hsieh, 2015-12-01, 2 files changed, -2/+18)

  JIRA: https://issues.apache.org/jira/browse/SPARK-11949

  The result of a cube plan uses an incorrect schema. The schema of the cube
  result should set the nullable property to true, because the grouping
  expressions will produce null values.

  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #10038 from viirya/fix-cube.
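  An illustration of why nullability matters here (column names invented): cube emits subtotal rows in which the grouping columns are null, so the output fields must be nullable even when the input columns are not.

  ```scala
  import org.apache.spark.sql.DataFrame

  // Besides per-(dept, city) counts, cube emits rows like (dept, null, n)
  // and (null, null, grandTotal); reading those requires nullable fields.
  def cubeCounts(df: DataFrame): DataFrame =
    df.cube("dept", "city").count()
  ```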
* [SPARK-12018][SQL] Refactor common subexpression elimination code (Liang-Chi Hsieh, 2015-11-30, 3 files changed, -34/+14)

  JIRA: https://issues.apache.org/jira/browse/SPARK-12018

  The common subexpression elimination code can be factored and simplified, and
  some unnecessary variables can be removed.

  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #10009 from viirya/refactor-subexpr-eliminate.
* Fix Maven build (Davies Liu, 2015-11-30, 1 file changed, -1/+1)
* Revert "[SPARK-11206] Support SQL UI on the history server" (Josh Rosen, 2015-11-30, 13 files changed, -269/+129)

  This reverts commit cc243a079b1c039d6e7f0b410d1654d94a090e14 / PR #9297.

  I'm reverting this because it broke SQLListenerMemoryLeakSuite in the master
  Maven builds. See #9991 for a discussion of why this broke the tests.
* [SPARK-11982][SQL] Improve performance of cartesian product (Davies Liu, 2015-11-30, 2 files changed, -9/+69)

  This PR improves the performance of CartesianProduct by caching the result of
  the right plan. After this patch, the query time of TPC-DS Q65 goes down to
  4 seconds from 28 minutes (420x faster).

  cc nongli

  Author: Davies Liu <davies@databricks.com>
  Closes #9969 from davies/improve_cartesian.
* [SPARK-11700][SQL] Remove thread-local SQLContext in SparkPlan (Davies Liu, 2015-11-30, 6 files changed, -21/+14)

  In 1.6 we introduced a public API to get a SQLContext for the current thread;
  SparkPlan should use that.

  Author: Davies Liu <davies@databricks.com>
  Closes #9990 from davies/leak_context.
* [SPARK-11989][SQL] Only use commit in the JDBC data source if the underlying database supports transactions (CK50, 2015-11-30, 1 file changed, -3/+19)

  Fixes [SPARK-11989](https://issues.apache.org/jira/browse/SPARK-11989)

  Author: CK50 <christian.kurz@oracle.com>
  Author: Christian Kurz <christian.kurz@oracle.com>
  Closes #9973 from CK50/branch-1.6_non-transactional.

  (cherry picked from commit a589736a1b237ef2f3bd59fbaeefe143ddcc8f4e)
  Signed-off-by: Reynold Xin <rxin@databricks.com>
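  A sketch of the guard the title describes; the exact metadata probe below (DatabaseMetaData.supportsTransactions) is an assumption, and the actual patch may consult different driver capabilities:

  ```scala
  import java.sql.Connection

  // Skip commit() entirely when the driver reports no transaction support,
  // instead of letting it throw.
  def commitIfSupported(conn: Connection): Unit =
    if (conn.getMetaData.supportsTransactions()) conn.commit()
  ```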
* [SPARK-12039][SQL] Ignore HiveSparkSubmitSuite's "SPARK-9757 Persist Parquet relation with decimal column" (Yin Huai, 2015-11-29, 1 file changed, -1/+1)

  https://issues.apache.org/jira/browse/SPARK-12039

  Since it is pretty flaky in the Hadoop 1 tests, we can disable it while we
  investigate the cause.

  Author: Yin Huai <yhuai@databricks.com>
  Closes #10035 from yhuai/SPARK-12039-ignore.
* [SPARK-12024][SQL] More efficient multi-column counting (Herman van Hovell, 2015-11-29, 6 files changed, -86/+33)

  In https://github.com/apache/spark/pull/9409 we enabled multi-column counting.
  The approach taken in that PR introduces a bit of overhead by first creating a
  row only to check if all of the columns are non-null. This PR fixes that
  technical debt: Count now takes multiple columns as its input. In order to
  make this work, I have also added support for multiple columns in the single
  distinct code path.

  cc yhuai

  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #10015 from hvanhovell/SPARK-12024.
* [SPARK-12028][SQL] get_json_object returns an incorrect result when the value is a null literal (gatorsmile, 2015-11-27, 2 files changed, -2/+25)

  When calling `get_json_object` for the following two cases, both results are
  `"null"`:

  ```scala
  val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
  val df: DataFrame = tuple.toDF("key", "jstring")
  val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
  ```

  ```scala
  val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
  val df2: DataFrame = tuple2.toDF("key", "jstring")
  val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
  ```

  Fixed the problem and also added a test case.

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #10018 from gatorsmile/get_json_object.
* [SPARK-11997][SQL] NPE when saving a DataFrame as parquet partitioned by a long column (Dilip Biswal, 2015-11-26, 2 files changed, -1/+14)

  Check for partition column nullability while building the partition spec.

  Author: Dilip Biswal <dbiswal@us.ibm.com>
  Closes #10001 from dilipbiswal/spark-11997.
* Fix style violation for b63938a8b04 (Reynold Xin, 2015-11-26, 1 file changed, -1/+3)
* [SPARK-11778][SQL] Add regression test (Huaxin Gao, 2015-11-26, 2 files changed, -10/+32)

  Fix regression test for SPARK-11778. marmbrus Could you please take a look?
  Thank you very much!!

  Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
  Closes #9890 from huaxingao/spark-11778-regression-test.
* [SPARK-11881][SQL] Fix for postgresql fetchsize > 0 (mariusvniekerk, 2015-11-26, 3 files changed, -1/+39)

  Reference: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor

  In order for PostgreSQL to honor a non-zero fetchSize setting, its
  Connection.autoCommit needs to be set to false; otherwise it will just quietly
  ignore the fetchSize setting. This adds a new side-effecting, dialect-specific
  beforeFetch method that fires before a select query is run.

  Author: mariusvniekerk <marius.v.niekerk@gmail.com>
  Closes #9861 from mariusvniekerk/SPARK-11881.
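  A sketch of the underlying JDBC requirement the dialect hook enforces (connection URL and table name are placeholders): with autoCommit off, PostgreSQL uses a cursor and honors fetchSize; with it on, the driver silently fetches the whole result set.

  ```scala
  import java.sql.DriverManager

  val conn = DriverManager.getConnection("jdbc:postgresql://host/db")
  conn.setAutoCommit(false)   // required, or fetchSize is quietly ignored
  val stmt = conn.createStatement()
  stmt.setFetchSize(1000)     // rows per round trip instead of the full result
  val rs = stmt.executeQuery("SELECT * FROM big_table")
  ```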
* [SPARK-12011][SQL] Stddev/Variance etc. should support columnName as arguments (Yanbo Liang, 2015-11-26, 2 files changed, -0/+89)

  The Spark SQL aggregate functions

  ```
  stddev
  stddev_pop
  stddev_samp
  variance
  var_pop
  var_samp
  skewness
  kurtosis
  collect_list
  collect_set
  ```

  should support `columnName` as arguments like the other aggregate functions
  (max/min/count/sum).

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9994 from yanboliang/SPARK-12011.
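  A usage sketch (the DataFrame and its "value" column are placeholders): after this change, the string-name form works alongside the Column form.

  ```scala
  import org.apache.spark.sql.{DataFrame, functions => F}

  // Both aggregates take the column by name; the string overloads are the
  // newly added convenience.
  def spread(df: DataFrame): DataFrame =
    df.agg(F.stddev("value"), F.var_samp("value"))
  ```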
* [SPARK-11973][SQL] Improve optimizer code readability (Reynold Xin, 2015-11-26, 2 files changed, -26/+26)

  This is a follow-up to https://github.com/apache/spark/pull/9959. I added more
  documentation and rewrote some monadic code into simpler ifs.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #9995 from rxin/SPARK-11973.
* [SPARK-11998][SQL][TEST-HADOOP2.0] When downloading Hadoop artifacts from maven, we need to try to download the version that is used by Spark (Yin Huai, 2015-11-26, 3 files changed, -17/+72)

  If we need to download Hive/Hadoop artifacts, try to download a Hadoop that
  matches the Hadoop used by Spark. If the Hadoop artifact cannot be resolved
  (e.g. the Hadoop version is a vendor-specific version like 2.0.0-cdh4.1.1), we
  will use Hadoop 2.4.0 (we used to hard-code this version as the Hadoop that we
  download from maven) and we will not share Hadoop classes.

  I tested this on my laptop with the following confs (these confs are used by
  our builds). All tests are good.

  ```
  build/sbt -Phadoop-1 -Dhadoop.version=1.2.1 -Pkinesis-asl -Phive-thriftserver -Phive
  build/sbt -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Pkinesis-asl -Phive-thriftserver -Phive
  build/sbt -Pyarn -Phadoop-2.2 -Pkinesis-asl -Phive-thriftserver -Phive
  build/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive-thriftserver -Phive
  ```

  Author: Yin Huai <yhuai@databricks.com>
  Closes #9979 from yhuai/versionsSuite.
* [SPARK-11863][SQL] Unable to resolve order by if it contains a mixture of aliases and real columns (Dilip Biswal, 2015-11-26, 2 files changed, -3/+28)

  This is based on https://github.com/apache/spark/pull/9844, with some bug
  fixes and clean-up.

  The problem is that a normal operator should be resolved based on its child,
  but the `Sort` operator can also be resolved based on its grandchild. So we
  have three rules that can resolve `Sort`: `ResolveReferences`,
  `ResolveSortReferences` (if the grandchild is `Project`) and
  `ResolveAggregateFunctions` (if the grandchild is `Aggregate`).

  For example, in `select c1 as a, c2 as b from tab group by c1, c2 order by a,
  c2`, we need to resolve `a` and `c2` for `Sort`. First, `a` is resolved in
  `ResolveReferences` based on its child, and when we reach
  `ResolveAggregateFunctions`, we try to resolve both `a` and `c2` based on the
  grandchild, but fail because `a` is not a legal aggregate expression.

  Whoever merges this PR, please give the credit to dilipbiswal.

  Author: Dilip Biswal <dbiswal@us.ibm.com>
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9961 from cloud-fan/sort.
* [SPARK-12005][SQL] Work around VerifyError in HyperLogLogPlusPlus (Marcelo Vanzin, 2015-11-26, 1 file changed, -5/+8)

  Just move the code around a bit; that seems to make the JVM happy.

  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #9985 from vanzin/SPARK-12005.
* [SPARK-11973][SQL] Push filter through aggregation with aliases and literals (Davies Liu, 2015-11-26, 3 files changed, -11/+79)

  Currently a filter can't be pushed through an aggregation with aliases or
  literals; this patch fixes that. After this patch, the time of TPC-DS query 4
  goes down to 13 seconds from 141 seconds (a 10x improvement).

  cc nongli yhuai

  Author: Davies Liu <davies@databricks.com>
  Closes #9959 from davies/push_filter2.
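  An illustrative query shape (columns invented): the predicate references only the aliased grouping column, so the optimizer can now rewrite it in terms of the underlying column and filter before grouping.

  ```scala
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.sum

  // "k" aliases the grouping column "key"; with this patch the filter on
  // "k" is pushed below the aggregate as a filter on "key".
  def example(df: DataFrame): DataFrame =
    df.groupBy(df("key").as("k"))
      .agg(sum("value").as("total"))
      .filter("k > 10")
  ```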
* [SPARK-12003][SQL] Remove the prefix for name after expanded star (Davies Liu, 2015-11-25, 1 file changed, -1/+1)

  Right now, the expanded star includes the name of the expression as a prefix
  for each column. That's no better than not expanding, so we should not add the
  prefix.

  Author: Davies Liu <davies@databricks.com>
  Closes #9984 from davies/expand_star.
* [SPARK-11206] Support SQL UI on the history server (Carson Wang, 2015-11-25, 13 files changed, -129/+269)

  On the live web UI there is a SQL tab which provides valuable information for
  the SQL query, but once the workload is finished we won't see the SQL tab on
  the history server. It will be helpful to support the SQL UI on the history
  server so we can analyze a query even after its execution.

  To support the SQL UI on the history server:
  1. I added an `onOtherEvent` method to the `SparkListener` trait and post all
     SQL related events to the same event bus.
  2. Two SQL events, `SparkListenerSQLExecutionStart` and
     `SparkListenerSQLExecutionEnd`, are defined in the sql module.
  3. The new SQL events are written to the event log using Jackson.
  4. A new trait `SparkHistoryListenerFactory` is added to allow the history
     server to feed events to the SQL history listener. The SQL implementation
     is loaded at runtime using `java.util.ServiceLoader`.

  Author: Carson Wang <carson.wang@intel.com>
  Closes #9297 from carsonwang/SqlHistoryUI.
* [SPARK-11983][SQL] Remove all unused codegen fallback traits (Daoyuan Wang, 2015-11-25, 3 files changed, -6/+4)

  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #9966 from adrian-wang/removeFallback.
* Fix Aggregator documentation (rename present to finish) (Reynold Xin, 2015-11-25, 1 file changed, -1/+1)
* [SPARK-11969][SQL][PYSPARK] Visualization of SQL query for pyspark (Davies Liu, 2015-11-25, 2 files changed, -5/+14)

  Currently we do not have visualization for SQL queries issued from Python;
  this PR fixes that.

  cc zsxwing

  Author: Davies Liu <davies@databricks.com>
  Closes #9949 from davies/pyspark_sql_ui.
* [SPARK-11984][SQL][PYTHON] Fix typos in doc for pivot for scala and python (felixcheung, 2015-11-25, 1 file changed, -3/+3)

  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #9967 from felixcheung/pypivotdoc.