Commit message | Author

…long column
Check for partition column nullability while building the partition spec.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #10001 from dilipbiswal/spark-11997.
Fix regression test for SPARK-11778.
marmbrus, could you please take a look? Thank you very much!
Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
Closes #9890 from huaxingao/spark-11778-regression-test.
Reference: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor
For PostgreSQL to honor a non-zero fetchSize setting, the connection's autoCommit must be set to false; otherwise the driver just quietly ignores the fetchSize setting.
This adds a new side-effecting, dialect-specific beforeFetch method that fires before a select query is run.
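A minimal sketch of such a hook, assuming a simplified dialect class (the `beforeFetch` name follows this description; the exact signature in the patch may differ):
```scala
import java.sql.Connection

// Simplified stand-in for a JDBC dialect that exposes a pre-fetch hook.
abstract class DialectSketch {
  // Called just before a SELECT runs, so a dialect can adjust the connection.
  def beforeFetch(connection: Connection, properties: Map[String, String]): Unit = {}
}

object PostgresDialectSketch extends DialectSketch {
  override def beforeFetch(connection: Connection, properties: Map[String, String]): Unit = {
    // PostgreSQL silently ignores a non-zero fetch size unless autoCommit is off.
    if (properties.getOrElse("fetchsize", "0").toInt > 0) {
      connection.setAutoCommit(false)
    }
  }
}
```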
Author: mariusvniekerk <marius.v.niekerk@gmail.com>
Closes #9861 from mariusvniekerk/SPARK-11881.
Spark SQL aggregate functions:
```
stddev
stddev_pop
stddev_samp
variance
var_pop
var_samp
skewness
kurtosis
collect_list
collect_set
```
should support `columnName` arguments like the other aggregate functions (max/min/count/sum).
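Illustrative usage after the change, assuming a DataFrame `df` with a numeric column `value` (both names are hypothetical):
```scala
import org.apache.spark.sql.functions._

// The String overloads now mirror the Column overloads, just like
// max("value") has always worked alongside max($"value").
df.agg(stddev("value"), variance("value"), kurtosis("value"))
```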
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9994 from yanboliang/SPARK-12011.
This is a followup for https://github.com/apache/spark/pull/9959.
I added more documentation and rewrote some monadic code into simpler ifs.
Author: Reynold Xin <rxin@databricks.com>
Closes #9995 from rxin/SPARK-11973.
…Maven, we need to try to download the version that is used by Spark
If we need to download Hive/Hadoop artifacts, try to download a Hadoop version that matches the Hadoop used by Spark. If the Hadoop artifact cannot be resolved (e.g. the Hadoop version is a vendor-specific version like 2.0.0-cdh4.1.1), we will use Hadoop 2.4.0 (previously this version was hard-coded as the Hadoop to download from Maven) and we will not share Hadoop classes.
I tested this patch on my laptop with the following confs (these confs are used by our builds). All tests pass.
```
build/sbt -Phadoop-1 -Dhadoop.version=1.2.1 -Pkinesis-asl -Phive-thriftserver -Phive
build/sbt -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Pkinesis-asl -Phive-thriftserver -Phive
build/sbt -Pyarn -Phadoop-2.2 -Pkinesis-asl -Phive-thriftserver -Phive
build/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive-thriftserver -Phive
```
Author: Yin Huai <yhuai@databricks.com>
Closes #9979 from yhuai/versionsSuite.
…aliases and real columns
This is based on https://github.com/apache/spark/pull/9844, with some bug fixes and cleanup.
The problem is that a normal operator should be resolved based on its child, but the `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences` (if the grandchild is `Project`) and `ResolveAggregateFunctions` (if the grandchild is `Aggregate`).
For example, in `select c1 as a, c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. First, `a` is resolved by `ResolveReferences` based on the child; then, when we reach `ResolveAggregateFunctions`, we try to resolve both `a` and `c2` based on the grandchild, but fail because `a` is not a legal aggregate expression.
Whoever merges this PR, please give the credit to dilipbiswal.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9961 from cloud-fan/sort.
Just move the code around a bit; that seems to make the JVM happy.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #9985 from vanzin/SPARK-12005.
Currently, filters can't be pushed through an aggregation that has aliases or literals; this patch fixes that.
After this patch, the running time of TPC-DS query 4 goes down from 141 seconds to 13 seconds (a 10x improvement).
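As an illustration, here is a sketch assuming a hypothetical DataFrame `sales` with columns `year` and `amount`; the predicate on the aliased grouping column can now be evaluated before the aggregate:
```scala
import org.apache.spark.sql.functions.sum
import sqlContext.implicits._ // assumes a SQLContext in scope, for the $ syntax

// The filter on `y` only references the grouping column `year` (aliased as `y`),
// so the optimizer can push it below the aggregate and group far fewer rows.
val aggregated = sales.groupBy($"year".as("y")).agg(sum($"amount").as("total"))
val recent = aggregated.filter($"y" > 2000)
```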
cc nongli yhuai
Author: Davies Liu <davies@databricks.com>
Closes #9959 from davies/push_filter2.
Right now, the expanded star will include the name of the expression as a prefix for each column; that's no better than not expanding it, so we should not add the prefix.
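A sketch of the behavior in question (illustrative schema, assuming a SQLContext named `sqlContext` is in scope):
```scala
import org.apache.spark.sql.functions.struct
import sqlContext.implicits._ // for toDF and the $ syntax

// Pack two columns into a struct `s`, then expand it with a star.
// After this fix the expanded columns are named `a` and `b`, not `s.a` and `s.b`.
val df = Seq((1, 2)).toDF("a", "b").select(struct($"a", $"b").as("s"))
df.select("s.*").printSchema()
```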
Author: Davies Liu <davies@databricks.com>
Closes #9984 from davies/expand_star.
On the live web UI, there is a SQL tab that provides valuable information about SQL queries. But once the application finishes, the SQL tab is not available on the history server. It would be helpful to support the SQL UI on the history server so we can analyze queries even after their execution.
To support SQL UI on the history server:
1. I added an `onOtherEvent` method to the `SparkListener` trait and post all SQL-related events to the same event bus (see the sketch after this list).
2. Two SQL events `SparkListenerSQLExecutionStart` and `SparkListenerSQLExecutionEnd` are defined in the sql module.
3. The new SQL events are written to the event log using Jackson.
4. A new trait `SparkHistoryListenerFactory` is added to allow the history server to feed events to the SQL history listener. The SQL implementation is loaded at runtime using `java.util.ServiceLoader`.
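A minimal sketch of the new extension point from the listener side (simplified; the real `SparkListener` has many other callbacks, and the SQL event classes live in the sql module):
```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// A listener can now receive custom events (such as the SQL execution events)
// through the catch-all onOtherEvent callback on the shared event bus.
class SqlEventSniffer extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit =
    event.getClass.getSimpleName match {
      case "SparkListenerSQLExecutionStart" | "SparkListenerSQLExecutionEnd" =>
        println(s"Saw SQL event: $event")
      case _ => // not a SQL event; ignore
    }
}
```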
Author: Carson Wang <carson.wang@intel.com>
Closes #9297 from carsonwang/SqlHistoryUI.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #9966 from adrian-wang/removeFallback.
Currently, we do not have visualization for SQL queries run from Python; this PR fixes that.
cc zsxwing
Author: Davies Liu <davies@databricks.com>
Closes #9949 from davies/pyspark_sql_ui.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9967 from felixcheung/pypivotdoc.
…Queryable
Also added show methods to Dataset.
Author: Reynold Xin <rxin@databricks.com>
Closes #9964 from rxin/SPARK-11981.
…Dataset API
Besides inner joins, the other join types may also be useful when users are using the joinWith function, so this adds a joinType parameter to the existing joinWith call in the Dataset API.
It also provides another joinWith interface for cartesian-join-like functionality.
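Illustrative usage, sketched with hypothetical case classes (the join-type string follows the usual DataFrame join types):
```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col

case class Person(id: Long, name: String)
case class Order(personId: Long, amount: Double)

// A left outer joinWith keeps every person, paired with a matching order,
// or with null on the right side when no order exists for that person.
def ordersPerPerson(people: Dataset[Person], orders: Dataset[Order]): Dataset[(Person, Order)] =
  people.joinWith(orders, col("id") === col("personId"), "left_outer")
```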
Please provide your opinions. marmbrus rxin cloud-fan Thank you!
Author: gatorsmile <gatorsmile@gmail.com>
Closes #9921 from gatorsmile/joinWith.
Author: Reynold Xin <rxin@databricks.com>
Closes #9948 from rxin/SPARK-10621.
…Spark 2.0."
Also fixed some documentation issues I noticed along the way.
Author: Reynold Xin <rxin@databricks.com>
Closes #9930 from rxin/SPARK-11947.
…DataFrameReader
This patch makes it consistent to use varargs in all DataFrameReader methods, including Parquet, JSON, text, and the generic load function.
Also added a few more API tests for the Java API.
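Illustrative usage after the change (hypothetical paths, using the Spark 1.6-era `sqlContext.read`):
```scala
// Every reader method now accepts one or more paths via varargs.
val parquetDf = sqlContext.read.parquet("/data/events/day1", "/data/events/day2")
val jsonDf    = sqlContext.read.json("/logs/a.json", "/logs/b.json")
val textDf    = sqlContext.read.text("/notes/part1.txt", "/notes/part2.txt")
```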
Author: Reynold Xin <rxin@databricks.com>
Closes #9945 from rxin/SPARK-11967.
This PR provides the two common operations `coalesce` and `repartition` in the Dataset API.
After reading the comments on SPARK-9999, I am unclear about the plan for supporting repartitioning in the Dataset API. Currently, both the RDD and DataFrame APIs give users the flexibility to control the number of partitions.
Most traditional RDBMSs expose the number of partitions, the partitioning columns, and the table partitioning methods to DBAs for performance tuning and storage planning. Normally, these parameters can largely affect query performance. Since the actual performance depends on the workload type, I think it is almost impossible to automate the discovery of the best partitioning strategy for all scenarios.
I am wondering whether the Dataset API plans to hide these knobs from users. Feel free to reject my PR if it does not match the plan.
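Illustrative usage (a sketch assuming an existing Dataset `ds`):
```scala
// repartition shuffles the data into the requested number of partitions;
// coalesce merges existing partitions without a full shuffle.
val spread = ds.repartition(200)
val packed = ds.coalesce(10)
```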
Thank you for your answers. marmbrus rxin cloud-fan
Author: gatorsmile <gatorsmile@gmail.com>
Closes #9899 from gatorsmile/coalesce.
When using remote Hive metastore, `hive.metastore.uris` is set to the metastore URI. However, it overrides `javax.jdo.option.ConnectionURL` unexpectedly, thus the execution Hive client connects to the actual remote Hive metastore instead of the Derby metastore created in the temporary directory. Cleaning this configuration for the execution Hive client fixes this issue.
Author: Cheng Lian <lian@databricks.com>
Closes #9895 from liancheng/spark-11783.clean-remote-metastore-config.
Currently pivot's signature looks like:
```scala
@scala.annotation.varargs
def pivot(pivotColumn: Column, values: Column*): GroupedData

@scala.annotation.varargs
def pivot(pivotColumn: String, values: Any*): GroupedData
```
I think we can remove the one that takes "Column" types, since callers should always be passing in literals. It would also be clearer if the values were not varargs, but a Seq or java.util.List.
I also made similar changes for Python.
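Illustrative usage of the resulting shape of the API (data and column names are hypothetical):
```scala
// Pivot values are passed as an explicit Seq of literals rather than varargs.
df.groupBy("year").pivot("course", Seq("dotNET", "Java")).sum("earnings")
```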
Author: Reynold Xin <rxin@databricks.com>
Closes #9929 from rxin/SPARK-11946.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9909 from cloud-fan/get-struct.
We should pass resolved encoders to the logical `CoGroup` and bind them in the physical `CoGroup`.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9928 from cloud-fan/cogroup.
Currently, `spark-sql` does not flush command history when exiting.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #9563 from adrian-wang/jline.
In Hive 1.2.1, `SessionManager` sets the `operationLog` if the configuration `hive.server2.logging.operation.enabled` is true.
But Spark did not adapt to this change, so regardless of whether the configuration is enabled, the Spark thrift server always logs the warning message.
PS: if `hive.server2.logging.operation.enabled` is false, it should still log the warning message (the same as the Hive thrift server).
Author: huangzhaowei <carlmartinmax@gmail.com>
Closes #9056 from SaintBacchus/SPARK-11043.
Author: Xiu Guo <xguo27@gmail.com>
Closes #9918 from xguo27/SPARK-11897.
Author: Mikhail Bautin <mbautin@gmail.com>
Closes #9308 from mbautin/SPARK-10707.
…flatMapGroups.
Based on feedback from Matei, this is more consistent with mapPartitions in Spark.
Also addresses some of the cleanups from a previous commit that renames the type variables.
Author: Reynold Xin <rxin@databricks.com>
Closes #9919 from rxin/SPARK-11933.
This patch attempts to speed up VersionsSuite by storing fetched Hive JARs in an Ivy cache that persists across test runs. If `SPARK_VERSIONS_SUITE_IVY_PATH` is set, that path will be used for the cache; if it is not set, VersionsSuite will create a temporary Ivy cache which is deleted after the test completes.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #9624 from JoshRosen/SPARK-9866.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9898 from cloud-fan/agg.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9906 from cloud-fan/nullable.
We should use `InternalRow.isNullAt` to check whether a field is null before calling `InternalRow.getXXX`.
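A minimal sketch of the safe pattern (the helper name is illustrative):
```scala
import org.apache.spark.sql.catalyst.InternalRow

// A primitive getter on a null slot can return an arbitrary value,
// so always consult isNullAt first.
def safeGetInt(row: InternalRow, i: Int): Option[Int] =
  if (row.isNullAt(i)) None else Some(row.getInt(i))
```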
Thanks to gatorsmile, who discovered this bug.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9904 from cloud-fan/null.
Can someone review my code to make sure I'm not missing anything? Thanks!
Author: Xiu Guo <xguo27@gmail.com>
Author: Xiu Guo <guoxi@us.ibm.com>
Closes #9612 from xguo27/SPARK-11628.
JIRA: https://issues.apache.org/jira/browse/SPARK-11908
We should add NullType support to RowEncoder.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #9891 from viirya/rowencoder-nulltype.
1. Renamed map to mapGroup, flatMap to flatMapGroup.
2. Renamed asKey -> keyAs.
3. Added more documentation.
4. Changed type parameter T to V on GroupedDataset.
5. Added since versions for all functions.
Author: Reynold Xin <rxin@databricks.com>
Closes #9880 from rxin/SPARK-11899.
Author: Reynold Xin <rxin@databricks.com>
Closes #9882 from rxin/SPARK-11901.
Author: Reynold Xin <rxin@databricks.com>
Closes #9881 from rxin/SPARK-11900.
It seems Scala 2.11 doesn't support defining private methods in `trait xxx` and using them in `object xxx extends xxx`.
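A minimal sketch of the shape of code in question (illustrative names; a simplified repro of the pattern described above):
```scala
trait Helper {
  // A private trait method: the companion-style object below could call it
  // under Scala 2.10, but the same code trips up Scala 2.11.
  private def increment(x: Int): Int = x + 1
}

object Helper extends Helper {
  def run(): Int = increment(41) // the problematic call site
}
```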
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9879 from cloud-fan/follow.
Author: Michael Armbrust <michael@databricks.com>
Closes #9871 from marmbrus/scala211-break.
In this PR I delete a method that breaks type inference for aggregators (only in the REPL).
The error when this method is present is:
```
<console>:38: error: missing parameter type for expanded function ((x$2) => x$2._2)
ds.groupBy(_._1).agg(sum(_._2), sum(_._3)).collect()
```
Author: Michael Armbrust <michael@databricks.com>
Closes #9870 from marmbrus/dataset-repl-agg.
This mainly moves SqlNewHadoopRDD to the sql package. There is some state that is shared between core and sql, and I've left that in core. This allows some other associated minor cleanup.
Author: Nong Li <nong@databricks.com>
Closes #9845 from nongli/spark-11787.
#theScaryParts (i.e. changes to the repl, executor classloaders and codegen)...
Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes #9825 from marmbrus/dataset-replClasses2.
…re-creating the UserDefinedFunction
https://issues.apache.org/jira/browse/SPARK-11716
This is #9739 plus a regression test. When committing it, please make sure the author is jbonofre.
You can find the original PR at https://github.com/apache/spark/pull/9739
closes #9739
Author: Jean-Baptiste Onofré <jbonofre@apache.org>
Author: Yin Huai <yhuai@databricks.com>
Closes #9868 from yhuai/SPARK-11716.
…treat int in seconds.
Hive has since changed this behavior as well. https://issues.apache.org/jira/browse/HIVE-3454
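Illustrative behavior after the change (a sketch; the rendered value depends on the session time zone):
```scala
// An integer cast to timestamp is now interpreted as seconds since the epoch,
// e.g. this yields 1970-01-01 00:00:01 in a UTC session.
sqlContext.sql("SELECT CAST(1 AS TIMESTAMP)").show()
```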
Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>
Author: Yin Huai <yhuai@databricks.com>
Closes #9685 from nongli/spark-11724.
Before this PR, when users tried to get an encoder for an unsupported class, they would only get a very simple error message like `Encoder for type xxx is not supported`.
After this PR, the error message becomes more friendly, for example:
```
No Encoder found for abc.xyz.NonEncodable
- array element class: "abc.xyz.NonEncodable"
- field (class: "scala.Array", name: "arrayField")
- root class: "abc.xyz.AnotherClass"
```
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9810 from cloud-fan/error-message.