path: root/sql
* [SPARK-11810][SQL] Java-based encoder for opaque types in Datasets. (Reynold Xin, 2015-11-18, 4 files, -41/+130)
  This patch refactors the existing Kryo encoder expressions and adds support for Java serialization.
  Author: Reynold Xin <rxin@databricks.com>. Closes #9802 from rxin/SPARK-11810.
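  A minimal usage sketch for these serializer-backed encoders, assuming the `Encoders.kryo`/`Encoders.javaSerialization` factory methods introduced by this patch and SPARK-11802 (the `Point` class and the data are illustrative):
  ```scala
  import org.apache.spark.sql.{Encoder, Encoders}

  // An "opaque" type with no case-class structure that Catalyst could map to a schema.
  class Point(val x: Double, val y: Double) extends Serializable

  // Fall back to generic serialization instead of a Catalyst-derived schema.
  implicit val pointEncoder: Encoder[Point] = Encoders.javaSerialization[Point]
  // or, for Kryo-based serialization:
  // implicit val pointEncoder: Encoder[Point] = Encoders.kryo[Point]

  // With the implicit encoder in scope (and a live SQLContext named sqlContext):
  // val ds = sqlContext.createDataset(Seq(new Point(1.0, 2.0), new Point(3.0, 4.0)))
  ```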
* [SPARK-11544][SQL] sqlContext doesn't use PathFilter (Dilip Biswal, 2015-11-18, 2 files, -7/+54)
  Apply the user-supplied PathFilter while retrieving files from the file system.
  Author: Dilip Biswal <dbiswal@us.ibm.com>. Closes #9652 from dilipbiswal/spark-11544.
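  For reference, a hedged sketch of wiring a custom PathFilter through the Hadoop configuration; the filter class is hypothetical and the configuration key is an assumption based on Hadoop's FileInputFormat convention rather than something stated in this commit:
  ```scala
  import org.apache.hadoop.fs.{Path, PathFilter}

  // Hypothetical filter: skip temporary files when the data source lists input paths.
  class SkipTmpFiles extends PathFilter {
    override def accept(path: Path): Boolean = !path.getName.endsWith("_tmp")
  }

  // Assumed wiring (requires a live SparkContext named sc):
  // sc.hadoopConfiguration.setClass(
  //   "mapreduce.input.pathFilter.class", classOf[SkipTmpFiles], classOf[PathFilter])
  ```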
* [SPARK-11720][SQL][ML] Handle edge cases when count = 0 or 1 for Stats function (JihongMa, 2015-11-18, 7 files, -24/+52)
  Return Double.NaN for mean/average when count == 0, for all numeric types that are converted to Double; the Decimal type continues to return null.
  Author: JihongMa <linlin200605@gmail.com>. Closes #9705 from JihongMA/SPARK-11720.
* [SPARK-11739][SQL] clear the instantiated SQLContext (Davies Liu, 2015-11-18, 3 files, -10/+14)
  Currently, if the first SQLContext is not removed after stopping the SparkContext, that SQLContext could sit there forever. This patch makes this more robust.
  Author: Davies Liu <davies@databricks.com>. Closes #9706 from davies/clear_context.
* [SPARK-11792][SQL][FOLLOW-UP] Change SizeEstimation to KnownSizeEstimation and make estimatedSize return Long instead of Option[Long] (Yin Huai, 2015-11-18, 1 file, -4/+8)
  https://issues.apache.org/jira/browse/SPARK-11792
  The main changes include:
  * Renaming `SizeEstimation` to `KnownSizeEstimation`; hopefully the new name is more informative.
  * Making `estimatedSize` return `Long` instead of `Option[Long]`.
  * In `UnsafeHashedRelation`, `estimatedSize` delegates the work to `SizeEstimator` if we have not created a `BytesToBytesMap`.
  Since we will put `UnsafeHashedRelation` into the `BlockManager`, it is generally good to let it provide a more accurate size estimation. Also, if we do not put `BytesToBytesMap` directly into the `BlockManager`, it is not really necessary to make `BytesToBytesMap` extend `KnownSizeEstimation`.
  Author: Yin Huai <yhuai@databricks.com>. Closes #9813 from yhuai/SPARK-11792-followup.
* [SPARK-11795][SQL] combine grouping attributes into a single NamedExpression (Wenchen Fan, 2015-11-18, 2 files, -5/+9)
  We use `ExpressionEncoder.tuple` to build the result encoder, which assumes the input encoder should point to a struct-type field if it is non-flat. However, our key encoder always points to flat fields (`groupingAttributes`), so we should combine them into a single `NamedExpression`.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9792 from cloud-fan/agg.
* [SPARK-11725][SQL] correctly handle null inputs for UDF (Wenchen Fan, 2015-11-18, 6 files, -1/+121)
  If a user uses primitive parameters in a UDF, there is no way to null-check those inputs inside the UDF, so we assume the primitive inputs are null-propagating in this case and return null if any input is null.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9770 from cloud-fan/udf.
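  A hedged illustration of the behaviour described above, assuming a live `SQLContext` named `sqlContext` and an illustrative table `src` with a nullable integer column `value`:
  ```scala
  import org.apache.spark.sql.SQLContext

  def example(sqlContext: SQLContext): Unit = {
    // A UDF over a primitive Int: the function body cannot null-check `x` itself.
    sqlContext.udf.register("plusOne", (x: Int) => x + 1)
    // With this fix, a NULL `value` yields NULL instead of being silently read as 0.
    sqlContext.sql("SELECT plusOne(value) FROM src").show()
  }
  ```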
* [SPARK-11803][SQL] fix Dataset self-join (Wenchen Fan, 2015-11-18, 2 files, -9/+13)
  When we resolve the join operator, we may change the output of the right side if a self-join is detected. So in `Dataset.joinWith`, we should resolve the join operator first and then get the left output and right output from it, instead of using `left.output` and `right.output` directly.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9806 from cloud-fan/self-join.
* [SPARK-10946][SQL] JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate for DDLs (somideshmukh, 2015-11-18, 2 files, -2/+2)
  New changes in JDBCRDD.
  Author: somideshmukh <somilde@us.ibm.com>. Closes #9733 from somideshmukh/SomilBranch-1.1.
* [SPARK-11792][SQL] SizeEstimator cannot provide a good size estimation of UnsafeHashedRelations (Yin Huai, 2015-11-18, 1 file, -2/+8)
  https://issues.apache.org/jira/browse/SPARK-11792
  Right now, SizeEstimator will "think" a small UnsafeHashedRelation is several GBs.
  Author: Yin Huai <yhuai@databricks.com>. Closes #9788 from yhuai/SPARK-11792.
* [SPARK-11802][SQL] Kryo-based encoder for opaque types in Datasets (Reynold Xin, 2015-11-18, 8 files, -23/+178)
  I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases are attached and filed as SPARK-11803.
  Author: Reynold Xin <rxin@databricks.com>. Closes #9789 from rxin/SPARK-11802.
* [SPARK-11643][SQL] parse year with leading zero (Davies Liu, 2015-11-17, 2 files, -5/+32)
  Support years in the range 0 <= year < 1000.
  Author: Davies Liu <davies@databricks.com>. Closes #9701 from davies/leading_zero.
* [SPARK-11797][SQL] collect, first, and take should use encoders for serialization (Reynold Xin, 2015-11-17, 2 files, -6/+41)
  They were previously using Spark's default serializer for serialization.
  Author: Reynold Xin <rxin@databricks.com>. Closes #9787 from rxin/SPARK-11797.
* [SPARK-11793][SQL] Dataset should set the resolved encoders internally for maps. (Reynold Xin, 2015-11-17, 2 files, -1/+13)
  I also wrote a test case, but unfortunately the test case is not working due to SPARK-11795.
  Author: Reynold Xin <rxin@databricks.com>. Closes #9784 from rxin/SPARK-11503.
* [SPARK-11767][SQL] limit the size of cached batch (Davies Liu, 2015-11-17, 3 files, -4/+16)
  Currently the size of a cached batch is only controlled by `batchSize` (default value 10000), which does not work well with the size of serialized columns (for example, complex types). The memory used to build the batch is not accounted for, so it's easy to OOM (especially after unified memory management). This PR introduces a hard limit of 4M for total columns (up to 50 columns of uncompressed primitive columns). It also changes the way the buffer grows: double it each time, then trim it once finished. cc liancheng
  Author: Davies Liu <davies@databricks.com>. Closes #9760 from davies/cache_limit.
* [SPARK-10186][SQL] support Postgres array type in JDBCRDD (Wenchen Fan, 2015-11-17, 4 files, -69/+129)
  Add ARRAY support to `PostgresDialect`. Nested ARRAY is not allowed for now because it's hard to get the array dimension info. See http://stackoverflow.com/questions/16619113/how-to-get-array-base-type-in-postgres-via-jdbc
  Thanks for the initial work from mariusvniekerk! Close https://github.com/apache/spark/pull/9137
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9662 from cloud-fan/postgre.
* [SPARK-8658][SQL][FOLLOW-UP] AttributeReference's equals method compares all the members (gatorsmile, 2015-11-17, 2 files, -2/+9)
  Based on the comment of cloud-fan in https://github.com/apache/spark/pull/9216, update AttributeReference's hashCode function to include the hash codes of the other attributes: name, nullable, and qualifiers. Here, I am not 100% sure if we should include name in the hashCode calculation, since the original hashCode calculation does not include it. marmbrus cloud-fan Please review if the changes are good.
  Author: gatorsmile <gatorsmile@gmail.com>. Closes #9761 from gatorsmile/hashCodeNamedExpression.
* [SPARK-11089][SQL] Adds option for disabling multi-session in Thrift server (Cheng Lian, 2015-11-17, 3 files, -2/+58)
  This PR adds a new option, `spark.sql.hive.thriftServer.singleSession`, for disabling multi-session support in the Thrift server. Note that this option is added as a Spark configuration (retrieved from `SparkConf`) rather than a Spark SQL configuration (retrieved from `SQLConf`). This is because all SQL configurations are session-ized. Since multi-session support is on by default, no JDBC connection could modify global configurations like the newly added one.
  Author: Cheng Lian <lian@databricks.com>. Closes #9740 from liancheng/spark-11089.single-session-option.
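  Because the flag lives in `SparkConf`, it has to be set when the server starts; a minimal sketch (the property name comes from the commit message, and the launch command in the comment is the standard Thrift-server start script):
  ```scala
  import org.apache.spark.SparkConf

  // Must be set before the Thrift server's SparkContext is created;
  // a session-level SET from a JDBC client cannot change it.
  val conf = new SparkConf()
    .set("spark.sql.hive.thriftServer.singleSession", "true")

  // Equivalently, when launching the server from the command line:
  //   ./sbin/start-thriftserver.sh --conf spark.sql.hive.thriftServer.singleSession=true
  ```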
* [SPARK-11679][SQL] Invoking method "apply(fields: java.util.List[StructField])" in "StructType" gets ClassCastException (mayuanwen, 2015-11-17, 2 files, -1/+15)
  Previously, fields.toArray would cast java.util.List[StructField] into Array[Object], which cannot be cast into Array[StructField]; invoking this method therefore throws "java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.sql.types.StructField;". This patch casts java.util.List[StructField] directly into Array[StructField].
  Author: mayuanwen <mayuanwen@qiyi.com>. Closes #9649 from jackieMaKing/Spark-11679.
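  A hedged sketch of the call path this fixes (field names are illustrative):
  ```scala
  import java.util.Arrays
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  // The Java-friendly overload that previously failed with ClassCastException:
  val schema: StructType = StructType.apply(Arrays.asList(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)))
  ```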
* [SPARK-11191][SQL][FOLLOW-UP] Cleans up unnecessary anonymous HiveFunctionRegistry (Cheng Lian, 2015-11-17, 2 files, -11/+6)
  According to the discussion in PR #9664, the anonymous `HiveFunctionRegistry` in `HiveContext` can be removed now.
  Author: Cheng Lian <lian@databricks.com>. Closes #9737 from liancheng/spark-11191.follow-up.
* [MINOR][SQL] Fix randomly generated ArrayData in RowEncoderSuite (Liang-Chi Hsieh, 2015-11-16, 1 file, -1/+8)
  The randomly generated ArrayData used for the UDT `ExamplePoint` in `RowEncoderSuite` sometimes doesn't have enough elements, in which case the test fails. This patch fixes it.
  Author: Liang-Chi Hsieh <viirya@appier.com>. Closes #9757 from viirya/fix-randomgenerated-udt.
* [SPARK-11447][SQL] change NullType to StringType during binaryComparison between NullType and StringType (Kevin Yu, 2015-11-16, 2 files, -0/+17)
  While executing the PromoteStrings rule, if one side of a binary comparison is StringType and the other side is not, the current code promotes (casts) the StringType to DoubleType, and if the string doesn't contain a number it becomes null. So when doing <=> (NULL-safe equal) with null, it will not filter anything, which caused the problem reported by this JIRA. I propose the changes through this PR; can you review my code changes? This problem only happens for <=>; other operators work fine.
  scala> val filteredDF = df.filter(df("column") > (new Column(Literal(null))))
  filteredDF: org.apache.spark.sql.DataFrame = [column: string]
  scala> filteredDF.show
  +------+
  |column|
  +------+
  +------+
  scala> val filteredDF = df.filter(df("column") === (new Column(Literal(null))))
  filteredDF: org.apache.spark.sql.DataFrame = [column: string]
  scala> filteredDF.show
  +------+
  |column|
  +------+
  +------+
  scala> df.registerTempTable("DF")
  scala> sqlContext.sql("select * from DF where 'column' = NULL")
  res27: org.apache.spark.sql.DataFrame = [column: string]
  scala> res27.show
  +------+
  |column|
  +------+
  +------+
  Author: Kevin Yu <qyu@us.ibm.com>. Closes #9720 from kevinyu98/working_on_spark-11447.
* [SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata and add a test for FIXED_LEN_BYTE_ARRAY (hyukjinkwon, 2015-11-17, 2 files, -27/+15)
  As discussed in https://github.com/apache/spark/pull/9660 and https://github.com/apache/spark/pull/9060, I cleaned up unused imports, added a test for fixed-length byte arrays, and used a common function for writing Parquet metadata. For the fixed-length byte array test, I verified the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools).
  Author: hyukjinkwon <gurwls223@gmail.com>. Closes #9754 from HyukjinKwon/SPARK-11694-followup.
* [SPARK-11768][SPARK-9196][SQL] Support now function in SQL (alias for current_timestamp). (Reynold Xin, 2015-11-16, 2 files, -6/+13)
  This patch adds an alias for current_timestamp (the now function). It also fixes SPARK-9196 by re-enabling the test case for current_timestamp.
  Author: Reynold Xin <rxin@databricks.com>. Closes #9753 from rxin/SPARK-11768.
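  A quick illustration, assuming a live `SQLContext` named `sqlContext`; both expressions should return the same value:
  ```scala
  import org.apache.spark.sql.SQLContext

  def example(sqlContext: SQLContext): Unit = {
    // `now()` is simply an alias for `current_timestamp()` after this patch.
    sqlContext.sql("SELECT now(), current_timestamp()").show()
  }
  ```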
* [SPARK-11625][SQL] add java test for typed aggregate (Wenchen Fan, 2015-11-16, 4 files, -8/+91)
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9591 from cloud-fan/agg-test.
* [SPARK-8658][SQL] AttributeReference's equals method compares all the members (gatorsmile, 2015-11-16, 3 files, -12/+14)
  This fix changes the equals method of AttributeReference to check all of the specified fields for equality.
  Author: gatorsmile <gatorsmile@gmail.com>. Closes #9216 from gatorsmile/namedExpressEqual.
* [SPARK-11553][SQL] Primitive Row accessors should not convert null to default value (Bartlomiej Alberski, 2015-11-16, 3 files, -23/+65)
  Invoking a getter for a type extending AnyVal returned the default value (if the field value was null) instead of throwing an NPE. Please check the comments on the SPARK-11553 issue for more details.
  Author: Bartlomiej Alberski <bartlomiej.alberski@allegrogroup.com>. Closes #9642 from alberskib/bugfix/SPARK-11553.
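  A hedged sketch of the behaviour change, using a locally constructed `Row` (values are illustrative):
  ```scala
  import org.apache.spark.sql.Row

  val row = Row(null, 42)
  row.getInt(1)    // 42
  row.isNullAt(0)  // true: check before using a primitive getter
  // row.getInt(0) // previously returned 0 silently; after this fix it throws a NullPointerException
  ```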
* [SPARK-11390][SQL] Query plan with/without filterPushdown indistinguishable (Zee Chen, 2015-11-16, 3 files, -4/+22)
  Propagate pushed filters to PhysicalRDD in DataSourceStrategy.apply.
  Author: Zee Chen <zeechen@us.ibm.com>. Closes #9679 from zeocio/spark-11390.
* [SPARK-11754][SQL] consolidate `ExpressionEncoder.tuple` and `Encoders.tuple` (Wenchen Fan, 2015-11-16, 3 files, -120/+108)
  These two are very similar, so we can consolidate them into one. Also adds tests for it and fixes a bug.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9729 from cloud-fan/tuple.
* [SPARK-11743][SQL] Add UserDefinedType support to RowEncoder (Liang-Chi Hsieh, 2015-11-16, 4 files, -29/+139)
  JIRA: https://issues.apache.org/jira/browse/SPARK-11743
  RowEncoder doesn't support UserDefinedType now. We should add support for it.
  Author: Liang-Chi Hsieh <viirya@appier.com>. Closes #9712 from viirya/rowencoder-udt.
* [SPARK-11752][SQL] fix timezone problem for DateTimeUtils.getSeconds (Wenchen Fan, 2015-11-16, 2 files, -7/+9)
  Code snippet to reproduce it:
  ```
  TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))
  val t = Timestamp.valueOf("1900-06-11 12:14:50.789")
  val us = fromJavaTimestamp(t)
  assert(getSeconds(us) === t.getSeconds)
  ```
  It would be good to add a regression test for it, but the reproducing code needs to change the default timezone, and even if we change it back, the `lazy val defaultTimeZone` in `DateTimeUtils` is already fixed.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9728 from cloud-fan/seconds.
* [SPARK-11522][SQL] input_file_name() returns "" for external tables (xin Wu, 2015-11-16, 1 file, -2/+91)
  When computing a partition for a non-Parquet relation, `HadoopRDD.compute` is used, but it does not set the thread-local variable `inputFileName` in `NewSqlHadoopRDD` the way `NewSqlHadoopRDD.compute` does. Yet when reading `inputFileName`, `NewSqlHadoopRDD.inputFileName` is expected, which is empty in this case. Setting `inputFileName` in `HadoopRDD.compute` resolves this issue.
  Author: xin Wu <xinwu@us.ibm.com>. Closes #9542 from xwu0226/SPARK-11522.
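  For context, a hedged usage sketch, assuming a live `SQLContext` named `sqlContext` and an existing external table, here called `ext_table`:
  ```scala
  import org.apache.spark.sql.SQLContext

  def example(sqlContext: SQLContext): Unit = {
    // After this fix the file path is populated for external (non-Parquet) tables
    // instead of coming back as an empty string.
    sqlContext.sql("SELECT input_file_name() FROM ext_table").show()
  }
  ```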
* [SPARK-11692][SQL] Support for Parquet logical types, JSON and BSON (embedded types) (hyukjinkwon, 2015-11-16, 2 files, -1/+27)
  Parquet supports some JSON and BSON datatypes. They are represented internally as binary for BSON and string (UTF-8) for JSON. I searched a bit and found that Apache Drill also supports both in this way, [link](https://drill.apache.org/docs/parquet-format/).
  Author: hyukjinkwon <gurwls223@gmail.com>. Author: Hyukjin Kwon <gurwls223@gmail.com>. Closes #9658 from HyukjinKwon/SPARK-11692.
* [SPARK-11044][SQL] Parquet writer version fixed as version1 (hyukjinkwon, 2015-11-16, 2 files, -1/+35)
  https://issues.apache.org/jira/browse/SPARK-11044
  Spark writes Parquet files only with writer version 1, ignoring the writer version given by the user. With this PR, the given writer version is kept if present, and version 1 is used as the default otherwise.
  Author: hyukjinkwon <gurwls223@gmail.com>. Author: HyukjinKwon <gurwls223@gmail.com>. Closes #9060 from HyukjinKwon/SPARK-11044.
* [SPARK-11745][SQL] Enable more JSON parsing options (Reynold Xin, 2015-11-16, 8 files, -106/+276)
  This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:
  * `allowComments` (default `false`): ignores Java/C++ style comments in JSON records
  * `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names
  * `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
  * `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)
  To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options. Also updated documentation to explain these options.
  Scala and Python documentation screenshots: https://cloud.githubusercontent.com/assets/323388/11172965/e3ace6ec-8bc4-11e5-805e-2d78f80d0ed6.png and https://cloud.githubusercontent.com/assets/323388/11172964/e23ed6ee-8bc4-11e5-8216-312f5983acd5.png
  Author: Reynold Xin <rxin@databricks.com>. Closes #9724 from rxin/SPARK-11745.
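  A hedged usage sketch of these reader options, assuming a live `SQLContext` named `sqlContext` (the input path is illustrative):
  ```scala
  import org.apache.spark.sql.{DataFrame, SQLContext}

  def readNonStandardJson(sqlContext: SQLContext): DataFrame = {
    // Options not set here keep the defaults listed above.
    sqlContext.read
      .option("allowComments", "true")            // tolerate // and /* */ comments
      .option("allowUnquotedFieldNames", "true")  // tolerate {field: 1}
      .option("allowNumericLeadingZeros", "true") // tolerate 00012
      .json("/path/to/records.json")
  }
  ```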
* [SPARK-9928][SQL] Removal of LogicalLocalTable (gatorsmile, 2015-11-15, 1 file, -22/+0)
  LogicalLocalTable in ExistingRDD.scala appears to have been replaced by LocalRelation in LocalRelation.scala. Do you know any reason why we still keep this class?
  Author: gatorsmile <gatorsmile@gmail.com>. Closes #9717 from gatorsmile/LogicalLocalTable.
* [SPARK-10181][SQL] Do kerberos login for credentials during hive client initialization (Yu Gao, 2015-11-15, 1 file, -1/+23)
  On driver process start-up, UserGroupInformation.loginUserFromKeytab is called with the principal and keytab passed in, so the static var UserGroupInformation.loginUser is set to that principal with Kerberos credentials saved in its private credential set, and all threads within the driver process are supposed to see and use these login credentials to authenticate with Hive and Hadoop. However, because of IsolatedClientLoader, the UserGroupInformation class is not shared with the Hive metastore clients; it is loaded separately and therefore cannot see the prepared Kerberos login credentials in the main thread. The first proposed fix caused other classloader conflict errors and is not an appropriate solution. This change instead does the Kerberos login during Hive client initialization, which makes credentials ready for the particular Hive client instance. yhuai Please take a look and let me know. If you are not the right person to talk to, could you point me to someone responsible for this?
  Author: Yu Gao <ygao@us.ibm.com>. Author: gaoyu <gaoyu@gaoyu-macbookpro.roam.corp.google.com>. Author: Yu Gao <crystalgaoyu@gmail.com>. Closes #9272 from yolandagao/master.
* [SPARK-11738][SQL] Making ArrayType orderable (Yin Huai, 2015-11-15, 14 files, -94/+335)
  https://issues.apache.org/jira/browse/SPARK-11738
  Author: Yin Huai <yhuai@databricks.com>. Closes #9718 from yhuai/makingArrayOrderable.
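  A hedged sketch of what "orderable" enables at the SQL level, assuming a live `SQLContext` named `sqlContext` and an illustrative table `t` with an array column `arr`:
  ```scala
  import org.apache.spark.sql.SQLContext

  def example(sqlContext: SQLContext): Unit = {
    // With ArrayType orderable, an array column can be used as a sort key.
    sqlContext.sql("SELECT arr FROM t ORDER BY arr").show()
  }
  ```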
* [SPARK-11734][SQL] Rename TungstenProject -> Project, TungstenSort -> Sort (Reynold Xin, 2015-11-15, 15 files, -184/+148)
  I didn't remove the old Sort operator, since we still use it in randomized tests. I moved it into the test module and renamed it ReferenceSort.
  Author: Reynold Xin <rxin@databricks.com>. Closes #9700 from rxin/SPARK-11734.
* [SPARK-11736][SQL] Add monotonically_increasing_id to function registry. (Yin Huai, 2015-11-14, 2 files, -1/+6)
  https://issues.apache.org/jira/browse/SPARK-11736
  Author: Yin Huai <yhuai@databricks.com>. Closes #9703 from yhuai/MonotonicallyIncreasingID.
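  With the expression in the function registry it becomes callable by name from SQL, not only through the DataFrame API; a brief sketch assuming a live `SQLContext` named `sqlContext` and an illustrative table `t`:
  ```scala
  import org.apache.spark.sql.SQLContext

  def example(sqlContext: SQLContext): Unit = {
    sqlContext.sql("SELECT monotonically_increasing_id() AS id, value FROM t").show()
  }
  ```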
* [SPARK-11694][SQL] Parquet logical types are not being tested properly (hyukjinkwon, 2015-11-14, 2 files, -9/+47)
  All the physical types are properly tested in `ParquetIOSuite`, but the logical type mapping is not being tested.
  Author: hyukjinkwon <gurwls223@gmail.com>. Author: Hyukjin Kwon <gurwls223@gmail.com>. Closes #9660 from HyukjinKwon/SPARK-11694.
* [SPARK-7970] Skip closure cleaning for SQL operations (nitin goyal, 2015-11-13, 8 files, -20/+20)
  Also introduces a new Spark-private API in RDD.scala named 'mapPartitionsInternal', which doesn't closure-clean the RDD elements.
  Author: nitin goyal <nitin.goyal@guavus.com>. Author: nitin.goyal <nitin.goyal@guavus.com>. Closes #9253 from nitin2goyal/master.
* [SPARK-11727][SQL] Split ExpressionEncoder into FlatEncoder and ProductEncoder (Wenchen Fan, 2015-11-13, 12 files, -289/+766)
  Also adds more tests for encoders, and fixes bugs that I found:
  * when converting an array to a catalyst array, we can only skip element conversion for native types (e.g. int, long, boolean), not `AtomicType` (String is an AtomicType but we need to convert it)
  * we should also handle scala `BigDecimal` when converting from catalyst `Decimal`
  * complex map types should be supported
  Other issues still under investigation:
  * encoding a java `BigDecimal` and decoding it back seems to lose precision info
  * when encoding a case class defined inside an object, a `ClassNotFound` exception is thrown
  I'll remove unused code in a follow-up PR.
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9693 from cloud-fan/split.
* [SPARK-11654][SQL][FOLLOW-UP] fix some mistakes and clean up (Wenchen Fan, 2015-11-13, 7 files, -15/+17)
  * rename `AppendColumn` to `AppendColumns` to be consistent with the physical plan name
  * clean up stale comments
  * always pass in a resolved encoder to `TypedColumn.withInputType` (test added)
  * enable a mistakenly disabled java test
  Author: Wenchen Fan <wenchen@databricks.com>. Closes #9688 from cloud-fan/follow.
* [SPARK-11678][SQL] Partition discovery should stop at the root path of the table. (Yin Huai, 2015-11-13, 10 files, -51/+235)
  https://issues.apache.org/jira/browse/SPARK-11678
  This PR passes the root paths of a table to the partition discovery logic, so the discovery process stops at those root paths instead of going all the way to the root path of the file system.
  Author: Yin Huai <yhuai@databricks.com>. Closes #9651 from yhuai/SPARK-11678.
* [SPARK-11654][SQL] add reduce to GroupedDataset (Michael Armbrust, 2015-11-12, 15 files, -197/+309)
  This PR adds a new method, `reduce`, to `GroupedDataset`, which allows operations similar to `reduceByKey` on a traditional `PairRDD`.
  ```scala
  val ds = Seq("abc", "xyz", "hello").toDS()
  ds.groupBy(_.length).reduce(_ + _).collect() // not actually commutative :P
  res0: Array(3 -> "abcxyz", 5 -> "hello")
  ```
  While implementing this method and its test cases, several more deficiencies were found in our encoder handling. Specifically, in order to support positional resolution, named resolution, and tuple composition, it is important to keep the unresolved encoder around and to use it when constructing new `Datasets` with the same object type but different output attributes. We now divide the encoder lifecycle into three phases (that mirror the lifecycle of standard expressions) and have checks at various boundaries:
  - Unresolved Encoders: all user-facing encoders (those constructed by implicits, static methods, or tuple composition) are unresolved, meaning they have only `UnresolvedAttributes` for named fields and `BoundReferences` for fields accessed by ordinal.
  - Resolved Encoders: internal to a `[Grouped]Dataset`, the encoder is resolved, meaning all input has been resolved to a specific `AttributeReference`. Any encoders that are placed into a logical plan for use in object construction should be resolved.
  - Bound Encoders: constructed by physical plans, right before the actual conversion from row -> object is performed.
  It is left to future work to add explicit checks for resolution and provide good error messages when resolution fails. We might also consider enforcing the above constraints in the type system (i.e. `fromRow` only exists on a `ResolvedEncoder`), but we should probably wait before spending too much time on this.
  Author: Michael Armbrust <michael@databricks.com>. Author: Wenchen Fan <wenchen@databricks.com>. Closes #9673 from marmbrus/pr/9628.
* [SPARK-11420] Updating Stddev support via Imperative Aggregate (JihongMa, 2015-11-12, 8 files, -112/+49)
  Switched stddev support from DeclarativeAggregate to ImperativeAggregate.
  Author: JihongMa <linlin200605@gmail.com>. Closes #9380 from JihongMA/SPARK-11420.
* [SPARK-10113][SQL] Explicit error message for unsigned Parquet logical types (hyukjinkwon, 2015-11-12, 2 files, -0/+31)
  Parquet supports some unsigned datatypes. However, since Spark does not support unsigned datatypes, it should emit an exception with a clear message rather than one about an illegal datatype.
  Author: hyukjinkwon <gurwls223@gmail.com>. Closes #9646 from HyukjinKwon/SPARK-10113.
* [SPARK-11191][SQL] Looks up temporary function using execution Hive client (Cheng Lian, 2015-11-12, 3 files, -5/+56)
  When looking up Hive temporary functions, we should always use the `SessionState` within the execution Hive client, since temporary functions are registered there.
  Author: Cheng Lian <lian@databricks.com>. Closes #9664 from liancheng/spark-11191.fix-temp-function.
* [SPARK-11673][SQL] Remove the normal Project physical operator (and keep TungstenProject) (Reynold Xin, 2015-11-12, 27 files, -287/+80)
  Also makes full outer join able to produce UnsafeRows.
  Author: Reynold Xin <rxin@databricks.com>. Closes #9643 from rxin/SPARK-11673.