path: root/sql
Commit message | Author | Age | Files | Lines
* [SPARK-11616][SQL] Improve toString for Dataset | Michael Armbrust | 2015-11-10 | 4 files | -13/+47
  Author: Michael Armbrust <michael@databricks.com>
  Closes #9586 from marmbrus/dataset-toString.
* [SPARK-10371][SQL] Implement subexpr elimination for UnsafeProjections | Nong Li | 2015-11-10 | 11 files | -16/+523
  This patch adds the building blocks for code-generating subexpression elimination and implements it end to end for UnsafeProjection. The building blocks can be used to do the same thing for other operators.

  It introduces some utilities to compute common subexpressions. Expressions can be added to this data structure; each expression and its children are recursively matched against existing expressions (ones previously added) and grouped into common groups. This is built using the existing `semanticEquals`. It does not understand things like commutative or associative expressions; that can be done as future work.

  After building this data structure, the codegen process takes advantage of it by:
  1. Generating a helper function in the generated class that computes the common subexpression. This is done for all common subexpressions that have at least two occurrences and a sufficiently complex expression tree.
  2. When generating the apply() function, calling the helper function (if it exists) instead of regenerating the expression tree. Repeated calls to the helper function short-circuit the evaluation logic.

  Author: Nong Li <nong@databricks.com>
  Author: Nong Li <nongli@gmail.com>
  This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com>
  Closes #9480 from nongli/spark-10371.
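  A minimal, generic sketch of the grouping idea (illustrative only; Spark's actual utility and its `semanticEquals` live inside catalyst and operate on real expression trees):

  ```scala
  import scala.collection.mutable

  // Group expressions by semantic equality; groups seen at least twice are
  // candidates for a generated helper function.
  class ExprGroups[E](semanticEquals: (E, E) => Boolean, children: E => Seq[E]) {
    private val groups = mutable.ArrayBuffer.empty[mutable.ArrayBuffer[E]]

    /** Add an expression and, recursively, its children. */
    def addTree(expr: E): Unit = {
      groups.find(g => semanticEquals(g.head, expr)) match {
        case Some(g) => g += expr
        case None    => groups += mutable.ArrayBuffer(expr)
      }
      children(expr).foreach(addTree)
    }

    /** Common subexpressions: those with two or more occurrences. */
    def common: Seq[Seq[E]] = groups.filter(_.size >= 2).map(_.toSeq).toSeq
  }
  ```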
* [SPARK-11590][SQL] use native json_tuple in lateral view | Wenchen Fan | 2015-11-10 | 8 files | -40/+104
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9562 from cloud-fan/json-tuple.
* [SPARK-11578][SQL][FOLLOW-UP] complete the user-facing API for typed aggregation | Wenchen Fan | 2015-11-10 | 4 files | -14/+99
  Currently the user-facing API for typed aggregation has some limitations:
  * the customized typed aggregation must be the first of the aggregation list
  * the customized typed aggregation can only use long as the buffer type
  * the customized typed aggregation can only use a flat type as the result type
  This PR tries to remove these limitations.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9599 from cloud-fan/agg.
* [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s | Yin Huai | 2015-11-10 | 60 files | -2256/+739
  https://issues.apache.org/jira/browse/SPARK-9830

  This PR contains the following main changes:
  * Removing `AggregateExpression1`.
  * Removing the `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
  * Removing the planner rule used to plan `Aggregate`.
  * Linking `MultipleDistinctRewriter` to the analyzer.
  * Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
  * Updating places where we create aggregate expressions. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
  * Changing `val`s in `DeclarativeAggregate`s that touch children of the function to `lazy val`s (when we create an aggregate expression in the DataFrame API, children of an aggregate function can be unresolved).

  Author: Yin Huai <yhuai@databricks.com>
  Closes #9556 from yhuai/removeAgg1.
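  A hedged sketch of the construction pattern named above, against 1.6-era catalyst internals (the package paths and the `Sum`/`Complete` names may differ in other versions):

  ```scala
  import org.apache.spark.sql.catalyst.expressions.Literal
  import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Complete, Sum}

  // AggregateExpression(aggregateFunction, mode, isDistinct)
  val sumFn = Sum(Literal(1L))
  val aggExpr = AggregateExpression(sumFn, Complete, isDistinct = false)
  ```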
* [SPARK-11598] [SQL] enable tests for ShuffledHashOuterJoin | Davies Liu | 2015-11-09 | 1 file | -204/+231
  Author: Davies Liu <davies@databricks.com>
  Closes #9573 from davies/join_condition.
* [SPARK-11599] [SQL] fix NPE when resolving Hive UDF in SQLParser | Davies Liu | 2015-11-09 | 2 files | -9/+34
  The DataFrame APIs that take a SQL expression always use SQLParser, so the HiveFunctionRegistry will be called outside of Hive state, causing an NPE if there is no active SessionState for the current thread (in PySpark).

  cc rxin yhuai

  Author: Davies Liu <davies@databricks.com>
  Closes #9576 from davies/hive_udf.
* [SPARK-11564][SQL] Fix documentation for DataFrame.take/collect | Reynold Xin | 2015-11-09 | 1 file | -4/+4
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9557 from rxin/SPARK-11564-1.
* [SPARK-11578][SQL] User API for Typed Aggregation | Michael Armbrust | 2015-11-09 | 9 files | -42/+360
  This PR adds a new interface for user-defined aggregations that can be used in `DataFrame` and `Dataset` operations to take all of the elements of a group and reduce them to a single value. For example, the following aggregator extracts an `int` from a specific class and adds them up:

  ```scala
  case class Data(i: Int)

  val customSummer = new Aggregator[Data, Int, Int] {
    def prepare(d: Data) = d.i
    def reduce(l: Int, r: Int) = l + r
    def present(r: Int) = r
  }.toColumn()

  val ds: Dataset[Data] = ...
  val aggregated = ds.select(customSummer)
  ```

  By using helper functions, users can make a generic `Aggregator` that works on any input type:

  ```scala
  /** An `Aggregator` that adds up any numeric type returned by the given function. */
  class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable {
    val numeric = implicitly[Numeric[N]]
    override def zero: N = numeric.zero
    override def reduce(b: N, a: I): N = numeric.plus(b, f(a))
    override def present(reduction: N): N = reduction
  }

  def sum[I, N : Numeric : Encoder](f: I => N): TypedColumn[I, N] = new SumOf(f).toColumn
  ```

  These aggregators can then be used alongside other built-in SQL aggregations:

  ```scala
  val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()

  ds.groupBy(_._1)
    .agg(
      sum(_._2),               // The aggregator defined above.
      expr("sum(_2)").as[Int], // A built-in dynamically typed aggregation.
      count("*"))              // A built-in statically typed aggregation.
    .collect()

  res0: ("a", 30, 30, 2L), ("b", 3, 3, 2L), ("c", 1, 1, 1L)
  ```

  The current implementation focuses on integrating this into the typed API, but currently only supports running aggregations that return a single long value, as explained in `TypedAggregateExpression`. This will be improved in a follow-up PR.

  Author: Michael Armbrust <michael@databricks.com>
  Closes #9555 from marmbrus/dataset-useragg.
* [SPARK-9557][SQL] Refactor ParquetFilterSuite and remove old ParquetFilters code | hyukjinkwon | 2015-11-09 | 1 file | -4/+4
  Actually this was resolved by https://github.com/apache/spark/pull/8275, but the JIRA issue is not marked as resolved, since that PR was made for another issue and resolved both. I commented that this is resolved by the PR above; however, I opened this PR as I would like to add a few small corrections. In the previous PR, I refactored the test to collect filters without reducing them; however, that does not properly test the `And` filter (which is never given to the tests). I unintentionally changed this from the original, pre-refactoring behavior. In this PR, I follow the original way of collecting filters by reducing. I would like to close this if this PR is inappropriate and somebody would like to deal with it in a separate related PR.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #9554 from HyukjinKwon/SPARK-9557.
* [SPARK-11564][SQL][FOLLOW-UP] improve java api for GroupedDataset | Wenchen Fan | 2015-11-09 | 4 files | -23/+31
  Created `MapGroupFunction`, `FlatMapGroupFunction`, and `CoGroupFunction`.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9564 from cloud-fan/map.
* [SPARK-11595] [SQL] Fixes ADD JAR when the input path contains a URL scheme | Cheng Lian | 2015-11-09 | 4 files | -11/+18
  Author: Cheng Lian <lian@databricks.com>
  Closes #9569 from liancheng/spark-11595.fix-add-jar.
* [SPARK-9301][SQL] Add collect_set and collect_list aggregate functions | Nick Buroojy | 2015-11-09 | 2 files | -2/+33
  For now these are thin wrappers around the corresponding Hive UDAFs. One limitation in Hive 0.13.0 is that they only support aggregating primitive types.

  I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word functions.

  Do we also want to add these to `functions.py`? This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089

  marmbrus rxin

  Author: Nick Buroojy <nick.buroojy@civitaslearning.com>
  Closes #9526 from nburoojy/nick/udaf-alias.
  (cherry picked from commit a6ee4f989d020420dd08b97abb24802200ff23b2)
  Signed-off-by: Michael Armbrust <michael@databricks.com>
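  A short usage sketch (assuming a HiveContext-backed `sqlContext`, since at this point the functions wrap Hive UDAFs; the data is illustrative):

  ```scala
  import org.apache.spark.sql.functions.{collect_list, collect_set}

  val df = sqlContext.createDataFrame(Seq(("a", 1), ("a", 1), ("a", 2))).toDF("k", "v")

  df.groupBy("k")
    .agg(collect_list("v"), collect_set("v"))
    .show()
  // collect_list keeps duplicates: [1, 1, 2]; collect_set deduplicates: [1, 2]
  ```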
* [SPARK-11453][SQL] append data to partitioned table messes up the result | Wenchen Fan | 2015-11-08 | 3 files | -4/+53
  The reason is that:
  1. For a partitioned Hive table, we move the partition columns after the data columns (e.g. `<a: Int, b: Int>` partitioned by `a` becomes `<b: Int, a: Int>`).
  2. When appending data to a table, we use position to figure out how to match input columns to the table's columns.

  So when we append data to a partitioned table, we match the wrong columns between input and table. A solution is to reorder the input columns before matching by position, like what we did for [`InsertIntoHadoopFsRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L101-L105). A toy version of that reordering is sketched below.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9408 from cloud-fan/append.
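  A minimal sketch of the reordering idea, matching input columns to the table's layout by name before the positional match (all names here are illustrative, not the actual patch):

  ```scala
  // A partitioned Hive table stores partition columns after data columns,
  // so input <a, b> must be reordered to the table layout <b, a>.
  case class Col(name: String)

  def reorderForTable(input: Seq[Col], tableLayout: Seq[String]): Seq[Col] = {
    val byName = input.map(c => c.name -> c).toMap
    tableLayout.map(byName)
  }

  // reorderForTable(Seq(Col("a"), Col("b")), Seq("b", "a")) == Seq(Col("b"), Col("a"))
  ```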
* [SPARK-11564][SQL] Dataset Java API audit | Reynold Xin | 2015-11-08 | 7 files | -96/+147
  A few changes:
  1. Removed fold, since it can be confusing for distributed collections.
  2. Created specific interfaces for each Dataset function (e.g. MapFunction, ReduceFunction, MapPartitionsFunction).
  3. Added more documentation and test cases.

  The other thing I'm considering doing is to have a "collector" interface for FlatMapFunction and MapPartitionsFunction, similar to MapReduce's map function.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #9531 from rxin/SPARK-11564.
* [SPARK-11554][SQL] add map/flatMap to GroupedDataset | Wenchen Fan | 2015-11-08 | 6 files | -37/+70
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9521 from cloud-fan/map.
* [SPARK-11451][SQL] Support single distinct count on multiple columns. | Herman van Hovell | 2015-11-08 | 6 files | -26/+127
  This PR adds support for multiple columns in a single count distinct aggregate to the new aggregation path.

  cc yhuai

  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #9409 from hvanhovell/SPARK-9241-followup.
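  An illustrative query of the kind this enables on the new aggregation path (a sketch; `sqlContext` and the data are assumed):

  ```scala
  // A single DISTINCT aggregate over multiple columns: distinct (a, b) pairs.
  val df = sqlContext.createDataFrame(Seq((1, 1, 1), (1, 1, 2), (1, 2, 3))).toDF("a", "b", "c")
  df.registerTempTable("t")

  sqlContext.sql("SELECT COUNT(DISTINCT a, b) FROM t").show() // 2 distinct (a, b) pairs
  ```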
* [SPARK-11362] [SQL] Use Spark BitSet in BroadcastNestedLoopJoin | Liang-Chi Hsieh | 2015-11-07 | 1 file | -10/+8
  JIRA: https://issues.apache.org/jira/browse/SPARK-11362

  We use scala.collection.mutable.BitSet in BroadcastNestedLoopJoin now. We should use Spark's BitSet.

  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #9316 from viirya/use-spark-bitset.
* [SPARK-9241][SQL] Supporting multiple DISTINCT columns - follow-up | Herman van Hovell | 2015-11-07 | 2 files | -23/+108
  This PR is a follow-up for PR https://github.com/apache/spark/pull/9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path, and adds a multiple-distinct test to the AggregationQuerySuite.

  cc yhuai marmbrus

  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #9541 from hvanhovell/SPARK-9241-followup.
* [SPARK-11546] Thrift server makes too many logs about result schema | navis.ryu | 2015-11-06 | 1 file | -11/+13
  SparkExecuteStatementOperation logs the result schema for each getNextRowSet() call, which is by default every 1000 rows, overwhelming the whole log file.

  Author: navis.ryu <navis@apache.org>
  Closes #9514 from navis/SPARK-11546.
* [SPARK-9241][SQL] Supporting multiple DISTINCT columns (2) - Rewriting Rule | Herman van Hovell | 2015-11-06 | 6 files | -44/+238
  The second PR for SPARK-9241; this adds support for multiple distinct columns to the new aggregation code path.

  This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for some information on this. The advantages over the competing [first PR](https://github.com/apache/spark/pull/9280) are:
  - This can use the faster TungstenAggregate code path.
  - It is impossible to OOM due to an `OpenHashSet` allocating too much memory.

  However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself.

  The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed.

  cc yhuai - Could you also tell me where to add tests for this?

  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #9406 from hvanhovell/SPARK-9241-rewriter.
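  A toy emulation of the Expand-Aggregate-Aggregate shape on a local collection, for `COUNT(DISTINCT a), COUNT(DISTINCT b)` (a sketch only; the actual rule rewrites logical plans, and all names here are illustrative):

  ```scala
  val rows = Seq((1, 10), (1, 20), (2, 20)) // (a, b)

  // Expand: one copy of each row per distinct clause, tagged with a group id,
  // with the columns irrelevant to that clause nulled out.
  val expanded = rows.flatMap { case (a, b) =>
    Seq((1, Some(a), None: Option[Int]), (2, None: Option[Int], Some(b)))
  }

  // Aggregate 1: deduplicate per (gid, values).
  val dedup = expanded.distinct

  // Aggregate 2: count the deduplicated rows per group id.
  val counts = dedup.groupBy(_._1).mapValues(_.size)
  println(counts) // Map(1 -> 2, 2 -> 2): COUNT(DISTINCT a) = 2, COUNT(DISTINCT b) = 2
  ```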
* [SPARK-11269][SQL] Java API support & test cases for Dataset | Wenchen Fan | 2015-11-06 | 8 files | -12/+644
  This simply brings https://github.com/apache/spark/pull/9358 up to date.

  Author: Wenchen Fan <wenchen@databricks.com>
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9528 from rxin/dataset-java.
* [SPARK-11561][SQL] Rename text data source's column name to value. | Reynold Xin | 2015-11-06 | 2 files | -5/+3
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9527 from rxin/SPARK-11561.
* [SPARK-11450] [SQL] Add Unsafe Row processing to Expand | Herman van Hovell | 2015-11-06 | 4 files | -14/+73
  This PR enables the Expand operator to process and produce Unsafe Rows.

  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #9414 from hvanhovell/SPARK-11450.
* [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits | Imran Rashid | 2015-11-06 | 3 files | -8/+10
  https://issues.apache.org/jira/browse/SPARK-10116

  This is really trivial; I just happened to notice it: if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.

  mengxr mkolod

  Author: Imran Rashid <irashid@cloudera.com>
  Closes #8314 from squito/SPARK-10116.
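  One way to get random bits in both halves is to hash the seed bytes twice, salting the second pass with the first result, and pack the two 32-bit hashes into a `long`. A sketch (illustrative code, not necessarily the exact patch):

  ```scala
  import java.nio.ByteBuffer
  import scala.util.hashing.MurmurHash3

  // Randomize both the high and low 32 bits of the hashed seed.
  def hashSeed(seed: Long): Long = {
    val bytes = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(seed).array()
    val lowBits = MurmurHash3.bytesHash(bytes)
    val highBits = MurmurHash3.bytesHash(bytes, lowBits)
    (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
  }
  ```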
* [SPARK-9858][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins (follow-up) | Yin Huai | 2015-11-06 | 4 files | -62/+167
  https://issues.apache.org/jira/browse/SPARK-9858

  This PR is the follow-up work of https://github.com/apache/spark/pull/9276. It addresses JoshRosen's comments.

  Author: Yin Huai <yhuai@databricks.com>
  Closes #9453 from yhuai/numReducer-followUp.
* [SPARK-10978][SQL][FOLLOW-UP] More comprehensive tests for PR #9399 | Cheng Lian | 2015-11-06 | 3 files | -46/+321
  This PR adds test cases that cover various column pruning and filter push-down cases.

  Author: Cheng Lian <lian@databricks.com>
  Closes #9468 from liancheng/spark-10978.follow-up.
* [SPARK-9162] [SQL] Implement code generation for ScalaUDF | Liang-Chi Hsieh | 2015-11-06 | 2 files | -2/+124
  JIRA: https://issues.apache.org/jira/browse/SPARK-9162

  Currently ScalaUDF extends CodegenFallback and doesn't provide a code generation implementation. This patch implements code generation for ScalaUDF.

  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #9270 from viirya/scalaudf-codegen.
* [SPARK-11453][SQL][FOLLOW-UP] remove DecimalLit | Wenchen Fan | 2015-11-06 | 3 files | -29/+35
  A cleanup for https://github.com/apache/spark/pull/9085. `DecimalLit` is very similar to `FloatLit`, so we can keep just one of them. Also added a low-level unit test in `SqlParserSuite`.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9482 from cloud-fan/parser.
* [SPARK-11541][SQL] Break JdbcDialects.scala into multiple files and mark various dialects as private. | Reynold Xin | 2015-11-05 | 9 files | -186/+314
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9511 from rxin/SPARK-11541.
* [SPARK-11528] [SQL] Typed aggregations for Datasets | Michael Armbrust | 2015-11-05 | 4 files | -3/+132
  This PR adds the ability to do typed SQL aggregations. We will likely also want to provide an interface to allow users to do aggregations on objects, but this is deferred to another PR.

  ```scala
  val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
  ds.groupBy(_._1).agg(sum("_2").as[Int]).collect()

  res0: Array(("a", 30), ("b", 3), ("c", 1))
  ```

  Author: Michael Armbrust <michael@databricks.com>
  Closes #9499 from marmbrus/dataset-agg.
* [SPARK-7542][SQL] Support off-heap index/sort buffer | Davies Liu | 2015-11-05 | 1 file | -1/+2
  This brings support for off-heap memory for the array inside BytesToBytesMap and InMemorySorter, so we can allocate all execution memory from off-heap.

  Closes #8068

  Author: Davies Liu <davies@databricks.com>
  Closes #9477 from davies/unsafe_timsort.
* [SPARK-11540][SQL] API audit for QueryExecutionListener. | Reynold Xin | 2015-11-05 | 2 files | -59/+72
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9509 from rxin/SPARK-11540.
* Revert "[SPARK-11469][SQL] Allow users to define nondeterministic udfs."Reynold Xin2015-11-055-215/+78
| | | | This reverts commit 9cf56c96b7d02a14175d40b336da14c2e1c88339.
* [SPARK-11537] [SQL] fix negative hours/minutes/seconds | Davies Liu | 2015-11-05 | 2 files | -8/+28
  Currently, if the Timestamp is before the epoch (1970-01-01), the hours, minutes and seconds will be negative (and also rounded up).

  Author: Davies Liu <davies@databricks.com>
  Closes #9502 from davies/neg_hour.
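  The usual fix for this class of bug is floored division and modulo, so pre-epoch timestamps still map into the valid range. A sketch (illustrative, not the actual patch):

  ```scala
  // Microseconds since epoch -> hour of day, using floored math so
  // timestamps before 1970-01-01 don't produce negative hours.
  val MicrosPerSecond = 1000000L
  val SecondsPerDay = 24L * 60 * 60

  def hourOf(microsSinceEpoch: Long): Int = {
    val seconds = Math.floorDiv(microsSinceEpoch, MicrosPerSecond)
    (Math.floorMod(seconds, SecondsPerDay) / 3600).toInt
  }

  // hourOf(-1L * MicrosPerSecond) == 23  (one second before the epoch)
  ```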
* [SPARK-11536][SQL] Remove the internal implicit conversion from Expression to Column in functions.scala | Reynold Xin | 2015-11-05 | 1 file | -281/+299
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9505 from rxin/SPARK-11536.
* [SPARK-10656][SQL] completely support special chars in DataFrame | Wenchen Fan | 2015-11-05 | 2 files | -6/+16
  The main problem is that we interpret column names with special handling of `.` for DataFrame. This enables us to write something like `df("a.b")` to get the field `b` of `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`. In these two cases, the column name is already the final name, and we don't need an extra pass to interpret it. The solution is simple: use `queryExecution.analyzed.output` to get resolved columns directly, instead of using `DataFrame.resolve`.

  close https://github.com/apache/spark/pull/8811

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9462 from cloud-fan/special-chars.
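  A sketch of the distinction (assuming a 1.6-era `sqlContext` with implicits imported; data is illustrative):

  ```scala
  import sqlContext.implicits._
  import org.apache.spark.sql.functions.struct

  // In apply(), "a.y" is interpreted structurally: field y of struct column a.
  val nested = Seq((1, 2)).toDF("x", "y").select(struct($"x", $"y").as("a"))
  nested.select(nested("a.y"))

  // In withColumnRenamed, the name is final: a literal "a.b" column can be
  // renamed without any dot interpretation.
  val flat = Seq((1, 2)).toDF("a.b", "c")
  flat.withColumnRenamed("a.b", "ab")
  ```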
* [SPARK-11532][SQL] Remove implicit conversion from Expression to Column | Reynold Xin | 2015-11-05 | 1 file | -52/+66
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9500 from rxin/SPARK-11532.
* [SPARK-10648] Oracle dialect to handle nonspecific numeric types | Travis Hegner | 2015-11-05 | 1 file | -0/+25
  This is the alternative, agreed-upon solution to PR #8780: creating an OracleDialect to handle the nonspecific numeric types that can be defined in Oracle.

  Author: Travis Hegner <thegner@trilliumit.com>
  Closes #9495 from travishegner/OracleDialect.
* [SPARK-11513][SQL] Remove implicit conversion from LogicalPlan to DataFrame | Reynold Xin | 2015-11-05 | 2 files | -50/+78
  This internal implicit conversion has been a source of confusion for a lot of new developers.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #9479 from rxin/SPARK-11513.
* [SPARK-11474][SQL] change fetchSize to fetchsize | Huaxin Gao | 2015-11-05 | 1 file | -1/+2
  In DefaultDataSource.scala there is:

  ```scala
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation
  ```

  where `parameters` is a CaseInsensitiveMap. After the line

  ```scala
  parameters.foreach(kv => properties.setProperty(kv._1, kv._2))
  ```

  `properties` holds all lower-case key/value pairs, so fetchSize becomes fetchsize. However, the compute method in JDBCRDD has

  ```scala
  val fetchSize = properties.getProperty("fetchSize", "0").toInt
  ```

  so the fetchSize value is always 0 and never gets set correctly.

  Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
  Closes #9473 from huaxingao/spark-11474.
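  Per the title, the fix is to look the property up under the lower-cased key. A self-contained sketch of the behavior:

  ```scala
  import java.util.Properties

  val properties = new Properties()
  properties.setProperty("fetchsize", "100") // keys were lower-cased upstream

  // Looking up "fetchSize" would miss and fall back to "0"; the lower-cased
  // key finds the stored value.
  val fetchSize = properties.getProperty("fetchsize", "0").toInt // 100, not 0
  ```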
* [MINOR][SQL] A minor log line fix | Cheng Lian | 2015-11-05 | 1 file | -1/+2
  `jars` in the log line is an array, so `$jars` doesn't print its content.

  Author: Cheng Lian <lian@databricks.com>
  Closes #9494 from liancheng/minor.log-fix.
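  The usual fix, as a sketch (the log message text here is illustrative):

  ```scala
  val jars = Array("a.jar", "b.jar")

  // Interpolating the array directly prints its toString, e.g.
  // "[Ljava.lang.String;@6d06d69c". mkString renders the elements instead.
  println(s"Added JARs: ${jars.mkString(", ")}") // Added JARs: a.jar, b.jar
  ```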
* [SPARK-11440][CORE][STREAMING][BUILD] Declare rest of @Experimental items non-experimental if they've existed since 1.2.0 | Sean Owen | 2015-11-05 | 1 file | -2/+0
  Remove `Experimental` annotations in core and streaming for items that existed in 1.2.0 or before. The changes are:

  * SparkContext
    * binary{Files,Records}: 1.2.0
    * submitJob: 1.0.0
  * JavaSparkContext
    * binary{Files,Records}: 1.2.0
  * DoubleRDDFunctions, JavaDoubleRDD
    * {mean,sum}Approx: 1.0.0
  * PairRDDFunctions, JavaPairRDD
    * sampleByKeyExact: 1.2.0
    * countByKeyApprox: 1.0.0
  * PairRDDFunctions
    * countApproxDistinctByKey: 1.1.0
  * RDD
    * countApprox, countByValueApprox, countApproxDistinct: 1.0.0
  * JavaRDDLike
    * countApprox: 1.0.0
  * PythonHadoopUtil.Converter: 1.1.0
  * PortableDataStream: 1.2.0 (related to binaryFiles)
  * BoundedDouble: 1.0.0
  * PartialResult: 1.0.0
  * StreamingContext, JavaStreamingContext
    * binaryRecordsStream: 1.2.0
  * HiveContext
    * analyze: 1.2.0

  Author: Sean Owen <sowen@cloudera.com>
  Closes #9396 from srowen/SPARK-11440.
* [SPARK-11425] [SPARK-11486] Improve hybrid aggregation | Davies Liu | 2015-11-04 | 5 files | -184/+95
  After aggregation, the dataset could be smaller than the inputs, so it's better to do hash-based aggregation for all inputs, then use sort-based aggregation to merge them.

  Author: Davies Liu <davies@databricks.com>
  Closes #9383 from davies/fix_switch.
* [SPARK-11398] [SQL] unnecessary def dialectClassName in HiveContext, and misleading dialect conf at the start of spark-sql | Zhenhua Wang | 2015-11-04 | 3 files | -7/+12
  1. `def dialectClassName` in HiveContext is unnecessary. In HiveContext, if `conf.dialect == "hiveql"`, getSQLDialect() returns `new HiveQLDialect(this)`; otherwise it uses super.getSQLDialect(). super.getSQLDialect() calls dialectClassName, which is overridden in HiveContext but still returns super.dialectClassName, so we never reach the `classOf[HiveQLDialect].getCanonicalName` branch of dialectClassName in HiveContext.

  2. When we start bin/spark-sql, the default context is HiveContext, and the corresponding dialect is hiveql. However, if we type "set spark.sql.dialect;", the result is "sql", which is inconsistent with the actual dialect and is misleading. For example, we can use SQL like "create table", which is only allowed in hiveql, yet this dialect conf says it's "sql". Although this problem does not cause any execution error, it misleads Spark SQL users, so I think we should fix it. In this PR, when processing "set spark.sql.dialect" in SetCommand, I use `conf.dialect` instead of `getConf()` for the case key == SQLConf.DIALECT.key, so that it returns the right dialect conf.

  Author: Zhenhua Wang <wangzhenhua@huawei.com>
  Closes #9349 from wzhfy/dialect.
* [SPARK-11510][SQL] Remove SQL aggregation tests for higher order statistics | Reynold Xin | 2015-11-04 | 3 files | -147/+28
  We have some aggregate function tests in both DataFrameAggregateSuite and SQLQuerySuite. The two have almost the same coverage and we should just remove the SQL one.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #9475 from rxin/SPARK-11510.
* [SPARK-11505][SQL] Break aggregate functions into multiple files | Reynold Xin | 2015-11-04 | 16 files | -949/+1219
  functions.scala was getting pretty long, so I broke it into multiple files. I also added explicit data types for some public vals, and renamed aggregate function pretty names to lower case, which is more consistent with the rest of the functions.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #9471 from rxin/SPARK-11505.
* [SPARK-11504][SQL] API audit for distributeBy and localSort | Reynold Xin | 2015-11-04 | 3 files | -83/+113
  1. Renamed localSort -> sortWithinPartitions to avoid ambiguity in "local".
  2. Renamed distributeBy -> repartition to match the existing repartition.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #9470 from rxin/SPARK-11504.
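  Usage of the renamed APIs, as a sketch (a DataFrame `df` with a `key` column is assumed):

  ```scala
  import org.apache.spark.sql.functions.col

  // Hash-partition by key, then sort rows within each partition; no global sort.
  val partitionedSorted = df
    .repartition(col("key"))
    .sortWithinPartitions(col("key"))
  ```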
* [SPARK-10304][SQL] Following up checking valid dir structure for partition discovery | Liang-Chi Hsieh | 2015-11-04 | 2 files | -1/+29
  This patch follows up #8840.

  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #9459 from viirya/detect_invalid_part_dir_following.
* [SPARK-11490][SQL] variance should alias var_samp instead of var_pop. | Reynold Xin | 2015-11-04 | 11 files | -114/+32
  stddev is an alias for stddev_samp, so variance should be consistent with stddev. Also took the chance to remove the internal Stddev and Variance, keeping only StddevSamp/StddevPop and VarianceSamp/VariancePop.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #9449 from rxin/SPARK-11490.
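  After this change the aliases line up as follows, as a sketch (a DataFrame `df` with a numeric column `x` is assumed):

  ```scala
  import org.apache.spark.sql.functions.{col, var_pop, var_samp, variance}

  // variance now aliases the sample variance, matching stddev -> stddev_samp;
  // var_pop remains available for the population variance.
  df.agg(variance(col("x")), var_samp(col("x")), var_pop(col("x")))
  ```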