path: root/sql
* [SPARK-11561][SQL] Rename text data source's column name to value.
  Reynold Xin, 2015-11-06 (2 files changed, -5/+3)
  Author: Reynold Xin <rxin@databricks.com> Closes #9527 from rxin/SPARK-11561.
* [SPARK-11450] [SQL] Add Unsafe Row processing to Expand
  Herman van Hovell, 2015-11-06 (4 files changed, -14/+73)
  This PR enables the Expand operator to process and produce Unsafe Rows.
  Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9414 from hvanhovell/SPARK-11450.
* [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits
  Imran Rashid, 2015-11-06 (3 files changed, -8/+10)
  https://issues.apache.org/jira/browse/SPARK-10116 This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod
  Author: Imran Rashid <irashid@cloudera.com> Closes #8314 from squito/SPARK-10116.
* [SPARK-9858][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins (follow-up)
  Yin Huai, 2015-11-06 (4 files changed, -62/+167)
  https://issues.apache.org/jira/browse/SPARK-9858 This PR is the follow-up work of https://github.com/apache/spark/pull/9276. It addresses JoshRosen's comments.
  Author: Yin Huai <yhuai@databricks.com> Closes #9453 from yhuai/numReducer-followUp.
* [SPARK-10978][SQL][FOLLOW-UP] More comprehensive tests for PR #9399
  Cheng Lian, 2015-11-06 (3 files changed, -46/+321)
  This PR adds test cases that test various column pruning and filter push-down cases.
  Author: Cheng Lian <lian@databricks.com> Closes #9468 from liancheng/spark-10978.follow-up.
* [SPARK-9162] [SQL] Implement code generation for ScalaUDF
  Liang-Chi Hsieh, 2015-11-06 (2 files changed, -2/+124)
  JIRA: https://issues.apache.org/jira/browse/SPARK-9162 Currently ScalaUDF extends CodegenFallback and doesn't provide a code generation implementation. This patch implements code generation for ScalaUDF.
  Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9270 from viirya/scalaudf-codegen.
* [SPARK-11453][SQL][FOLLOW-UP] remove DecimalLit
  Wenchen Fan, 2015-11-06 (3 files changed, -29/+35)
  A cleanup for https://github.com/apache/spark/pull/9085. The `DecimalLit` is very similar to `FloatLit`, so we can just keep one of them. Also added a low-level unit test in `SqlParserSuite`.
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9482 from cloud-fan/parser.
* [SPARK-11541][SQL] Break JdbcDialects.scala into multiple files and mark various dialects as private.
  Reynold Xin, 2015-11-05 (9 files changed, -186/+314)
  Author: Reynold Xin <rxin@databricks.com> Closes #9511 from rxin/SPARK-11541.
* [SPARK-11528] [SQL] Typed aggregations for Datasets
  Michael Armbrust, 2015-11-05 (4 files changed, -3/+132)
  This PR adds the ability to do typed SQL aggregations. We will likely also want to provide an interface to allow users to do aggregations on objects, but this is deferred to another PR.
```scala
val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
ds.groupBy(_._1).agg(sum("_2").as[Int]).collect()

res0: Array(("a", 30), ("b", 3), ("c", 1))
```
  Author: Michael Armbrust <michael@databricks.com> Closes #9499 from marmbrus/dataset-agg.
* [SPARK-7542][SQL] Support off-heap index/sort buffer
  Davies Liu, 2015-11-05 (1 file changed, -1/+2)
  This brings support for off-heap memory for the array inside BytesToBytesMap and InMemorySorter, so that all execution memory can be allocated off-heap. Closes #8068
  Author: Davies Liu <davies@databricks.com> Closes #9477 from davies/unsafe_timsort.
* [SPARK-11540][SQL] API audit for QueryExecutionListener.
  Reynold Xin, 2015-11-05 (2 files changed, -59/+72)
  Author: Reynold Xin <rxin@databricks.com> Closes #9509 from rxin/SPARK-11540.
* Revert "[SPARK-11469][SQL] Allow users to define nondeterministic udfs."Reynold Xin2015-11-055-215/+78
| | | | This reverts commit 9cf56c96b7d02a14175d40b336da14c2e1c88339.
* [SPARK-11537] [SQL] fix negative hours/minutes/seconds
  Davies Liu, 2015-11-05 (2 files changed, -8/+28)
  Currently, if the Timestamp is before the epoch (1970/01/01), the hours, minutes and seconds will be negative (and also rounded up).
  Author: Davies Liu <davies@databricks.com> Closes #9502 from davies/neg_hour.
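  A minimal sketch of the behavior this fixes, using a pre-epoch timestamp; the data and the SQLContext name are assumptions. After the fix, the component extractors should return the actual clock fields rather than negative, rounded-up values:
```scala
import java.sql.Timestamp
import org.apache.spark.sql.functions.{hour, minute, second}
import sqlContext.implicits._  // assumes an existing SQLContext named `sqlContext`

val df = Seq(Tuple1(Timestamp.valueOf("1969-12-31 15:45:30"))).toDF("ts")
// Expected after the fix: 15, 45, 30 (not negative values).
df.select(hour($"ts"), minute($"ts"), second($"ts")).show()
```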
* [SPARK-11536][SQL] Remove the internal implicit conversion from Expression to Column in functions.scala
  Reynold Xin, 2015-11-05 (1 file changed, -281/+299)
  Author: Reynold Xin <rxin@databricks.com> Closes #9505 from rxin/SPARK-11536.
* [SPARK-10656][SQL] completely support special chars in DataFrame
  Wenchen Fan, 2015-11-05 (2 files changed, -6/+16)
  The main problem is that we interpret column names with special handling of `.` for DataFrame. This enables us to write something like `df("a.b")` to get the field `b` of `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`. In these two cases, the column name is already the final name, so we don't need extra processing to interpret it. The solution is simple: use `queryExecution.analyzed.output` to get the resolved columns directly, instead of using `DataFrame.resolve`. close https://github.com/apache/spark/pull/8811
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9462 from cloud-fan/special-chars.
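  A small illustration of the two cases called out above; the column names and data are assumptions, and the sqlContext implicits are assumed to be in scope:
```scala
import sqlContext.implicits._  // assumes an existing SQLContext named `sqlContext`

val df = Seq((1, 2)).toDF("a.b", "c")
// `withColumnRenamed` now treats "a.b" as a plain column name rather than a nested reference.
df.withColumnRenamed("a.b", "ab").printSchema()
// `apply("*")` likewise resolves directly against the analyzed output.
df.select(df("*")).show()
```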
* [SPARK-11532][SQL] Remove implicit conversion from Expression to Column
  Reynold Xin, 2015-11-05 (1 file changed, -52/+66)
  Author: Reynold Xin <rxin@databricks.com> Closes #9500 from rxin/SPARK-11532.
* [SPARK-10648] Oracle dialect to handle nonspecific numeric types
  Travis Hegner, 2015-11-05 (1 file changed, -0/+25)
  This is the alternative/agreed-upon solution to PR #8780. Creates an OracleDialect to handle the nonspecific numeric types that can be defined in Oracle.
  Author: Travis Hegner <thegner@trilliumit.com> Closes #9495 from travishegner/OracleDialect.
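  For context, an illustrative sketch of how a custom JDBC dialect can map such types; the mapping rule below (NUMERIC with no reported size mapped to a fixed DecimalType) is an assumption for illustration, not necessarily the PR's exact logic:
```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

// Hypothetical dialect: treat Oracle NUMBER columns reported as NUMERIC with no
// precision information as a wide decimal instead of failing type resolution.
object SketchOracleDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (sqlType == Types.NUMERIC && size == 0) Some(DecimalType(38, 10)) else None
}

JdbcDialects.registerDialect(SketchOracleDialect)
```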
* [SPARK-11513][SQL] Remove implicit conversion from LogicalPlan to DataFrame
  Reynold Xin, 2015-11-05 (2 files changed, -50/+78)
  This internal implicit conversion has been a source of confusion for a lot of new developers.
  Author: Reynold Xin <rxin@databricks.com> Closes #9479 from rxin/SPARK-11513.
* [SPARK-11474][SQL] change fetchSize to fetchsize
  Huaxin Gao, 2015-11-05 (1 file changed, -1/+2)
  In DefaultDataSource.scala, createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation receives `parameters` as a CaseInsensitiveMap. After the line parameters.foreach(kv => properties.setProperty(kv._1, kv._2)), `properties` holds all lower-case key/value pairs, so fetchSize becomes fetchsize. However, the compute method in JDBCRDD reads val fetchSize = properties.getProperty("fetchSize", "0").toInt, so the fetchSize value is always 0 and never gets set correctly.
  Author: Huaxin Gao <huaxing@oc0558782468.ibm.com> Closes #9473 from huaxingao/spark-11474.
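  With the key read in lower case, a JDBC read like the sketch below should actually propagate the configured fetch size to the driver; the URL and table name are placeholders:
```scala
// Placeholder connection details; only the option key is the point of the example.
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.my_table")
  .option("fetchsize", "1000")  // now picked up by JDBCRDD instead of silently defaulting to 0
  .load()
```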
* [MINOR][SQL] A minor log line fix
  Cheng Lian, 2015-11-05 (1 file changed, -1/+2)
  `jars` in the log line is an array, so `$jars` doesn't print its content.
  Author: Cheng Lian <lian@databricks.com> Closes #9494 from liancheng/minor.log-fix.
* [SPARK-11440][CORE][STREAMING][BUILD] Declare rest of @Experimental items non-experimental if they've existed since 1.2.0
  Sean Owen, 2015-11-05 (1 file changed, -2/+0)
  Remove `Experimental` annotations in core, streaming for items that existed in 1.2.0 or before. The changes are:
  * SparkContext
    * binary{Files,Records} : 1.2.0
    * submitJob : 1.0.0
  * JavaSparkContext
    * binary{Files,Records} : 1.2.0
  * DoubleRDDFunctions, JavaDoubleRDD
    * {mean,sum}Approx : 1.0.0
  * PairRDDFunctions, JavaPairRDD
    * sampleByKeyExact : 1.2.0
    * countByKeyApprox : 1.0.0
  * PairRDDFunctions
    * countApproxDistinctByKey : 1.1.0
  * RDD
    * countApprox, countByValueApprox, countApproxDistinct : 1.0.0
  * JavaRDDLike
    * countApprox : 1.0.0
  * PythonHadoopUtil.Converter : 1.1.0
  * PortableDataStream : 1.2.0 (related to binaryFiles)
  * BoundedDouble : 1.0.0
  * PartialResult : 1.0.0
  * StreamingContext, JavaStreamingContext
    * binaryRecordsStream : 1.2.0
  * HiveContext
    * analyze : 1.2.0
  Author: Sean Owen <sowen@cloudera.com> Closes #9396 from srowen/SPARK-11440.
* [SPARK-11425] [SPARK-11486] Improve hybrid aggregation
  Davies Liu, 2015-11-04 (5 files changed, -184/+95)
  After aggregation, the dataset could be smaller than the inputs, so it's better to do hash-based aggregation for all inputs, then use sort-based aggregation to merge them.
  Author: Davies Liu <davies@databricks.com> Closes #9383 from davies/fix_switch.
* [SPARK-11398] [SQL] unnecessary def dialectClassName in HiveContext, and misleading dialect conf at the start of spark-sql
  Zhenhua Wang, 2015-11-04 (3 files changed, -7/+12)
  1. def dialectClassName in HiveContext is unnecessary. In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new HiveQLDialect(this); otherwise it will use super.getSQLDialect(). Then super.getSQLDialect() calls dialectClassName, which is overridden in HiveContext and still returns super.dialectClassName. So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of def dialectClassName in HiveContext.
  2. When we start bin/spark-sql, the default context is HiveContext, and the corresponding dialect is hiveql. However, if we type "set spark.sql.dialect;", the result is "sql", which is inconsistent with the actual dialect and is misleading. For example, we can use SQL like "create table", which is only allowed in hiveql, but this dialect conf shows "sql". Although this problem will not cause any execution error, it's misleading to Spark SQL users, so I think we should fix it. In this PR, while processing "set spark.sql.dialect" in SetCommand, I use "conf.dialect" instead of "getConf()" for the case of key == SQLConf.DIALECT.key, so that it returns the right dialect conf.
  Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #9349 from wzhfy/dialect.
* [SPARK-11510][SQL] Remove SQL aggregation tests for higher order statistics
  Reynold Xin, 2015-11-04 (3 files changed, -147/+28)
  We have some aggregate function tests in both DataFrameAggregateSuite and SQLQuerySuite. The two have almost the same coverage and we should just remove the SQL one.
  Author: Reynold Xin <rxin@databricks.com> Closes #9475 from rxin/SPARK-11510.
* [SPARK-11505][SQL] Break aggregate functions into multiple files
  Reynold Xin, 2015-11-04 (16 files changed, -949/+1219)
  functions.scala was getting pretty long, so I broke it into multiple files. I also added explicit data types for some public vals, and renamed aggregate function pretty names to lower case, which is more consistent with the rest of the functions.
  Author: Reynold Xin <rxin@databricks.com> Closes #9471 from rxin/SPARK-11505.
* [SPARK-11504][SQL] API audit for distributeBy and localSort
  Reynold Xin, 2015-11-04 (3 files changed, -83/+113)
  1. Renamed localSort -> sortWithinPartitions to avoid ambiguity in "local".
  2. Renamed distributeBy -> repartition to match the existing repartition.
  Author: Reynold Xin <rxin@databricks.com> Closes #9470 from rxin/SPARK-11504.
* [SPARK-10304][SQL] Following up checking valid dir structure for partition discovery
  Liang-Chi Hsieh, 2015-11-04 (2 files changed, -1/+29)
  This patch follows up #8840.
  Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9459 from viirya/detect_invalid_part_dir_following.
* [SPARK-11490][SQL] variance should alias var_samp instead of var_pop.
  Reynold Xin, 2015-11-04 (11 files changed, -114/+32)
  stddev is an alias for stddev_samp, so variance should be consistent with stddev. Also took the chance to remove the internal Stddev and Variance, and only kept StddevSamp/StddevPop and VarianceSamp/VariancePop.
  Author: Reynold Xin <rxin@databricks.com> Closes #9449 from rxin/SPARK-11490.
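  To make the alias relationship concrete, a quick sketch using the DataFrame aggregate functions; the sample data and SQLContext name are assumptions:
```scala
import org.apache.spark.sql.functions.{variance, var_samp, var_pop, stddev, stddev_samp}
import sqlContext.implicits._  // assumes an existing SQLContext named `sqlContext`

val df = Seq(1.0, 2.0, 3.0, 4.0).map(Tuple1(_)).toDF("x")
// variance now matches var_samp (sample variance), while var_pop stays the population variance,
// mirroring stddev / stddev_samp / stddev_pop.
df.agg(variance($"x"), var_samp($"x"), var_pop($"x"), stddev($"x"), stddev_samp($"x")).show()
```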
* [SPARK-11485][SQL] Make DataFrameHolder and DatasetHolder public.
  Reynold Xin, 2015-11-04 (3 files changed, -4/+18)
  These two classes should be public, since they are used in public code.
  Author: Reynold Xin <rxin@databricks.com> Closes #9445 from rxin/SPARK-11485.
* [SPARK-11455][SQL] fix case sensitivity of partition by
  Wenchen Fan, 2015-11-03 (4 files changed, -11/+39)
  Depend on `caseSensitive` for the column name equality check, instead of plain `==`.
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9410 from cloud-fan/partition.
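  A hedged illustration of the intended behavior under the default case-insensitive resolution; the data, column names, and output path are placeholders:
```scala
import sqlContext.implicits._  // assumes an existing SQLContext named `sqlContext`

val df = Seq((1, "a"), (2, "b")).toDF("id", "part")
// With case-insensitive analysis, "PART" should resolve to the existing column `part`
// rather than fail with a missing-column error.
df.write.partitionBy("PART").parquet("/tmp/spark-11455-example")
```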
* [SPARK-11329] [SQL] Cleanup from spark-11329 fix.
  Nong, 2015-11-03 (4 files changed, -52/+55)
  Author: Nong <nong@cloudera.com> Closes #9442 from nongli/spark-11483.
* [SPARK-11489][SQL] Only include common first order statistics in GroupedData
  Reynold Xin, 2015-11-03 (2 files changed, -119/+28)
  We added a bunch of higher order statistics such as skewness and kurtosis to GroupedData. I don't think they are common enough to justify being listed, since users can always use the normal statistics aggregate functions. That is to say, after this change, we won't support
```scala
df.groupBy("key").kurtosis("colA", "colB")
```
  However, we will still support
```scala
df.groupBy("key").agg(kurtosis(col("colA")), kurtosis(col("colB")))
```
  Author: Reynold Xin <rxin@databricks.com> Closes #9446 from rxin/SPARK-11489.
* [SPARK-11477] [SQL] support create Dataset from RDD
  Wenchen Fan, 2015-11-04 (3 files changed, -0/+20)
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9434 from cloud-fan/rdd2ds and squashes the following commits: 0892d72 [Wenchen Fan] support create Dataset from RDD
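  A brief sketch of what this enables, assuming the change is exposed through the SQL implicits in the same holder-based style as the existing Seq-to-Dataset conversion; the entry point and names are assumptions:
```scala
import sqlContext.implicits._  // assumes an existing SQLContext named `sqlContext`

// Assumes an existing SparkContext named `sc`; convert an RDD into a typed Dataset
// the same way a local Seq can be.
val rdd = sc.parallelize(Seq(1, 2, 3))
val ds: org.apache.spark.sql.Dataset[Int] = rdd.toDS()
ds.collect()
```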
* [SPARK-11467][SQL] add Python API for stddev/variance
  Davies Liu, 2015-11-03 (1 file changed, -67/+0)
  Add Python API for stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis.
  Author: Davies Liu <davies@databricks.com> Closes #9424 from davies/py_var.
* [SPARK-10978][SQL] Allow data sources to eliminate filters
  Cheng Lian, 2015-11-03 (6 files changed, -68/+315)
  This PR adds a new method `unhandledFilters` to `BaseRelation`. Data sources which implement this method properly may avoid the overhead of defensive filtering done by Spark SQL.
  Author: Cheng Lian <lian@databricks.com> Closes #9399 from liancheng/spark-10978.unhandled-filters.
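  A minimal sketch (not the PR's code) of how a relation might use the new hook: it reports back only the filters it cannot evaluate itself, so Spark SQL skips re-applying the ones it claims to handle. The relation and its EqualTo-only rule are hypothetical:
```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation that pushes down equality filters and tells Spark SQL
// it handles them fully; everything else is returned as "unhandled".
class KeyOnlyRelation(override val sqlContext: SQLContext) extends BaseRelation {
  override def schema: StructType = StructType(StructField("key", StringType) :: Nil)

  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])
}
```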
* [SPARK-10304] [SQL] Partition discovery should throw an exception if the dir structure is invalid
  Liang-Chi Hsieh, 2015-11-03 (2 files changed, -13/+59)
  JIRA: https://issues.apache.org/jira/browse/SPARK-10304 This patch detects if the structure of partition directories is not valid. The test cases are from #8547. Thanks zhzhan. cc liancheng
  Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8840 from viirya/detect_invalid_part_dir.
* [SPARK-10533][SQL] handle scientific notation in sqlParser
  Daoyuan Wang, 2015-11-03 (3 files changed, -5/+32)
  https://issues.apache.org/jira/browse/SPARK-10533
  val df = sqlContext.createDataFrame(Seq(("a",1.0),("b",2.0),("c",3.0)))
  df.filter("_2 < 2.0e1").show
  Scientific notation didn't work.
  Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #9085 from adrian-wang/scinotation.
* [SPARK-11404] [SQL] Support for groupBy using column expressions
  Michael Armbrust, 2015-11-03 (3 files changed, -6/+106)
  This PR adds a new method `groupBy(cols: Column*)` to `Dataset` that allows users to group using column expressions instead of a lambda function. Since the return type of these expressions is not known at compile time, we just set the key type as a generic `Row`. If the user would like to work with the key in a type-safe way, they can call `grouped.asKey[Type]`, which is also added in this PR.
```scala
val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
val grouped = ds.groupBy($"_1").asKey[String]
val agged = grouped.mapGroups { case (g, iter) =>
  Iterator((g, iter.map(_._2).sum))
}

agged.collect()

res0: Array(("a", 30), ("b", 3), ("c", 1))
```
  Author: Michael Armbrust <michael@databricks.com> Closes #9359 from marmbrus/columnGroupBy and squashes the following commits: bbcb03b [Michael Armbrust] Update DatasetSuite.scala 8fd2908 [Michael Armbrust] Update DatasetSuite.scala 0b0e2f8 [Michael Armbrust] [SPARK-11404] [SQL] Support for groupBy using column expressions
* [SPARK-11436] [SQL] rebind right encoder when join 2 datasets
  Wenchen Fan, 2015-11-03 (2 files changed, -1/+11)
  When we join 2 datasets, we combine the 2 encoders into a tupled one and use it as the encoder for the joined dataset. Assume both of the 2 encoders are flat; their `constructExpression`s both reference the first element of the input row. However, when we combine the 2 encoders, the schema of the input row changes, and the right encoder should now reference the second element of the input row. So we should rebind the right encoder to let it know the new schema of the input row before combining them.
  Author: Wenchen Fan <wenchen@databricks.com> Closes #9391 from cloud-fan/join and squashes the following commits: 846d3ab [Wenchen Fan] rebind right encoder when join 2 datasets
* [SPARK-10429] [SQL] make mutableProjection atomic
  Davies Liu, 2015-11-03 (3 files changed, -98/+97)
  Right now, SQL's mutable projection updates each slot of the mutable row right after it evaluates the corresponding expression. This makes the behavior of MutableProjection confusing and complicates the implementation of common aggregate functions like stddev, because developers need to be aware that when evaluating the (i+1)-th expression of a mutable projection, the i-th slot of the mutable row has already been updated. This PR makes the MutableProjection atomic by generating all the results of the expressions first, then copying them into the mutable row. A micro-benchmark showed no notable performance difference between using class members and local variables. cc yhuai
  Author: Davies Liu <davies@databricks.com> Closes #9422 from davies/atomic_mutable and squashes the following commits: bbc1758 [Davies Liu] support wide table 8a0ae14 [Davies Liu] fix bug bec07da [Davies Liu] refactor 2891628 [Davies Liu] make mutableProjection atomic
* [SPARK-9858][SPARK-9859][SPARK-9861][SQL] Add an ExchangeCoordinator to estimate the number of post-shuffle partitions for aggregates and joins
  Yin Huai, 2015-11-03 (9 files changed, -44/+1115)
  https://issues.apache.org/jira/browse/SPARK-9858 https://issues.apache.org/jira/browse/SPARK-9859 https://issues.apache.org/jira/browse/SPARK-9861
  Author: Yin Huai <yhuai@databricks.com> Closes #9276 from yhuai/numReducer.
* [SPARK-9034][SQL] Reflect field names defined in GenericUDTF
  navis.ryu, 2015-11-02 (16 files changed, -17/+34)
  Hive's GenericUDTF#initialize() defines field names in the returned schema, but the current HiveGenericUDTF drops these names. We need to reflect them in the logical plan tree.
  Author: navis.ryu <navis@apache.org> Closes #8456 from navis/SPARK-9034.
* [SPARK-11469][SQL] Allow users to define nondeterministic udfs.
  Yin Huai, 2015-11-02 (5 files changed, -78/+215)
  This is the first task (https://issues.apache.org/jira/browse/SPARK-11469) of https://issues.apache.org/jira/browse/SPARK-11438
  Author: Yin Huai <yhuai@databricks.com> Closes #9393 from yhuai/udfNondeterministic.
* [SPARK-11329][SQL] Support star expansion for structs.
  Nong Li, 2015-11-02 (6 files changed, -38/+230)
  1. Support expanding structs in projections, i.e. "SELECT s.*" where s is a struct type. This is fixed by allowing the expand function to handle structs in addition to tables.
  2. Support expanding * inside aggregate functions over structs, e.g. "SELECT max(struct(col1, structCol.*))". This requires recursively expanding the expressions: in this case it is the aggregate expression "max(...)" and we need to recursively expand its children inputs.
  Author: Nong Li <nongli@gmail.com> Closes #9343 from nongli/spark-11329.
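  A small sketch of both cases through the DataFrame API; the struct column `s`, the data, and the SQLContext name are illustrative assumptions:
```scala
import org.apache.spark.sql.functions.struct
import sqlContext.implicits._  // assumes an existing SQLContext named `sqlContext`

val df = Seq((1, 2, 3)).toDF("a", "b", "c")
  .select(struct($"a", $"b").as("s"), $"c")

df.selectExpr("s.*").show()                  // expands the struct into columns a, b
df.selectExpr("max(struct(c, s.*))").show()  // star expansion inside an aggregate
```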
* [SPARK-5354][SQL] Cached tables should preserve partitioning and ordering.
  Nong Li, 2015-11-02 (3 files changed, -9/+97)
  For cached tables, we can just maintain the partitioning and ordering from the source relation.
  Author: Nong Li <nongli@gmail.com> Closes #9404 from nongli/spark-5354.
* [SPARK-11371] Make "mean" an alias for "avg" operator
  tedyu, 2015-11-02 (2 files changed, -0/+10)
  From Reynold in the thread 'Exception when using some aggregate operators' (http://search-hadoop.com/m/q3RTt0xFr22nXB4/): I don't think these are bugs. The SQL standard for average is "avg", not "mean". Similarly, a distinct count is supposed to be written as "count(distinct col)", not "countDistinct(col)". We can, however, make "mean" an alias for "avg" to improve compatibility between DataFrame and SQL.
  Author: tedyu <yuzhihong@gmail.com> Closes #9332 from ted-yu/master.
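  For reference, a minimal sketch of the SQL-side alias this adds; the data, temp table name, and SQLContext name are placeholders:
```scala
import sqlContext.implicits._  // assumes an existing SQLContext named `sqlContext`

val df = Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)).toDF("k", "v")
df.registerTempTable("kv")  // placeholder table name
// With the alias in place, mean(v) and avg(v) should produce the same result in SQL.
sqlContext.sql("SELECT k, mean(v), avg(v) FROM kv GROUP BY k").show()
```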
* [SPARK-11311][SQL] spark cannot describe temporary functions
  Daoyuan Wang, 2015-11-02 (2 files changed, -1/+15)
  When describing a temporary function, Spark would return 'Unable to find function', which is not right.
  Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #9277 from adrian-wang/functionreg.
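  A hedged sketch of the scenario being fixed, using a Hive temporary function; the function name, UDF class, and HiveContext name are illustrative assumptions:
```scala
// Register a temporary function, then describe it; previously this returned
// "Unable to find function" instead of the function's description.
hiveContext.sql(
  "CREATE TEMPORARY FUNCTION my_lower AS 'org.apache.hadoop.hive.ql.udf.UDFLower'")
hiveContext.sql("DESCRIBE FUNCTION my_lower").show(false)
```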
* [SPARK-10786][SQL] Take the whole statement to generate the CommandProcessor
  huangzhaowei, 2015-11-02 (1 file changed, -1/+1)
  In the current implementation of `SparkSQLCLIDriver.scala`, `val proc: CommandProcessor = CommandProcessorFactory.get(Array(tokens(0)), hconf)` passes only the first token of the statement to `CommandProcessorFactory`, which makes it hard to distinguish the statement `delete jar xxx` from `delete from xxx`. So it's better to pass the whole statement into `CommandProcessorFactory`. [HiveCommand](https://github.com/SaintBacchus/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/processors/HiveCommand.java#L76) already has special handling for these two statements:
```java
if (command.length > 1 && "from".equalsIgnoreCase(command[1])) {
  // special handling for SQL "delete from <table> where..."
  return null;
}
```
  Author: huangzhaowei <carlmartinmax@gmail.com> Closes #8895 from SaintBacchus/SPARK-10786.
* [SPARK-9298][SQL] Add pearson correlation aggregation function
  Liang-Chi Hsieh, 2015-11-01 (7 files changed, -2/+311)
  JIRA: https://issues.apache.org/jira/browse/SPARK-9298 This patch adds a Pearson correlation aggregation function based on `AggregateExpression2`.
  Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8587 from viirya/corr_aggregation.
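  A quick sketch of using the new aggregate through the DataFrame `corr` function; the sample data and SQLContext name are assumptions:
```scala
import org.apache.spark.sql.functions.corr
import sqlContext.implicits._  // assumes an existing SQLContext named `sqlContext`

val df = Seq((1.0, 2.0), (2.0, 4.1), (3.0, 6.2)).toDF("x", "y")
// Pearson correlation of x and y as an aggregate expression.
df.agg(corr($"x", $"y")).show()
```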
* [SPARK-11410][SQL] Add APIs to provide functionality similar to Hive's DISTRIBUTE BY and SORT BY.
  Nong Li, 2015-11-01 (4 files changed, -19/+186)
  DISTRIBUTE BY allows the user to hash-partition the data by specified exprs. It also allows optionally sorting within each resulting partition. There is no required relationship between the exprs for partitioning and sorting (i.e. one does not need to be a prefix of the other). This patch adds two APIs to DataFrame which can be used together to provide this functionality:
  1. distributeBy(), which partitions the data frame into a specified number of partitions using the partitioning exprs.
  2. localSort(), which sorts each partition using the provided sorting exprs.
  To get the DISTRIBUTE BY functionality, the user simply does: df.distributeBy(...).localSort(...)
  Author: Nong Li <nongli@gmail.com> Closes #9364 from nongli/spark-11410.
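  A brief sketch of the workflow described above, written against the names these APIs ended up with after the SPARK-11504 rename earlier in this log (repartition and sortWithinPartitions); `df`, `key`, and `value` are placeholders, and the sqlContext implicits are assumed to be in scope for the `$` syntax:
```scala
// Hash-partition by `key`, then sort rows inside each partition by `value`.
// Originally expressed as df.distributeBy(...).localSort(...), later renamed per SPARK-11504.
val distributedAndSorted = df
  .repartition($"key")
  .sortWithinPartitions($"value")
```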