path: root/sql/catalyst
Commit message | Author | Age | Files | Lines
...
* [SPARK-11352][SQL] Escape */ in the generated comments. (Yin Huai, 2015-12-01, 3 files, -3/+18)
  https://issues.apache.org/jira/browse/SPARK-11352
  Author: Yin Huai <yhuai@databricks.com>
  Closes #10072 from yhuai/SPARK-11352.
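  A minimal sketch of the kind of escaping this fix calls for, assuming a helper like the one below (name and placement illustrative, not the patch's exact code):
  ```scala
  // Any "*/" inside expression text must be escaped before it is embedded
  // in a generated /* ... */ comment; otherwise it terminates the comment
  // early and corrupts the generated code.
  def toCommentSafeString(s: String): String =
    s.replace("*/", "\\*\\/")

  // e.g. a regex literal containing "*/" now stays inside the comment:
  // toCommentSafeString("""regexp_extract(a, '.*\*/.*')""")
  ```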
* [SPARK-11954][SQL] Encoder for JavaBeans (Wenchen Fan, 2015-12-01, 8 files, -16/+438)
  Create Java versions of `constructorFor` and `extractorFor` in `JavaTypeInference`.
  Author: Wenchen Fan <wenchen@databricks.com>
  This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com>
  Closes #9937 from cloud-fan/pojo.
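  A hedged usage sketch of the resulting API (the bean class is illustrative):
  ```scala
  import org.apache.spark.sql.Encoders

  // A JavaBean-style class; getters/setters drive the inferred schema.
  class Person {
    private var name: String = _
    private var age: Int = _
    def getName: String = name
    def setName(v: String): Unit = { name = v }
    def getAge: Int = age
    def setAge(v: Int): Unit = { age = v }
  }

  val personEncoder = Encoders.bean(classOf[Person])
  // personEncoder.schema is inferred from the bean properties,
  // e.g. age: int, name: string
  ```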
* [SPARK-11856][SQL] add type cast if the real type is different but compatible with encoder schema (Wenchen Fan, 2015-12-01, 8 files, -22/+320)
  When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lose the required data type, which may lead to a runtime error if the real type doesn't match the encoder's schema. For example, if we build an encoder for `case class Data(a: Int, b: String)` and the real type is `[a: int, b: long]`, we hit a runtime error saying that we can't construct class `Data` with int and long, because we lost the information that `b` should be a string.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9840 from cloud-fan/err-msg.
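  A sketch of the failure mode described above (table and session names illustrative):
  ```scala
  // The encoder expects [a: int, b: string], but the underlying data is
  // [a: int, b: long]. Before this patch, constructing Data at runtime
  // failed because b arrived as a Long; with it, a cast is inserted where
  // the types are compatible, or the mismatch is reported clearly.
  case class Data(a: Int, b: String)
  val ds = sqlContext.table("t").as[Data]  // t has schema [a: int, b: long]
  ```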
* [SPARK-11949][SQL] Set field nullable property for GroupingSets to get correct results for null values (Liang-Chi Hsieh, 2015-12-01, 1 file, -2/+8)
  JIRA: https://issues.apache.org/jira/browse/SPARK-11949
  The result of a cube plan uses an incorrect schema. The schema of the cube result should set the nullable property to true, because the grouping expressions will produce null values.
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #10038 from viirya/fix-cube.
* [SPARK-12018][SQL] Refactor common subexpression elimination code (Liang-Chi Hsieh, 2015-11-30, 3 files, -34/+14)
  JIRA: https://issues.apache.org/jira/browse/SPARK-12018
  The code of common subexpression elimination can be factored and simplified. Some unnecessary variables can be removed.
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #10009 from viirya/refactor-subexpr-eliminate.
* [SPARK-12024][SQL] More efficient multi-column counting. (Herman van Hovell, 2015-11-29, 4 files, -64/+12)
  In https://github.com/apache/spark/pull/9409 we enabled multi-column counting. The approach taken in that PR introduces a bit of overhead by first creating a row only to check if all of the columns are non-null. This PR fixes that technical debt: Count now takes multiple columns as its input. In order to make this work, I have also added support for multiple columns in the single-distinct code path.
  cc yhuai
  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #10015 from hvanhovell/SPARK-12024.
* [SPARK-12028] [SQL] get_json_object returns an incorrect result when the value is a null literal (gatorsmile, 2015-11-27, 1 file, -2/+5)
  When calling `get_json_object` for the following two cases, both results are `"null"`:
  ```scala
  val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
  val df: DataFrame = tuple.toDF("key", "jstring")
  val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
  ```
  ```scala
  val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
  val df2: DataFrame = tuple2.toDF("key", "jstring")
  val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
  ```
  Fixed the problem and also added a test case.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #10018 from gatorsmile/get_json_object.
* [SPARK-11973][SQL] Improve optimizer code readability. (Reynold Xin, 2015-11-26, 2 files, -26/+26)
  This is a followup for https://github.com/apache/spark/pull/9959. I added more documentation and rewrote some monadic code into simpler ifs.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9995 from rxin/SPARK-11973.
* [SPARK-11863][SQL] Unable to resolve order by if it contains a mixture of aliases and real columns (Dilip Biswal, 2015-11-26, 2 files, -3/+28)
  This is based on https://github.com/apache/spark/pull/9844, with some bug fixes and clean up. The problem is that a normal operator should be resolved based on its child, but the `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences` (if the grandchild is `Project`) and `ResolveAggregateFunctions` (if the grandchild is `Aggregate`). For example, in `select c1 as a, c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. First `a` is resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we try to resolve both `a` and `c2` based on its grandchild, but fail because `a` is not a legal aggregate expression.
  Whoever merges this PR, please give the credit to dilipbiswal.
  Author: Dilip Biswal <dbiswal@us.ibm.com>
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9961 from cloud-fan/sort.
* [SPARK-12005][SQL] Work around VerifyError in HyperLogLogPlusPlus. (Marcelo Vanzin, 2015-11-26, 1 file, -5/+8)
  Just move the code around a bit; that seems to make the JVM happy.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #9985 from vanzin/SPARK-12005.
* [SPARK-11973] [SQL] push filter through aggregation with alias and literals (Davies Liu, 2015-11-26, 3 files, -11/+79)
  Currently, a filter can't be pushed through an aggregation with aliases or literals; this patch fixes that. After this patch, the time of TPC-DS query 4 goes down from 141 seconds to 13 seconds (a 10x improvement).
  cc nongli yhuai
  Author: Davies Liu <davies@databricks.com>
  Closes #9959 from davies/push_filter2.
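  A sketch of the pattern this enables (column names illustrative, not the patch's own test):
  ```scala
  import org.apache.spark.sql.functions.sum
  // assumes import sqlContext.implicits._ for the $"..." syntax

  // A predicate on an aliased grouping column can now be pushed below the
  // Aggregate, filtering rows before they are aggregated.
  val agg = df.groupBy($"a").agg(sum($"b").as("total")).select($"a".as("x"), $"total")
  agg.filter($"x" > 1)  // rewritten to Filter(a > 1) beneath the Aggregate
  ```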
* [SPARK-12003] [SQL] remove the prefix for name after expanded star (Davies Liu, 2015-11-25, 1 file, -1/+1)
  Right now, the expanded star includes the name of the expression as a prefix for each column; that's no better than not expanding, so we should not add the prefix.
  Author: Davies Liu <davies@databricks.com>
  Closes #9984 from davies/expand_star.
* [SPARK-11983][SQL] remove all unused codegen fallback trait (Daoyuan Wang, 2015-11-25, 3 files, -6/+4)
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #9966 from adrian-wang/removeFallback.
* [SPARK-11946][SQL] Audit pivot API for 1.6. (Reynold Xin, 2015-11-24, 1 file, -0/+1)
  Currently pivot's signature looks like:
  ```scala
  scala.annotation.varargs
  def pivot(pivotColumn: Column, values: Column*): GroupedData

  scala.annotation.varargs
  def pivot(pivotColumn: String, values: Any*): GroupedData
  ```
  I think we can remove the one that takes "Column" types, since callers should always be passing in literals. It'd also be clearer if the values are not varargs, but rather Seq or java.util.List. I also made similar changes for Python.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9929 from rxin/SPARK-11946.
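  A hedged usage sketch of the API after the audit (data and column names illustrative):
  ```scala
  // Pivot values passed as a Seq of literals rather than varargs of Column.
  df.groupBy("year")
    .pivot("course", Seq("dotNET", "Java"))
    .sum("earnings")
  ```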
* [SPARK-11926][SQL] unify GetStructField and GetInternalRowField (Wenchen Fan, 2015-11-24, 9 files, -42/+21)
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9909 from cloud-fan/get-struct.
* [SPARK-11942][SQL] fix encoder life cycle for CoGroup (Wenchen Fan, 2015-11-24, 1 file, -12/+15)
  We should pass resolved encoders in to the logical `CoGroup` and bind them in the physical `CoGroup`.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9928 from cloud-fan/cogroup.
* [SPARK-10707][SQL] Fix nullability computation in union output (Mikhail Bautin, 2015-11-23, 1 file, -3/+8)
  Author: Mikhail Bautin <mbautin@gmail.com>
  Closes #9308 from mbautin/SPARK-10707.
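  A sketch of the intended behavior (names illustrative): a union's output column should be nullable if the corresponding column on either side is nullable.
  ```scala
  // assumes import sqlContext.implicits._ for the $"..." syntax
  // x is non-nullable in df1 but nullable in df2; after the fix the
  // union's schema should report x as nullable.
  val unioned = df1.select($"x").unionAll(df2.select($"x"))
  assert(unioned.schema("x").nullable)
  ```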
* [SPARK-11921][SQL] fix `nullable` of encoder schema (Wenchen Fan, 2015-11-23, 2 files, -3/+50)
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9906 from cloud-fan/nullable.
* [SPARK-11894][SQL] fix isNull for GetInternalRowField (Wenchen Fan, 2015-11-23, 1 file, -14/+9)
  We should use `InternalRow.isNullAt` to check if the field is null before calling `InternalRow.getXXX`. Thanks to gatorsmile, who discovered this bug.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9904 from cloud-fan/null.
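  A minimal sketch of the null-safe access pattern described above (helper name illustrative):
  ```scala
  import org.apache.spark.sql.catalyst.InternalRow

  // Check isNullAt before the typed getter; calling getInt on a null slot
  // would otherwise return a garbage default instead of surfacing the null.
  def readIntOrNull(row: InternalRow, ordinal: Int): Any =
    if (row.isNullAt(ordinal)) null else row.getInt(ordinal)
  ```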
* [SPARK-11628][SQL] support column datatype of char(x) to recognize HiveChar (Xiu Guo, 2015-11-23, 2 files, -3/+11)
  Can someone review my code to make sure I'm not missing anything? Thanks!
  Author: Xiu Guo <xguo27@gmail.com>
  Author: Xiu Guo <guoxi@us.ibm.com>
  Closes #9612 from xguo27/SPARK-11628.
* [SPARK-11908][SQL] Add NullType support to RowEncoder (Liang-Chi Hsieh, 2015-11-22, 3 files, -2/+9)
  JIRA: https://issues.apache.org/jira/browse/SPARK-11908
  We should add NullType support to RowEncoder.
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #9891 from viirya/rowencoder-nulltype.
* [SPARK-11899][SQL] API audit for GroupedDataset. (Reynold Xin, 2015-11-21, 2 files, -2/+5)
  1. Renamed map to mapGroup, flatMap to flatMapGroup.
  2. Renamed asKey -> keyAs.
  3. Added more documentation.
  4. Changed type parameter T to V on GroupedDataset.
  5. Added since versions for all functions.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9880 from rxin/SPARK-11899.
* [SPARK-11900][SQL] Add since version for all encoders (Reynold Xin, 2015-11-21, 1 file, -0/+63)
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9881 from rxin/SPARK-11900.
* [SPARK-11819][SQL][FOLLOW-UP] fix scala 2.11 build (Wenchen Fan, 2015-11-20, 1 file, -2/+2)
  It seems Scala 2.11 doesn't support defining private methods in `trait xxx` and using them in `object xxx extends xxx`.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9879 from cloud-fan/follow.
* [SPARK-11890][SQL] Fix compilation for Scala 2.11 (Michael Armbrust, 2015-11-20, 1 file, -2/+2)
  Author: Michael Armbrust <michael@databricks.com>
  Closes #9871 from marmbrus/scala211-break.
* [SPARK-11787][SPARK-11883][SQL][FOLLOW-UP] Cleanup for this patch. (Nong Li, 2015-11-20, 2 files, -16/+49)
  This mainly moves SqlNewHadoopRDD to the sql package. There is some state that is shared with core, and I've left that in core. This allows some other associated minor cleanup.
  Author: Nong Li <nong@databricks.com>
  Closes #9845 from nongli/spark-11787.
* [SPARK-11636][SQL] Support classes defined in the REPL with Encoders (Michael Armbrust, 2015-11-20, 1 file, -2/+2)
  #theScaryParts (i.e. changes to the repl, executor classloaders and codegen)...
  Author: Michael Armbrust <michael@databricks.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #9825 from marmbrus/dataset-replClasses2.
* [SPARK-11724][SQL] Change casting between int and timestamp to consistently treat int in seconds. (Nong Li, 2015-11-20, 2 files, -10/+12)
  Hive has since changed this behavior as well. https://issues.apache.org/jira/browse/HIVE-3454
  Author: Nong Li <nong@databricks.com>
  Author: Nong Li <nongli@gmail.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #9685 from nongli/spark-11724.
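  A sketch of the now-consistent semantics (session name illustrative):
  ```scala
  // An integer cast to timestamp is interpreted as seconds since the epoch,
  // in both directions, matching Hive's updated behavior.
  sqlContext.sql("SELECT CAST(1 AS TIMESTAMP)").show()
  // 1970-01-01 00:00:01 (UTC), i.e. one second after the epoch
  ```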
* [SPARK-11819][SQL] nice error message for missing encoder (Wenchen Fan, 2015-11-20, 2 files, -23/+129)
  Before this PR, when users tried to get an encoder for an unsupported class, they only got a very simple error message like `Encoder for type xxx is not supported`. After this PR, the error message becomes friendlier, for example:
  ```
  No Encoder found for abc.xyz.NonEncodable
  - array element class: "abc.xyz.NonEncodable"
  - field (class: "scala.Array", name: "arrayField")
  - root class: "abc.xyz.AnotherClass"
  ```
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9810 from cloud-fan/error-message.
* [SPARK-11817][SQL] Truncating the fractional seconds to prevent inserting a NULL (Liang-Chi Hsieh, 2015-11-20, 2 files, -0/+13)
  JIRA: https://issues.apache.org/jira/browse/SPARK-11817
  Instead of returning None, we should truncate the fractional seconds to prevent inserting NULL.
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #9834 from viirya/truncate-fractional-sec.
* [SPARK-11864][SQL] Improve performance of max/min (Davies Liu, 2015-11-19, 5 files, -25/+45)
  This PR has the following optimizations:
  1) greatest/least already does the null check, so the `If` and `IsNull` are not necessary.
  2) In greatest/least, the result should be initialized from the first child (removing one block).
  3) For primitive types, the generated greater expression is too complicated (`(a > b ? 1 : (a < b ? -1 : 0)) > 0`); it should be as simple as `a > b`.
  Combined, these optimizations improve the performance of the `ss_max` query by 30%.
  Author: Davies Liu <davies@databricks.com>
  Closes #9846 from davies/improve_max.
* [SPARK-11275][SQL] Incorrect results when using rollup/cube (Andrew Ray, 2015-11-19, 2 files, -34/+28)
  Fixes a bug with grouping sets (including cube/rollup) where aggregates that included grouping expressions would return the wrong (null) result. Also simplifies the analyzer rule a bit and leaves column pruning to the optimizer. Added multiple unit tests to DataFrameAggregateSuite and verified it passes the hive compatibility suite:
  ```
  build/sbt -Phive -Dspark.hive.whitelist='groupby.*_grouping.*' 'test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite'
  ```
  This is an alternative to PR https://github.com/apache/spark/pull/9419, but I think it's better as it simplifies the analyzer rule instead of adding another special case to it.
  Author: Andrew Ray <ray.andrew@gmail.com>
  Closes #9815 from aray/groupingset-agg-fix.
* [SPARK-11750][SQL] revert SPARK-11727 and code clean up (Wenchen Fan, 2015-11-19, 11 files, -1101/+350)
  After some experimentation, I found it's not convenient to have separate encoder builders `FlatEncoder` and `ProductEncoder`. For example, when creating encoders for `ScalaUDF`, we have no idea if the type `T` is flat or not. So I revert the splitting change in https://github.com/apache/spark/pull/9693, while still keeping the bug fixes and tests.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9726 from cloud-fan/follow.
* [SPARK-11840][SQL] Restore 1.5's behavior of planning a single distinct aggregation. (Yin Huai, 2015-11-19, 1 file, -2/+2)
  The impact of this change is for a query that has a single distinct column and does not have any grouping expression, like `SELECT COUNT(DISTINCT a) FROM table`. The plan will be changed from
  ```
  AGG-2 (count distinct)
    Shuffle to a single reducer
      Partial-AGG-2 (count distinct)
        AGG-1 (grouping on a)
          Shuffle by a
            Partial-AGG-1 (grouping on a)
  ```
  to the following one (1.5 uses this):
  ```
  AGG-2
    AGG-1 (grouping on a)
      Shuffle to a single reducer
        Partial-AGG-1 (grouping on a)
  ```
  The first plan is more robust. However, to better benchmark the impact of this change, we should use 1.5's plan and use the conf of `spark.sql.specializeSingleDistinctAggPlanning` to control the plan.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #9828 from yhuai/distinctRewriter.
* [SPARK-11849][SQL] Analyzer should replace current_date and current_timestamp with literals (Reynold Xin, 2015-11-19, 2 files, -5/+60)
  We currently rely on the optimizer's constant folding to replace current_timestamp and current_date. However, this can still result in different values for different instances of current_timestamp/current_date if the optimizer is not running fast enough. A better solution is to replace these functions in the analyzer in one shot.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9833 from rxin/SPARK-11849.
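  A minimal sketch of such an analyzer rule, assuming the Catalyst rule API (the rule name and body are illustrative, not the patch's exact code):
  ```scala
  import org.apache.spark.sql.catalyst.expressions.{CurrentTimestamp, Literal}
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule
  import org.apache.spark.sql.types.TimestampType

  object ReplaceCurrentTime extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = {
      // Evaluate once per query so every reference sees the same instant.
      val nowMicros = System.currentTimeMillis() * 1000L
      plan transformAllExpressions {
        case CurrentTimestamp() => Literal.create(nowMicros, TimestampType)
      }
    }
  }
  ```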
* [SPARK-11787][SQL] Improve Parquet scan performance when using flat schemas. (Nong Li, 2015-11-18, 3 files, -12/+49)
  This patch adds an alternative to the Parquet RecordReader from the parquet-mr project that is much faster for flat schemas. Instead of using the general converter mechanism from parquet-mr, this directly uses the lower-level APIs from parquet-columnar and a custom RecordReader that directly assembles into UnsafeRows. This can be disabled and is only used for supported schemas. Using the tpcds store sales table and doing a sum of increasingly more columns, the results are:
  ```
  1 column:  before 11.3M rows/second, after 18.2M rows/second
  2 columns: before  7.2M rows/second, after 11.2M rows/second
  5 columns: before  2.9M rows/second, after  4.5M rows/second
  ```
  Author: Nong Li <nong@databricks.com>
  Closes #9774 from nongli/parquet.
* [SPARK-11833][SQL] Add Java tests for Kryo/Java Dataset encoders (Reynold Xin, 2015-11-18, 3 files, -38/+93)
  Also added some nicer error messages for incompatible types (private types and primitive types) for the Kryo/Java encoder.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9823 from rxin/SPARK-11833.
* [SPARK-11636][SQL] Support classes defined in the REPL with Encoders (Michael Armbrust, 2015-11-18, 14 files, -66/+178)
  Before this PR there were two things that would blow up if you called `df.as[MyClass]` if `MyClass` was defined in the REPL:
  - [x] Because `classForName` doesn't work on the munged names returned by `tpe.erasure.typeSymbol.asClass.fullName`
  - [x] Because we don't have anything to pass into the constructor for the `$outer` pointer.
  Note that this PR is just adding the infrastructure for working with inner classes in encoders and is not yet sufficient to make them work in the REPL. Currently, the implementation shown in https://github.com/marmbrus/spark/commit/95cec7d413b930b36420724fafd829bef8c732ab is causing a bug that breaks code gen due to some interaction between janino and the `ExecutorClassLoader`. This will be addressed in a follow-up PR.
  Author: Michael Armbrust <michael@databricks.com>
  Closes #9602 from marmbrus/dataset-replClasses.
* [SPARK-11810][SQL] Java-based encoder for opaque types in Datasets. (Reynold Xin, 2015-11-18, 3 files, -39/+96)
  This patch refactors the existing Kryo encoder expressions and adds support for Java serialization.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9802 from rxin/SPARK-11810.
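  A hedged usage sketch of the two opaque-type encoders (class name illustrative):
  ```scala
  import org.apache.spark.sql.Encoders

  // An arbitrary class stored as a single binary column; the Java variant
  // requires the class to be Serializable, the Kryo variant does not.
  class Opaque(val payload: String) extends Serializable

  val javaEncoder = Encoders.javaSerialization[Opaque]
  val kryoEncoder = Encoders.kryo[Opaque]
  ```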
* [SPARK-11720][SQL][ML] Handle edge cases when count = 0 or 1 for Stats function (JihongMa, 2015-11-18, 5 files, -17/+39)
  Return Double.NaN for mean/average when count == 0 for all numeric types that are converted to Double; the Decimal type continues to return null.
  Author: JihongMa <linlin200605@gmail.com>
  Closes #9705 from JihongMA/SPARK-11720.
* [SPARK-11725][SQL] correctly handle null inputs for UDF (Wenchen Fan, 2015-11-18, 5 files, -1/+107)
  If a user uses primitive parameters in a UDF, there is no way for them to null-check the primitive inputs, so we assume the primitive input is null-propagatable in this case and return null if the input is null.
  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #9770 from cloud-fan/udf.
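  A sketch of the behavior this gives (names illustrative):
  ```scala
  import org.apache.spark.sql.functions.udf
  // assumes import sqlContext.implicits._ for the $"..." syntax

  // The UDF takes a primitive Int, so it cannot observe null itself;
  // null inputs are now propagated straight to a null output.
  val plusOne = udf((x: Int) => x + 1)
  df.select(plusOne($"maybeNull"))  // rows where maybeNull is null yield null
  ```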
* [SPARK-11802][SQL] Kryo-based encoder for opaque types in Datasets (Reynold Xin, 2015-11-18, 5 files, -7/+117)
  I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases attached and filed SPARK-11803.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9789 from rxin/SPARK-11802.
* [SPARK-11643] [SQL] parse year with leading zero (Davies Liu, 2015-11-17, 2 files, -5/+32)
  Support years in the range 0 <= year < 1000.
  Author: Davies Liu <davies@databricks.com>
  Closes #9701 from davies/leading_zero.
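  A sketch of what now parses (session name illustrative):
  ```scala
  // Years below 1000, written with leading zeros, now parse as dates
  // instead of being rejected.
  sqlContext.sql("SELECT CAST('0001-01-01' AS DATE)").show()
  ```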
* [SPARK-8658][SQL][FOLLOW-UP] AttributeReference's equals method compares all the members (gatorsmile, 2015-11-17, 2 files, -2/+9)
  Based on the comment of cloud-fan in https://github.com/apache/spark/pull/9216, update the AttributeReference's hashCode function by including the hashCode of the other attributes, including name, nullable and qualifiers. Here, I am not 100% sure if we should include name in the hashCode calculation, since the original hashCode calculation does not include it. marmbrus cloud-fan Please review if the changes are good.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #9761 from gatorsmile/hashCodeNamedExpression.
* [SPARK-11679][SQL] Invoking method "apply(fields: java.util.List[StructField])" in "StructType" gets ClassCastException (mayuanwen, 2015-11-17, 1 file, -1/+2)
  In the previous method, fields.toArray casts java.util.List[StructField] into Array[Object], which cannot be cast into Array[StructField]; thus invoking this method throws "java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.sql.types.StructField;". I directly cast java.util.List[StructField] into Array[StructField] in this patch.
  Author: mayuanwen <mayuanwen@qiyi.com>
  Closes #9649 from jackieMaKing/Spark-11679.
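  A sketch of the failure and one safe shape for the conversion (not necessarily the patch's exact code):
  ```scala
  import scala.collection.JavaConverters._
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  val fields: java.util.List[StructField] =
    java.util.Arrays.asList(StructField("a", IntegerType))

  // fields.toArray returns Array[Object]; casting that to Array[StructField]
  // throws ClassCastException. Converting element-by-element is safe:
  val schema = StructType(fields.asScala.toArray)
  ```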
* [MINOR] [SQL] Fix randomly generated ArrayData in RowEncoderSuite (Liang-Chi Hsieh, 2015-11-16, 1 file, -1/+8)
  The randomly generated ArrayData used for the UDT `ExamplePoint` in `RowEncoderSuite` sometimes doesn't have enough elements, in which case the test fails. This patch fixes it.
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #9757 from viirya/fix-randomgenerated-udt.
* [SPARK-11447][SQL] change NullType to StringType during binaryComparison between NullType and StringType (Kevin Yu, 2015-11-16, 1 file, -0/+6)
  While executing the PromoteStrings rule, if one side of a binaryComparison is StringType and the other side is not StringType, the current code promotes (casts) the StringType to DoubleType, and if the string doesn't contain a number, it becomes a null value. So if it is doing <=> (null-safe equal) with null, it will not filter anything, causing the problem reported in this JIRA. I propose the changes through this PR; can you review my code changes? This problem only happens for <=>; other operators work fine.
  ```
  scala> val filteredDF = df.filter(df("column") > (new Column(Literal(null))))
  filteredDF: org.apache.spark.sql.DataFrame = [column: string]

  scala> filteredDF.show
  +------+
  |column|
  +------+
  +------+

  scala> val filteredDF = df.filter(df("column") === (new Column(Literal(null))))
  filteredDF: org.apache.spark.sql.DataFrame = [column: string]

  scala> filteredDF.show
  +------+
  |column|
  +------+
  +------+

  scala> df.registerTempTable("DF")

  scala> sqlContext.sql("select * from DF where 'column' = NULL")
  res27: org.apache.spark.sql.DataFrame = [column: string]

  scala> res27.show
  +------+
  |column|
  +------+
  +------+
  ```
  Author: Kevin Yu <qyu@us.ibm.com>
  Closes #9720 from kevinyu98/working_on_spark-11447.
* [SPARK-11768][SPARK-9196][SQL] Support now function in SQL (alias for current_timestamp). (Reynold Xin, 2015-11-16, 1 file, -0/+1)
  This patch adds an alias for current_timestamp (now function). Also fixes SPARK-9196 to re-enable the test case for current_timestamp.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9753 from rxin/SPARK-11768.
* [SPARK-8658][SQL] AttributeReference's equals method compares all the members (gatorsmile, 2015-11-16, 3 files, -12/+14)
  This fix is to change the equals method to check all of the specified fields for equality of AttributeReference.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #9216 from gatorsmile/namedExpressEqual.
* [SPARK-11553][SQL] Primitive Row accessors should not convert null to default value (Bartlomiej Alberski, 2015-11-16, 2 files, -8/+44)
  Invocation of getters for types extending AnyVal returns the default value (if the field value is null) instead of throwing an NPE. Please check the comments on the SPARK-11553 issue for more details.
  Author: Bartlomiej Alberski <bartlomiej.alberski@allegrogroup.com>
  Closes #9642 from alberskib/bugfix/SPARK-11553.
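  A sketch of the behavior change (row contents illustrative):
  ```scala
  import org.apache.spark.sql.Row

  val row = Row(null)
  // Before: row.getInt(0) silently returned 0 for the null field.
  // After this fix it should throw a NullPointerException, so a missing
  // value can no longer masquerade as a real 0.
  // row.getInt(0)
  ```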