path: root/sql
Commit message | Author | Age | Files | Lines
* [SQL] remove redundant field "childOutput" from execution.Aggregate, use ↵kai2015-01-301-6/+2
| | | | | | | | | | child.output instead Author: kai <kaizeng@eecs.berkeley.edu> Closes #4291 from kai-zeng/aggregate-fix and squashes the following commits: 78658ef [kai] remove redundant field "childOutput"
* [SPARK-5504] [sql] convertToCatalyst should support nested arraysJoseph K. Bradley2015-01-302-3/+11
| | | | | | | | | | | | | | After the recent refactoring, convertToCatalyst in ScalaReflection does not recurse on Arrays. It should. The test suite modification made the test fail before the fix in ScalaReflection. The fix makes the test suite succeed. CC: marmbrus Author: Joseph K. Bradley <joseph@databricks.com> Closes #4295 from jkbradley/SPARK-5504 and squashes the following commits: 6b7276d [Joseph K. Bradley] Fixed issue in ScalaReflection.convertToCatalyst with Arrays with non-primitive types. Modified test suite so it failed before the fix and works after the fix.
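The recursion described above can be illustrated with a small, self-contained sketch. This is not the code in `ScalaReflection`; `toNested` is a hypothetical helper that only shows why the `Array` case has to recurse if nested arrays are to be converted at every depth.

```scala
// Hypothetical helper, not Spark's ScalaReflection.convertToCatalyst:
// converts arrays (at any nesting depth) into Seqs by recursing on elements.
object NestedArrayConversion {
  def toNested(value: Any): Any = value match {
    case arr: Array[_] => arr.toSeq.map(toNested) // recurse into array elements
    case seq: Seq[_]   => seq.map(toNested)       // recurse into sequence elements
    case other         => other                   // leaf values are returned unchanged
  }
}
```

Without the recursive call in the `Array` case, only the outermost array would be converted and the inner arrays would be passed through as-is.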
* [SPARK-5457][SQL] Add missing DSL for ApproxCountDistinct.Takuya UESHIN2015-01-301-0/+5
| | | | | | | | | | | Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #4250 from ueshin/issues/SPARK-5457 and squashes the following commits: 3c05e59 [Takuya UESHIN] Remove parameter to use default value of ApproxCountDistinct. faea19d [Takuya UESHIN] Use overload instead of default value for Java support. d1cca38 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-5457 663d43d [Takuya UESHIN] Add missing DSL for ApproxCountDistinct.
* [SQL] Support df("*") to select all columns in a data frame.Reynold Xin2015-01-297-29/+54
| | | | | | | | | | | This PR makes Star a trait, and provides two implementations: UnresolvedStar (used for *, tblName.*) and ResolvedStar (used for df("*")). Author: Reynold Xin <rxin@databricks.com> Closes #4283 from rxin/df-star and squashes the following commits: c9cba3e [Reynold Xin] Removed mapFunction in UnresolvedStar. 1a3a1d7 [Reynold Xin] [SQL] Support df("*") to select all columns in a data frame.
* [SPARK-5462] [SQL] Use analyzed query plan in DataFrame.apply()Josh Rosen2015-01-292-3/+9
| | | | | | | | | | This patch changes DataFrame's `apply()` method to use an analyzed query plan when resolving column names. This fixes a bug where `apply` would throw "invalid call to qualifiers on unresolved object" errors when called on DataFrames constructed via `SQLContext.sql()`. Author: Josh Rosen <joshrosen@databricks.com> Closes #4282 from JoshRosen/SPARK-5462 and squashes the following commits: b9e6da2 [Josh Rosen] [SPARK-5462] Use analyzed query plan in DataFrame.apply().
* [SQL] DataFrame API improvementsReynold Xin2015-01-296-16/+209
1. Added Dsl.column in case Dsl.col is shadowed.
2. Allow using String to specify the target data type in cast.
3. Support sorting on multiple columns using column names.
4. Added Java API test file.

Author: Reynold Xin <rxin@databricks.com>

Closes #4280 from rxin/dsl1 and squashes the following commits:

33ecb7a [Reynold Xin] Add the Java test.
d06540a [Reynold Xin] [SQL] DataFrame API improvements.
* [SPARK-4296][SQL] Trims aliases when resolving and checking aggregate ↵Yin Huai2015-01-291-0/+15
expressions

I believe that SPARK-4296 has been fixed by 3684fd21e1ffdc0adaad8ff6b31394b637e866ce. I am adding tests based on #3910 (changing the UDF to a HiveUDF instead).

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #4010 from yhuai/SPARK-4296-yin and squashes the following commits:

6343800 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-4296-yin
6cfadd2 [Yin Huai] Actually, this issue has been fixed by 3684fd21e1ffdc0adaad8ff6b31394b637e866ce.
d42b707 [Yin Huai] Update comment.
8b3a274 [Yin Huai] Since expressions in grouping expressions can have aliases, which can be used by the outer query block, revert this change.
443538d [Cheng Lian] Trims aliases when resolving and checking aggregate expressions
* [SPARK-5373][SQL] Literal in agg grouping expressions leads to incorrect resultwangfei2015-01-292-4/+14
`select key, count(*) from src group by key, 1` returns the wrong answer. For example, given this table:

```
val testData2 =
  TestSQLContext.sparkContext.parallelize(
    TestData2(1, 1) ::
    TestData2(1, 2) ::
    TestData2(2, 1) ::
    TestData2(2, 2) ::
    TestData2(3, 1) ::
    TestData2(3, 2) :: Nil, 2).toSchemaRDD
testData2.registerTempTable("testData2")
```

the result of `SELECT a, count(1) FROM testData2 GROUP BY a, 1` is

```
[1,1]
[2,2]
[3,1]
```

Author: wangfei <wangfei1@huawei.com>

Closes #4169 from scwf/agg-bug and squashes the following commits:

05751db [wangfei] fix bugs when literal in agg grouping expressioons
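As a point of reference, a constant in the grouping list should not change how rows are grouped, so each value of `a` in the table above should be counted twice. The sketch below reproduces that expectation with plain Scala collections. It does not use Spark and says nothing about how the fix itself is implemented; `GroupByLiteral` is a name made up for this illustration.

```scala
// Plain Scala collections stand in for the table; no Spark involved.
object GroupByLiteral {
  final case class TestData2(a: Int, b: Int)

  val testData2: Seq[TestData2] = Seq(
    TestData2(1, 1), TestData2(1, 2),
    TestData2(2, 1), TestData2(2, 2),
    TestData2(3, 1), TestData2(3, 2))

  // GROUP BY a, 1: the literal 1 is the same for every row,
  // so the groups are determined by `a` alone.
  val counts: Seq[(Int, Int)] =
    testData2
      .groupBy(row => (row.a, 1))
      .map { case ((a, _), rows) => (a, rows.size) }
      .toSeq
      .sortBy(_._1)
  // counts == Seq((1, 2), (2, 2), (3, 2))
}
```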
* [SPARK-5367][SQL] Support star expression in udfwangfei2015-01-292-5/+15
Spark SQL currently does not support star expressions in UDFs. Running the following SQL with spark-sql fails with an error:

```
select concat(*) from src
```

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #4163 from scwf/udf-star and squashes the following commits:

9db7b39 [wangfei] addressed comments
da1da09 [scwf] minor fix
f87b5f9 [scwf] added test case
587bf7e [wangfei] compile fix
eb93c16 [wangfei] fix star resolve issue in udf
* [SPARK-4786][SQL]: Parquet filter pushdown for castable typesYash Datta2015-01-292-2/+51
| | | | | | | | | | | | Enable parquet filter pushdown of castable types like short, byte that can be cast to integer Author: Yash Datta <Yash.Datta@guavus.com> Closes #4156 from saucam/filter_short and squashes the following commits: a403979 [Yash Datta] SPARK-4786: Fix styling issues d029866 [Yash Datta] SPARK-4786: Add test case cb2e0d9 [Yash Datta] SPARK-4786: Parquet filter pushdown for castable types
* [SPARK-5309][SQL] Add support for dictionaries in PrimitiveConverter for ↵Michael Davies2015-01-292-12/+47
Strings.

Parquet Converters allow developers to take advantage of dictionary encoding of column data to reduce Column Binary decoding. The Spark PrimitiveConverter was not using that API and consequently, for String columns that used dictionary compression, repeated the Binary-to-String conversion for the same String. In measurements this could account for over 25% of entire query time. For example, a 500M row table split across 16 blocks was aggregated and summed in a little under 30s before this change and a little under 20s after the change.

Author: Michael Davies <Michael.BellDavies@gmail.com>

Closes #4187 from MickDavies/SPARK-5309-2 and squashes the following commits:

327287e [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.
33c002c [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.
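The saving described above comes from decoding each dictionary entry once and turning every per-row conversion into a lookup. A minimal sketch of that idea follows. `DictionaryStringConverter` and its method names are illustrative assumptions, not Parquet's or Spark's API, and the `decodeCount` field exists only to make the effect observable.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical converter illustrating the idea; not the actual Spark class.
final class DictionaryStringConverter {
  private var decoded: Array[String] = Array.empty
  var decodeCount = 0 // number of Binary-to-String conversions performed

  private def decode(bytes: Array[Byte]): String = {
    decodeCount += 1
    new String(bytes, UTF_8)
  }

  // Decode every dictionary entry once, up front.
  def setDictionary(entries: Array[Array[Byte]]): Unit =
    decoded = entries.map(bytes => decode(bytes))

  // Each row then only needs an array lookup by dictionary id.
  def valueFromDictionary(id: Int): String = decoded(id)
}
```

With a two-entry dictionary, reading any number of rows costs two conversions in total rather than one per row.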
* [SPARK-5429][SQL] Use javaXML plan serialization for Hive golden answers on ↵Liang-Chi Hsieh2015-01-291-0/+2
| | | | | | | | | | | | Hive 0.13.1 I found that running `HiveComparisonTest.createQueryTest` to generate Hive golden answer files on Hive 0.13.1 would throw KryoException. I am not sure if this can be reproduced by others. Since Hive 0.13.0, Kryo plan serialization is introduced to replace javaXML as default plan serialization format. This is a quick fix to set hive configuration to use javaXML serialization. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4223 from viirya/fix_hivetest and squashes the following commits: 97a8760 [Liang-Chi Hsieh] Use javaXML plan serialization.
* [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.Reynold Xin2015-01-2918-134/+35
| | | | | | | | | | | Turns out Scala does generate static methods for ones defined in a companion object. Finally no need to separate api.java.dsl and api.scala.dsl. Author: Reynold Xin <rxin@databricks.com> Closes #4276 from rxin/dsl and squashes the following commits: 30aa611 [Reynold Xin] Add all files. 1a9d215 [Reynold Xin] [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
* [SQL] Various DataFrame DSL update.Reynold Xin2015-01-297-9/+94
1. Added foreach, foreachPartition, flatMap to DataFrame.
2. Added col() in dsl.
3. Support renaming columns in toDataFrame.
4. Support type inference on arrays (in addition to Seq).
5. Updated mllib to use the new DSL.

Author: Reynold Xin <rxin@databricks.com>

Closes #4260 from rxin/sql-dsl-update and squashes the following commits:

73466c1 [Reynold Xin] Fixed LogisticRegression. Also added better error message for resolve.
fab3ccc [Reynold Xin] Bug fix.
d31fcd2 [Reynold Xin] Style fix.
62608c4 [Reynold Xin] [SQL] Various DataFrame DSL update.
* [SPARK-5445][SQL] Made DataFrame dsl usable in JavaReynold Xin2015-01-2822-277/+298
| | | | | | | | | | | | | | | | | Also removed the literal implicit transformation since it is pretty scary for API design. Instead, created a new lit method for creating literals. This doesn't break anything from a compatibility perspective because Literal was added two days ago. Author: Reynold Xin <rxin@databricks.com> Closes #4241 from rxin/df-docupdate and squashes the following commits: c0f4810 [Reynold Xin] Fix Python merge conflict. 094c7d7 [Reynold Xin] Minor style fix. Reset Python tests. 3c89f4a [Reynold Xin] Package. dfe6962 [Reynold Xin] Updated Python aggregate. 5dd4265 [Reynold Xin] Made dsl Java callable. 14b3c27 [Reynold Xin] Fix literal expression for symbols. 68b31cb [Reynold Xin] Literal. 4cfeb78 [Reynold Xin] [SPARK-5097][SQL] Address DataFrame code review feedback.
* [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.Reynold Xin2015-01-2826-194/+208
| | | | | | | | | | | | | | and [SPARK-5448][SQL] Make CacheManager a concrete class and field in SQLContext Author: Reynold Xin <rxin@databricks.com> Closes #4242 from rxin/sqlCleanup and squashes the following commits: e351cb2 [Reynold Xin] Fixed toDataFrame. 6545c42 [Reynold Xin] More changes. 728c017 [Reynold Xin] [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
* [SPARK-5097][SQL] Test cases for DataFrame expressions.Reynold Xin2015-01-277-73/+315
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4235 from rxin/df-tests1 and squashes the following commits: f341db6 [Reynold Xin] [SPARK-5097][SQL] Test cases for DataFrame expressions.
* [SPARK-5097][SQL] DataFrameReynold Xin2015-01-2750-1073/+2494
This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities.

TODOs: With the exception of Python support, other tasks can be done in separate, follow-up PRs.

- [ ] Audit of the API
- [ ] Documentation
- [ ] More test cases to cover the new API
- [x] Python support
- [ ] Type alias SchemaRDD

Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4173 from rxin/df1 and squashes the following commits:

0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
23b4427 [Reynold Xin] Mima.
828f70d [Reynold Xin] Merge pull request #7 from davies/df
257b9e6 [Davies Liu] add repartition
6bf2b73 [Davies Liu] fix collect with UDT and tests
e971078 [Reynold Xin] Missing quotes.
b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
a728bf2 [Reynold Xin] Example rename.
e8aa3d3 [Reynold Xin] groupby -> groupBy.
9662c9e [Davies Liu] improve DataFrame Python API
4ae51ea [Davies Liu] python API for dataframe
1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
2ca74db [Reynold Xin] Couple minor fixes.
ea98ea1 [Reynold Xin] Documentation & literal expressions.
2b22684 [Reynold Xin] Got rid of IntelliJ problems.
02bbfbc [Reynold Xin] Tightening imports.
ffbce66 [Reynold Xin] Fixed compilation error.
59b6d8b [Reynold Xin] Style violation.
b85edfb [Reynold Xin] ALS.
8c37f0a [Reynold Xin] Made MLlib and examples compile
6d53134 [Reynold Xin] Hive module.
d35efd5 [Reynold Xin] Fixed compilation error.
ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
66d5ef1 [Reynold Xin] SQLContext minor patch.
c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
* [SPARK-5202] [SQL] Add hql variable substitution supportCheng Hao2015-01-212-2/+22
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution

This is a blocking issue for CLI users: it affects existing hql scripts from Hive.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4003 from chenghao-intel/substitution and squashes the following commits:

bb41fd6 [Cheng Hao] revert the removed the implicit conversion
af7c31a [Cheng Hao] add hql variable substitution support
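For readers unfamiliar with the feature, Hive scripts can reference configuration values as `${hiveconf:name}`, and the reference is replaced with the value before the statement runs. The sketch below shows that textual substitution only. `VariableSubstitution.substitute` is a hypothetical helper written for this illustration, it handles just the `hiveconf` namespace, and it is not the code added by this commit.

```scala
import scala.util.matching.Regex

object VariableSubstitution {
  // Matches ${hiveconf:name}; only this one namespace is handled in the sketch.
  private val Variable: Regex = """\$\{hiveconf:([^}]+)\}""".r

  // Replaces each reference with its value; unknown variables are left untouched.
  def substitute(sql: String, conf: Map[String, String]): String =
    Variable.replaceAllIn(sql, m =>
      Regex.quoteReplacement(conf.getOrElse(m.group(1), m.matched)))
}
```

For example, with `Map("k" -> "86")` the statement `SELECT * FROM src WHERE key = ${hiveconf:k}` becomes `SELECT * FROM src WHERE key = 86`.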
* [SQL] [Minor] Remove deprecated parquet testsCheng Lian2015-01-213-1289/+212
This PR removes the deprecated `ParquetQuerySuite`, renames `ParquetQuerySuite2` to `ParquetQuerySuite`, and refactors changes introduced in #4115 to `ParquetFilterSuite`. It is a follow-up of #3644. Notice that test cases in the old `ParquetQuerySuite` have already been well covered by other test suites introduced in #3644.

Author: Cheng Lian <lian@databricks.com>

Closes #4116 from liancheng/remove-deprecated-parquet-tests and squashes the following commits:

f73b8f9 [Cheng Lian] Removes deprecated Parquet test suite
* Revert "[SPARK-5244] [SQL] add coalesce() in sql parser"Josh Rosen2015-01-212-11/+0
| | | | This reverts commit 812d3679f5f97df7b667cbc3365a49866ebc02d5.
* [SPARK-5009] [SQL] Long keyword support in SQL ParsersCheng Hao2015-01-218-81/+128
* `SqlLexical.allCaseVersions` causes a `StackOverflowException` if a keyword is too long; this patch fixes that by normalizing all of the keywords in `SqlLexical`.
* It also introduces a unified SparkSQLParser for sharing the common code.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3926 from chenghao-intel/long_keyword and squashes the following commits:

686660f [Cheng Hao] Support Long Keyword and Refactor the SQLParsers
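To see why normalization helps, note that enumerating every upper/lower-case spelling of a keyword grows as 2^n in the number of letters, whereas lower-casing each token and looking it up in a set is a single operation. The sketch below illustrates the contrast. It does not reproduce `SqlLexical`; the keyword set and the names `caseVariantCount` and `isKeyword` are assumptions made for the example.

```scala
import java.util.Locale

object KeywordNormalization {
  // Number of distinct case spellings that a per-variant approach would have
  // to generate: two choices per letter, so 2^n for n letters.
  def caseVariantCount(keyword: String): BigInt =
    BigInt(2).pow(keyword.count(_.isLetter))

  // Alternative: keep keywords in one canonical case and normalize each token.
  // The set below is an arbitrary sample, not the parser's real keyword list.
  private val keywords: Set[String] = Set("select", "from", "where", "timestamp")

  def isKeyword(token: String): Boolean =
    keywords.contains(token.toLowerCase(Locale.ROOT))
}
```

A nine-letter keyword such as `TIMESTAMP` already has 512 case variants, while the normalized lookup costs the same for keywords of any length.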
* [SPARK-5244] [SQL] add coalesce() in sql parserDaoyuan Wang2015-01-212-0/+11
| | | | | | | | Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4040 from adrian-wang/coalesce and squashes the following commits: 0ac8e8f [Daoyuan Wang] add coalesce() in sql parser
* [SPARK-5323][SQL] Remove Row's Seq inheritance.Reynold Xin2015-01-2047-956/+1018
| | | | | | | | | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4115 from rxin/row-seq and squashes the following commits: e33abd8 [Reynold Xin] Fixed compilation error. cceb650 [Reynold Xin] Python test fixes, and removal of WrapDynamic. 0334a52 [Reynold Xin] mkString. 9cdeb7d [Reynold Xin] Hive tests. 15681c2 [Reynold Xin] Fix more test cases. ea9023a [Reynold Xin] Fixed a catalyst test. c5e2cb5 [Reynold Xin] Minor patch up. b9cab7c [Reynold Xin] [SPARK-5323][SQL] Remove Row's Seq inheritance.
* [SPARK-5287][SQL] Add defaultSizeOf to every data type.Yin Huai2015-01-205-48/+201
JIRA: https://issues.apache.org/jira/browse/SPARK-5287

This PR only adds `defaultSizeOf` to data types and makes those internal type classes `protected[sql]`. I will use another PR to clean up the type hierarchy of data types.

Author: Yin Huai <yhuai@databricks.com>

Closes #4081 from yhuai/SPARK-5287 and squashes the following commits:

90cec75 [Yin Huai] Update unit test.
e1c600c [Yin Huai] Make internal classes protected[sql].
7eaba68 [Yin Huai] Add `defaultSize` method to data types.
fd425e0 [Yin Huai] Add all native types to NativeType.defaultSizeOf.
* [SQL][Minor] Refactors deeply nested FP style code in BooleanSimplificationCheng Lian2015-01-202-37/+57
This is a follow-up of #4090. The original deeply nested `reduceOption` code is hard to grasp.

Author: Cheng Lian <lian@databricks.com>

Closes #4091 from liancheng/refactor-boolean-simplification and squashes the following commits:

cd8860b [Cheng Lian] Improves `compareConditions` to handle more subtle cases
1bf3258 [Cheng Lian] Avoids converting predicate sets to lists
e833ca4 [Cheng Lian] Refactors deeply nested FP style code
* [SQL][minor] Add a log4j file for catalyst test.Reynold Xin2015-01-201-0/+28
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4117 from rxin/catalyst-test-log4j and squashes the following commits: 8ad610b [Reynold Xin] [SQL][minor] Add a log4j file for catalyst test.
* [SPARK-5286][SQL] Fail to drop an invalid table when using the data source APIYin Huai2015-01-192-0/+18
| | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-5286 Author: Yin Huai <yhuai@databricks.com> Closes #4076 from yhuai/SPARK-5286 and squashes the following commits: 6b69ed1 [Yin Huai] Catch all exception when we try to uncache a query.
* [SPARK-5284][SQL] Insert into Hive throws NPE when a inner complex type ↵Yin Huai2015-01-192-9/+54
| | | | | | | | | | | | field has a null value JIRA: https://issues.apache.org/jira/browse/SPARK-5284 Author: Yin Huai <yhuai@databricks.com> Closes #4077 from yhuai/SPARK-5284 and squashes the following commits: fceacd6 [Yin Huai] Check if a value is null when the field has a complex type.
* [SQL] fix typo in class descriptionJacky Li2015-01-181-3/+3
| | | | | | | | | | Author: Jacky Li <jacky.likun@gmail.com> Closes #4100 from jackylk/patch-9 and squashes the following commits: b13b9d6 [Jacky Li] Update SQLConf.scala 4d3f83d [Jacky Li] Update SQLConf.scala fcc8c85 [Jacky Li] [SQL] fix typo in class description
* [SQL][minor] Put DataTypes.java in java dir.Reynold Xin2015-01-181-0/+0
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4097 from rxin/javarename and squashes the following commits: c5ce96a [Reynold Xin] [SQL][minor] Put DataTypes.java in java dir.
* [SPARK-5279][SQL] Use java.math.BigDecimal as the exposed Decimal type.Reynold Xin2015-01-1827-77/+101
| | | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4092 from rxin/bigdecimal and squashes the following commits: 27b08c9 [Reynold Xin] Fixed test. 10cb496 [Reynold Xin] [SPARK-5279][SQL] Use java.math.BigDecimal as the exposed Decimal type.
* [SQL][Minor] Added comments and examples to explain BooleanSimplificationReynold Xin2015-01-171-83/+94
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4090 from rxin/booleanSimplification and squashes the following commits: 68c8986 [Reynold Xin] [SQL][Minor] Added comments and examples to explain BooleanSimplification.
* [SPARK-4937][SQL] Comment for the newly optimization rules in ↵scwf2015-01-171-2/+16
| | | | | | | | | | | | | | `BooleanSimplification` Follow up of #3778 /cc rxin Author: scwf <wangfei1@huawei.com> Closes #4086 from scwf/commentforspark-4937 and squashes the following commits: aaf89f6 [scwf] code style issue 2d3406e [scwf] added comment for spark-4937
* [SQL][minor] Improved Row documentation.Reynold Xin2015-01-171-52/+114
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4085 from rxin/row-doc and squashes the following commits: f77cb27 [Reynold Xin] [SQL][minor] Improved Row documentation.
* [SPARK-5193][SQL] Remove Spark SQL Java-specific API.Reynold Xin2015-01-1615-1350/+46
| | | | | | | | | | | | | | | | | | | | | | | | | After the following patches, the main (Scala) API is now usable for Java users directly. https://github.com/apache/spark/pull/4056 https://github.com/apache/spark/pull/4054 https://github.com/apache/spark/pull/4049 https://github.com/apache/spark/pull/4030 https://github.com/apache/spark/pull/3965 https://github.com/apache/spark/pull/3958 Author: Reynold Xin <rxin@databricks.com> Closes #4065 from rxin/sql-java-api and squashes the following commits: b1fd860 [Reynold Xin] Fix Mima 6d86578 [Reynold Xin] Ok one more attempt in fixing Python... e8f1455 [Reynold Xin] Fix Python again... 3e53f91 [Reynold Xin] Fixed Python. 83735da [Reynold Xin] Fix BigDecimal test. e9f1de3 [Reynold Xin] Use scala BigDecimal. 500d2c4 [Reynold Xin] Fix Decimal. ba3bfa2 [Reynold Xin] Updated javadoc for RowFactory. c4ae1c5 [Reynold Xin] [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
* [SPARK-4937][SQL] Adding optimization to simplify the And, Or condition in ↵scwf2015-01-163-90/+131
spark sql

Adding optimization to simplify the And/Or condition in spark sql. There are two kinds of optimization:

1. Numeric condition optimization, such as:

   a < 3 && a > 5 ---- False
   a < 1 || a > 0 ---- True
   a > 3 && a > 5 => a > 5
   (a < 2 || b > 5) && a < 2 => a < 2

2. Optimizing some queries from a cartesian product into an equi-join, such as this sql (one of hive-testbench):

```
select
  sum(l_extendedprice* (1 - l_discount)) as revenue
from
  lineitem, part
where
  (
    p_partkey = l_partkey
    and p_brand = 'Brand#32'
    and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
    and l_quantity >= 7 and l_quantity <= 7 + 10
    and p_size between 1 and 5
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON'
  )
  or
  (
    p_partkey = l_partkey
    and p_brand = 'Brand#35'
    and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
    and l_quantity >= 15 and l_quantity <= 15 + 10
    and p_size between 1 and 10
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON'
  )
  or
  (
    p_partkey = l_partkey
    and p_brand = 'Brand#24'
    and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
    and l_quantity >= 26 and l_quantity <= 26 + 10
    and p_size between 1 and 15
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON'
  )
```

It has a repeated expression in Or, so we can optimize it by `(a && b) || (a && c) = a && (b || c)`.

Before optimization, this sql hangs in my local test, and the physical plan is:

![image](https://cloud.githubusercontent.com/assets/7018048/5539175/31cf38e8-8af9-11e4-95e3-336f9b3da4a4.png)

After optimization, this sql runs successfully in 20+ seconds, and its physical plan is:

![image](https://cloud.githubusercontent.com/assets/7018048/5539176/39a558e0-8af9-11e4-912b-93de94b20075.png)

This PR focuses on the second optimization and some simple ones of the first. For complex Numeric condition optimization, I will make a follow up PR.

Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>

Closes #3778 from scwf/filter1 and squashes the following commits:

58bcbc2 [scwf] minor format fix
9570211 [scwf] conflicts fix
527e6ce [scwf] minor comment improvements
5c6f134 [scwf] remove numeric optimizations and move to BooleanSimplification
546a82b [wangfei] style fix
825fa69 [wangfei] adding more tests
a001e8c [wangfei] revert pom changes
32a595b [scwf] improvement and test fix
e99a26c [wangfei] refactory And/Or optimization to make it more readable and clean
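The rewrite `(a && b) || (a && c) = a && (b || c)` can be sketched on a toy expression tree. This is an illustration under stated assumptions, not the rule in `BooleanSimplification`: the `Expr` types and the `factor` function are invented for the example, only a two-way `Or` is handled, and predicates are compared by plain equality.

```scala
object CommonFactor {
  sealed trait Expr
  final case class Pred(name: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Flattens nested Ands into the list of their conjuncts.
  def conjuncts(e: Expr): List[Expr] = e match {
    case And(l, r) => conjuncts(l) ++ conjuncts(r)
    case other     => List(other)
  }

  // Pulls conjuncts shared by both sides of an Or out in front of it.
  def factor(e: Expr): Expr = e match {
    case Or(l, r) =>
      val ls = conjuncts(l)
      val rs = conjuncts(r)
      val common = ls.filter(x => rs.contains(x))
      val lRest = ls.filterNot(x => common.contains(x))
      val rRest = rs.filterNot(x => common.contains(x))
      if (common.isEmpty) e
      // One side is entirely shared: a || (a && c) reduces to a.
      else if (lRest.isEmpty || rRest.isEmpty) common.reduce(And(_, _))
      else And(common.reduce(And(_, _)),
               Or(lRest.reduce(And(_, _)), rRest.reduce(And(_, _))))
    case other => other
  }
}
```

In the query above, the shared conjunct is `p_partkey = l_partkey`. Once it sits outside the `Or`, a planner can treat it as an equi-join key instead of evaluating the whole disjunction over a cartesian product.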
* [SPARK-5274][SQL] Reconcile Java and Scala UDFRegistration.Reynold Xin2015-01-156-70/+666
As part of SPARK-5193:

1. Removed UDFRegistration as a mixin in SQLContext and made it a field ("udf").
2. For Java UDFs, renamed dataType to returnType.
3. For Scala UDFs, added type tags.
4. Added all Java UDF registration methods to Scala's UDFRegistration.
5. Documentation

Author: Reynold Xin <rxin@databricks.com>

Closes #4056 from rxin/udf-registration and squashes the following commits:

ae9c556 [Reynold Xin] Updated example.
675a3c9 [Reynold Xin] Style fix
47c24ff [Reynold Xin] Python fix.
5f00c45 [Reynold Xin] Restore data type position in java udf and added typetags.
032f006 [Reynold Xin] [SPARK-5193][SQL] Reconcile Java and Scala UDFRegistration.
* [SPARK-5193][SQL] Tighten up HiveContext APIReynold Xin2015-01-141-35/+13
1. Removed the deprecated LocalHiveContext
2. Made private[sql] fields protected[sql] so they don't show up in javadoc.
3. Added javadoc to refreshTable.
4. Added Experimental tag to analyze command.

Author: Reynold Xin <rxin@databricks.com>

Closes #4054 from rxin/hivecontext-api and squashes the following commits:

25cc00a [Reynold Xin] Add implicit conversion back.
cbca886 [Reynold Xin] [SPARK-5193][SQL] Tighten up HiveContext API
* [SPARK-5193][SQL] Tighten up SQLContext APIReynold Xin2015-01-1410-281/+152
1. Removed 2 implicits (logicalPlanToSparkQuery and baseRelationToSchemaRDD)
2. Moved extraStrategies into ExperimentalMethods.
3. Made private methods protected[sql] so they don't show up in javadocs.
4. Removed createParquetFile.
5. Added Java version of applySchema to SQLContext.

Author: Reynold Xin <rxin@databricks.com>

Closes #4049 from rxin/sqlContext-refactor and squashes the following commits:

a326a1a [Reynold Xin] Remove createParquetFile and add applySchema for Java to SQLContext.
ecd6685 [Reynold Xin] Added baseRelationToSchemaRDD back.
4a38c9b [Reynold Xin] [SPARK-5193][SQL] Tighten up SQLContext API
* [SPARK-5235] Make SQLConf SerializableAlex Baretta2015-01-141-1/+1
| | | | | | | | | | Declare SQLConf to be serializable to fix "Task not serializable" exceptions in SparkSQL Author: Alex Baretta <alexbaretta@gmail.com> Closes #4031 from alexbaretta/SPARK-5235-SQLConf and squashes the following commits: c2103f5 [Alex Baretta] [SPARK-5235] Make SQLConf Serializable
* [SPARK-4014] Add TaskContext.attemptNumber and deprecate TaskContext.attemptIdJosh Rosen2015-01-142-8/+2
`TaskContext.attemptId` is misleadingly-named, since it currently returns a taskId, which uniquely identifies a particular task attempt within a particular SparkContext, instead of an attempt number, which conveys how many times a task has been attempted.

This patch deprecates `TaskContext.attemptId` and adds `TaskContext.taskId` and `TaskContext.attemptNumber` fields. Prior to this change, it was impossible to determine whether a task was being re-attempted (or was a speculative copy), which made it difficult to write unit tests for tasks that fail on early attempts or speculative tasks that complete faster than original tasks.

Earlier versions of the TaskContext docs suggest that `attemptId` behaves like `attemptNumber`, so there's an argument to be made in favor of changing this method's implementation. Since we've decided against making that change in maintenance branches, I think it's simpler to add better-named methods and retain the old behavior for `attemptId`; if `attemptId` behaved differently in different branches, then this would cause confusing build-breaks when backporting regression tests that rely on the new `attemptId` behavior.

Most of this patch is fairly straightforward, but there is a bit of trickiness related to Mesos tasks: since there's no field in MesosTaskInfo to encode the attemptId, I packed it into the `data` field alongside the task binary.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3849 from JoshRosen/SPARK-4014 and squashes the following commits:

89d03e0 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
5cfff05 [Josh Rosen] Introduce wrapper for serializing Mesos task launch data.
38574d4 [Josh Rosen] attemptId -> taskAttemptId in PairRDDFunctions
a180b88 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
1d43aa6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
eee6a45 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
0b10526 [Josh Rosen] Use putInt instead of putLong (silly mistake)
8c387ce [Josh Rosen] Use local with maxRetries instead of local-cluster.
cbe4d76 [Josh Rosen] Preserve attemptId behavior and deprecate it:
b2dffa3 [Josh Rosen] Address some of Reynold's minor comments
9d8d4d1 [Josh Rosen] Doc typo
1e7a933 [Josh Rosen] [SPARK-4014] Change TaskContext.attemptId to return attempt number instead of task ID.
fd515a5 [Josh Rosen] Add failing test for SPARK-4014
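Packing an attempt number into the same byte payload as the task binary can be sketched with `java.nio.ByteBuffer`. The layout below (a 4-byte integer followed by the task bytes) and the names `TaskLaunchData`, `pack`, and `unpack` are assumptions made for illustration; the commit message does not spell out the actual wire format.

```scala
import java.nio.ByteBuffer

object TaskLaunchData {
  // Assumed layout for this sketch: 4-byte attempt number, then the serialized task.
  def pack(attemptNumber: Int, serializedTask: Array[Byte]): Array[Byte] = {
    val buffer = ByteBuffer.allocate(4 + serializedTask.length)
    buffer.putInt(attemptNumber)
    buffer.put(serializedTask)
    buffer.array()
  }

  // Reads the fields back in the same order they were written.
  def unpack(data: Array[Byte]): (Int, Array[Byte]) = {
    val buffer = ByteBuffer.wrap(data)
    val attemptNumber = buffer.getInt()
    val serializedTask = new Array[Byte](buffer.remaining())
    buffer.get(serializedTask)
    (attemptNumber, serializedTask)
  }
}
```

The reader has to consume exactly as many bytes for the number as the writer produced, which is why the width used on each side (for instance `putInt` paired with `getInt`) must match.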
* [SQL] some comments fix for GROUPING SETSDaoyuan Wang2015-01-141-6/+6
| | | | | | | | Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4000 from adrian-wang/comment and squashes the following commits: 9c24fc4 [Daoyuan Wang] some comments
* [SPARK-5211][SQL]Restore HiveMetastoreTypes.toDataTypeYin Huai2015-01-142-5/+8
| | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-5211 Author: Yin Huai <yhuai@databricks.com> Closes #4026 from yhuai/SPARK-5211 and squashes the following commits: 15ee32b [Yin Huai] Remove extra line. c6c1651 [Yin Huai] Get back HiveMetastoreTypes.toDataType.
* [SPARK-5248] [SQL] move sql.types.decimal.Decimal to sql.types.DecimalDaoyuan Wang2015-01-1422-29/+13
| | | | | | | | | | | rxin follow up of #3732 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4041 from adrian-wang/decimal and squashes the following commits: aa3d738 [Daoyuan Wang] fix auto refactor 7777a58 [Daoyuan Wang] move sql.types.decimal.Decimal to sql.types.Decimal
* [SPARK-5167][SQL] Move Row into sql package and make it usable for Java.Reynold Xin2015-01-147-174/+304
Mostly just moving stuff around. This should still be source compatible since we type aliased Row previously in org.apache.spark.sql.Row.

Added the following APIs to Row:

```scala
def getMap[K, V](i: Int): scala.collection.Map[K, V]
def getJavaMap[K, V](i: Int): java.util.Map[K, V]
def getSeq[T](i: Int): Seq[T]
def getList[T](i: Int): java.util.List[T]
def getStruct(i: Int): StructType
```

Author: Reynold Xin <rxin@databricks.com>

Closes #4030 from rxin/sql-row and squashes the following commits:

6c85c29 [Reynold Xin] Fixed style violation by adding a new line to Row.scala.
82b064a [Reynold Xin] [SPARK-5167][SQL] Move Row into sql package and make it usable for Java.
* [SPARK-5123][SQL] Reconcile Java/Scala API for data types.Reynold Xin2015-01-13149-2085/+729
| | | | | | | | | | | | | | Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box. As part of the PR, I created a sql.types package and moved all type definitions there. I then removed the Java specific data type API along with a lot of the conversion code. This subsumes https://github.com/apache/spark/pull/3925 Author: Reynold Xin <rxin@databricks.com> Closes #3958 from rxin/SPARK-5123-datatype-2 and squashes the following commits: 66505cc [Reynold Xin] [SPARK-5123] Expose only one version of the data type APIs (i.e. remove the Java-specific API).
* [SPARK-5168] Make SQLConf a field rather than mixin in SQLContextReynold Xin2015-01-1333-92/+124
| | | | | | | | | | | | | This change should be binary and source backward compatible since we didn't change any user facing APIs. Author: Reynold Xin <rxin@databricks.com> Closes #3965 from rxin/SPARK-5168-sqlconf and squashes the following commits: 42eec09 [Reynold Xin] Fix default conf value. 0ef86cc [Reynold Xin] Fix constructor ordering. 4d7f910 [Reynold Xin] Properly override config. ccc8e6a [Reynold Xin] [SPARK-5168] Make SQLConf a field rather than mixin in SQLContext
* [SPARK-4912][SQL] Persistent tables for the Spark SQL data sources apiYin Huai2015-01-1314-28/+461
| | | | | | | | | | | | | | | | | | | | | | | | | | | With changes in this PR, users can persist metadata of tables created based on the data source API in metastore through DDLs. Author: Yin Huai <yhuai@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #3960 from yhuai/persistantTablesWithSchema2 and squashes the following commits: 069c235 [Yin Huai] Make exception messages user friendly. c07cbc6 [Yin Huai] Get the location of test file in a correct way. 4456e98 [Yin Huai] Test data. 5315dfc [Yin Huai] rxin's comments. 7fc4b56 [Yin Huai] Add DDLStrategy and HiveDDLStrategy to plan DDLs based on the data source API. aeaf4b3 [Yin Huai] Add comments. 06f9b0c [Yin Huai] Revert unnecessary changes. feb88aa [Yin Huai] Merge remote-tracking branch 'apache/master' into persistantTablesWithSchema2 172db80 [Yin Huai] Fix unit test. 49bf1ac [Yin Huai] Unit tests. 8f8f1a1 [Yin Huai] [SPARK-4574][SQL] Adding support for defining schema in foreign DDL commands. #3431 f47fda1 [Yin Huai] Unit tests. 2b59723 [Michael Armbrust] Set external when creating tables c00bb1b [Michael Armbrust] Don't use reflection to read options 1ea6e7b [Michael Armbrust] Don't fail when trying to uncache a table that doesn't exist 6edc710 [Michael Armbrust] Add tests. d7da491 [Michael Armbrust] First draft of persistent tables.
* [SPARK-5049][SQL] Fix ordering of partition columns in ParquetTableScanMichael Armbrust2015-01-123-18/+41
| | | | | | | | | | Followup to #3870. Props to rahulaggarwalguavus for identifying the issue. Author: Michael Armbrust <michael@databricks.com> Closes #3990 from marmbrus/SPARK-5049 and squashes the following commits: dd03e4e [Michael Armbrust] Fill in the partition values of parquet scans instead of using JoinedRow