aboutsummaryrefslogtreecommitdiff
path: root/sql
Commit message (Collapse)AuthorAgeFilesLines
* [SQL] Improve DataFrame API error reportingReynold Xin2015-02-0218-356/+840
| | | | | | | | | | | | | | | | | | 1. Throw UnsupportedOperationException if a Column is not computable. 2. Perform eager analysis on DataFrame so we can catch errors when they happen (not when an action is run). Author: Reynold Xin <rxin@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #4296 from rxin/col-computability and squashes the following commits: 6527b86 [Reynold Xin] Merge pull request #8 from davies/col-computability fd92bc7 [Reynold Xin] Merge branch 'master' into col-computability f79034c [Davies Liu] fix python tests 5afe1ff [Reynold Xin] Fix scala test. 17f6bae [Reynold Xin] Various fixes. b932e86 [Reynold Xin] Added eager analysis for error reporting. e6f00b8 [Reynold Xin] [SQL][API] ComputableColumn vs IncomputableColumn
* Revert "[SPARK-4508] [SQL] build native date type to conform behavior to Hive"Patrick Wendell2015-02-0249-191/+112
| | | | This reverts commit 1646f89d967913ee1f231d9606f8502d13c25804.
* [SPARK-5514] DataFrame.collect should call executeCollectReynold Xin2015-02-021-1/+1
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4313 from rxin/SPARK-5514 and squashes the following commits: e34e91b [Reynold Xin] [SPARK-5514] DataFrame.collect should call executeCollect
* [SPARK-5195][sql]Update HiveMetastoreCatalog.scala(override the ↵seayi2015-02-022-0/+15
| | | | | | | | | | | | | | | | | | | | | | | MetastoreRelation's sameresult method only compare databasename and table name) override the MetastoreRelation's sameresult method only compare databasename and table name because in previous : cache table t1; select count(*) from t1; it will read data from memory but the sql below will not,instead it read from hdfs: select count(*) from t1 t; because cache data is keyed by logical plan and compare with sameResult ,so when table with alias the same table 's logicalplan is not the same logical plan with out alias so modify the sameresult method only compare databasename and table name Author: seayi <405078363@qq.com> Author: Michael Armbrust <michael@databricks.com> Closes #3898 from seayi/branch-1.2 and squashes the following commits: 8f0c7d2 [seayi] Update CachedTableSuite.scala a277120 [seayi] Update HiveMetastoreCatalog.scala 8d910aa [seayi] Update HiveMetastoreCatalog.scala
* [SPARK-4508] [SQL] build native date type to conform behavior to HiveDaoyuan Wang2015-02-0249-112/+191
| | | | | | | | | | | | | | | | | | Store daysSinceEpoch as an Int value(4 bytes) to represent DateType, instead of using java.sql.Date(8 bytes as Long) in catalyst row. This ensures the same comparison behavior of Hive and Catalyst. Subsumes #3381 I thinks there are already some tests in JavaSQLSuite, and for python it will not affect python's datetime class. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3732 from adrian-wang/datenative and squashes the following commits: 0ed0fdc [Daoyuan Wang] fix test data a2fdd4e [Daoyuan Wang] getDate c37832b [Daoyuan Wang] row to catalyst f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion 024c9a6 [Daoyuan Wang] clean some import order d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally 374abd5 [Daoyuan Wang] spark native date type support
* [SPARK-5212][SQL] Add support of schema-less, custom field delimiter and ↵Liang-Chi Hsieh2015-02-0217-34/+5360
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SerDe for HiveQL transform This pr adds the support of schema-less syntax, custom field delimiter and SerDe for HiveQL's transform. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4014 from viirya/schema_less_trans and squashes the following commits: ac2d1fe [Liang-Chi Hsieh] Refactor codes for comments. a137933 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into schema_less_trans aa10fbd [Liang-Chi Hsieh] Add Hive golden answer files again. 575f695 [Liang-Chi Hsieh] Add Hive golden answer files for new unit tests. a422562 [Liang-Chi Hsieh] Use createQueryTest for unit tests and remove unnecessary imports. ccb71e3 [Liang-Chi Hsieh] Refactor codes for comments. 37bd391 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into schema_less_trans 6000889 [Liang-Chi Hsieh] Wrap input and output schema into ScriptInputOutputSchema. 21727f7 [Liang-Chi Hsieh] Move schema-less output to proper place. Use multilines instead of a long line SQL. 9a6dc04 [Liang-Chi Hsieh] setRecordReaderID is introduced in 0.13.1, use reflection API to call it. 7a14f31 [Liang-Chi Hsieh] Fix bug. 799b5e1 [Liang-Chi Hsieh] Call getSerializedClass instead of using Text. be2c3fc [Liang-Chi Hsieh] Fix style. 32d3046 [Liang-Chi Hsieh] Add SerDe support. ab22f7b [Liang-Chi Hsieh] Fix style. 7a48e42 [Liang-Chi Hsieh] Add support of custom field delimiter. b1729d9 [Liang-Chi Hsieh] Fix style. ccee49e [Liang-Chi Hsieh] Add unit test. f561c37 [Liang-Chi Hsieh] Add support of schema-less script transformation.
* [SPARK-5465] [SQL] Fixes filter push-down for Parquet data sourceCheng Lian2015-02-011-8/+10
| | | | | | | | | | | | | | Not all Catalyst filter expressions can be converted to Parquet filter predicates. We should try to convert each individual predicate and then collect those convertible ones. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4255) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4255 from liancheng/spark-5465 and squashes the following commits: 14ccd37 [Cheng Lian] Fixes filter push-down for Parquet data source
* [SPARK-5262] [SPARK-5244] [SQL] add coalesce in SQLParser and widen types ↵Daoyuan Wang2015-02-016-0/+65
| | | | | | | | | | | | | | | for parameters of coalesce I'll add test case in #4040 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4057 from adrian-wang/coal and squashes the following commits: 4d0111a [Daoyuan Wang] address Yin's comments c393e18 [Daoyuan Wang] fix rebase conflicts e47c03a [Daoyuan Wang] add coalesce in parser c74828d [Daoyuan Wang] cast types for coalesce
* [SPARK-5196][SQL] Support `comment` in Create Table Field DDLOopsOutOfMemory2015-02-012-3/+28
| | | | | | | | | | | Support `comment` in create a table field. __CREATE TEMPORARY TABLE people(name string `comment` "the name of a person")__ Author: OopsOutOfMemory <victorshengli@126.com> Closes #3999 from OopsOutOfMemory/meta_comment and squashes the following commits: 39150d4 [OopsOutOfMemory] add comment and refine test suite
* [Minor][SQL] Little refactor DataFrame related codesLiang-Chi Hsieh2015-02-012-3/+3
| | | | | | | | | | | | | | | Simplify some codes related to DataFrame. * Calling `toAttributes` instead of a `map`. * Original `createDataFrame` creates the `StructType` and its attributes in a redundant way. Refactored it to create `StructType` and call `toAttributes` on it directly. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4298 from viirya/refactor_df and squashes the following commits: 1d61c64 [Liang-Chi Hsieh] Revert it. f36efb5 [Liang-Chi Hsieh] Relax the constraint of toDataFrame. 2c9f370 [Liang-Chi Hsieh] Just refactor DataFrame codes.
* [SQL] remove redundant field "childOutput" from execution.Aggregate, use ↵kai2015-01-301-6/+2
| | | | | | | | | | child.output instead Author: kai <kaizeng@eecs.berkeley.edu> Closes #4291 from kai-zeng/aggregate-fix and squashes the following commits: 78658ef [kai] remove redundant field "childOutput"
* [SPARK-5504] [sql] convertToCatalyst should support nested arraysJoseph K. Bradley2015-01-302-3/+11
| | | | | | | | | | | | | | After the recent refactoring, convertToCatalyst in ScalaReflection does not recurse on Arrays. It should. The test suite modification made the test fail before the fix in ScalaReflection. The fix makes the test suite succeed. CC: marmbrus Author: Joseph K. Bradley <joseph@databricks.com> Closes #4295 from jkbradley/SPARK-5504 and squashes the following commits: 6b7276d [Joseph K. Bradley] Fixed issue in ScalaReflection.convertToCatalyst with Arrays with non-primitive types. Modified test suite so it failed before the fix and works after the fix.
* [SPARK-5457][SQL] Add missing DSL for ApproxCountDistinct.Takuya UESHIN2015-01-301-0/+5
| | | | | | | | | | | Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #4250 from ueshin/issues/SPARK-5457 and squashes the following commits: 3c05e59 [Takuya UESHIN] Remove parameter to use default value of ApproxCountDistinct. faea19d [Takuya UESHIN] Use overload instead of default value for Java support. d1cca38 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-5457 663d43d [Takuya UESHIN] Add missing DSL for ApproxCountDistinct.
* [SQL] Support df("*") to select all columns in a data frame.Reynold Xin2015-01-297-29/+54
| | | | | | | | | | | This PR makes Star a trait, and provides two implementations: UnresolvedStar (used for *, tblName.*) and ResolvedStar (used for df("*")). Author: Reynold Xin <rxin@databricks.com> Closes #4283 from rxin/df-star and squashes the following commits: c9cba3e [Reynold Xin] Removed mapFunction in UnresolvedStar. 1a3a1d7 [Reynold Xin] [SQL] Support df("*") to select all columns in a data frame.
* [SPARK-5462] [SQL] Use analyzed query plan in DataFrame.apply()Josh Rosen2015-01-292-3/+9
| | | | | | | | | | This patch changes DataFrame's `apply()` method to use an analyzed query plan when resolving column names. This fixes a bug where `apply` would throw "invalid call to qualifiers on unresolved object" errors when called on DataFrames constructed via `SQLContext.sql()`. Author: Josh Rosen <joshrosen@databricks.com> Closes #4282 from JoshRosen/SPARK-5462 and squashes the following commits: b9e6da2 [Josh Rosen] [SPARK-5462] Use analyzed query plan in DataFrame.apply().
* [SQL] DataFrame API improvementsReynold Xin2015-01-296-16/+209
| | | | | | | | | | | | | | 1. Added Dsl.column in case Dsl.col is shadowed. 2. Allow using String to specify the target data type in cast. 3. Support sorting on multiple columns using column names. 4. Added Java API test file. Author: Reynold Xin <rxin@databricks.com> Closes #4280 from rxin/dsl1 and squashes the following commits: 33ecb7a [Reynold Xin] Add the Java test. d06540a [Reynold Xin] [SQL] DataFrame API improvements.
* [SPARK-4296][SQL] Trims aliases when resolving and checking aggregate ↵Yin Huai2015-01-291-0/+15
| | | | | | | | | | | | | | | | | expressions I believe that SPARK-4296 has been fixed by 3684fd21e1ffdc0adaad8ff6b31394b637e866ce. I am adding tests based #3910 (change the udf to HiveUDF instead). Author: Yin Huai <yhuai@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #4010 from yhuai/SPARK-4296-yin and squashes the following commits: 6343800 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-4296-yin 6cfadd2 [Yin Huai] Actually, this issue has been fixed by 3684fd21e1ffdc0adaad8ff6b31394b637e866ce. d42b707 [Yin Huai] Update comment. 8b3a274 [Yin Huai] Since expressions in grouping expressions can have aliases, which can be used by the outer query block, revert this change. 443538d [Cheng Lian] Trims aliases when resolving and checking aggregate expressions
* [SPARK-5373][SQL] Literal in agg grouping expressions leads to incorrect resultwangfei2015-01-292-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | `select key, count( * ) from src group by key, 1` will get the wrong answer. e.g. for this table ``` val testData2 = TestSQLContext.sparkContext.parallelize( TestData2(1, 1) :: TestData2(1, 2) :: TestData2(2, 1) :: TestData2(2, 2) :: TestData2(3, 1) :: TestData2(3, 2) :: Nil, 2).toSchemaRDD testData2.registerTempTable("testData2") ``` result of `SELECT a, count(1) FROM testData2 GROUP BY a, 1` is ``` [1,1] [2,2] [3,1] ``` Author: wangfei <wangfei1@huawei.com> Closes #4169 from scwf/agg-bug and squashes the following commits: 05751db [wangfei] fix bugs when literal in agg grouping expressioons
* [SPARK-5367][SQL] Support star expression in udfwangfei2015-01-292-5/+15
| | | | | | | | | | | | | | | | | | now spark sql does not support star expression in udf, run the following sql by spark-sql will get error ``` select concat(*) from src ``` Author: wangfei <wangfei1@huawei.com> Author: scwf <wangfei1@huawei.com> Closes #4163 from scwf/udf-star and squashes the following commits: 9db7b39 [wangfei] addressed comments da1da09 [scwf] minor fix f87b5f9 [scwf] added test case 587bf7e [wangfei] compile fix eb93c16 [wangfei] fix star resolve issue in udf
* [SPARK-4786][SQL]: Parquet filter pushdown for castable typesYash Datta2015-01-292-2/+51
| | | | | | | | | | | | Enable parquet filter pushdown of castable types like short, byte that can be cast to integer Author: Yash Datta <Yash.Datta@guavus.com> Closes #4156 from saucam/filter_short and squashes the following commits: a403979 [Yash Datta] SPARK-4786: Fix styling issues d029866 [Yash Datta] SPARK-4786: Add test case cb2e0d9 [Yash Datta] SPARK-4786: Parquet filter pushdown for castable types
* [SPARK-5309][SQL] Add support for dictionaries in PrimitiveConverter for ↵Michael Davies2015-01-292-12/+47
| | | | | | | | | | | | | | | | | | | | Strin... ...gs. Parquet Converters allow developers to take advantage of dictionary encoding of column data to reduce Column Binary decoding. The Spark PrimitiveConverter was not using that API and consequently for String columns that used dictionary compression repeated Binary to String conversions for the same String. In measurements this could account for over 25% of entire query time. For example a 500M row table split across 16 blocks was aggregated and summed in a litte under 30s before this change and a little under 20s after the change. Author: Michael Davies <Michael.BellDavies@gmail.com> Closes #4187 from MickDavies/SPARK-5309-2 and squashes the following commits: 327287e [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings. 33c002c [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.
* [SPARK-5429][SQL] Use javaXML plan serialization for Hive golden answers on ↵Liang-Chi Hsieh2015-01-291-0/+2
| | | | | | | | | | | | Hive 0.13.1 I found that running `HiveComparisonTest.createQueryTest` to generate Hive golden answer files on Hive 0.13.1 would throw KryoException. I am not sure if this can be reproduced by others. Since Hive 0.13.0, Kryo plan serialization is introduced to replace javaXML as default plan serialization format. This is a quick fix to set hive configuration to use javaXML serialization. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4223 from viirya/fix_hivetest and squashes the following commits: 97a8760 [Liang-Chi Hsieh] Use javaXML plan serialization.
* [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.Reynold Xin2015-01-2918-134/+35
| | | | | | | | | | | Turns out Scala does generate static methods for ones defined in a companion object. Finally no need to separate api.java.dsl and api.scala.dsl. Author: Reynold Xin <rxin@databricks.com> Closes #4276 from rxin/dsl and squashes the following commits: 30aa611 [Reynold Xin] Add all files. 1a9d215 [Reynold Xin] [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
* [SQL] Various DataFrame DSL update.Reynold Xin2015-01-297-9/+94
| | | | | | | | | | | | | | | | | 1. Added foreach, foreachPartition, flatMap to DataFrame. 2. Added col() in dsl. 3. Support renaming columns in toDataFrame. 4. Support type inference on arrays (in addition to Seq). 5. Updated mllib to use the new DSL. Author: Reynold Xin <rxin@databricks.com> Closes #4260 from rxin/sql-dsl-update and squashes the following commits: 73466c1 [Reynold Xin] Fixed LogisticRegression. Also added better error message for resolve. fab3ccc [Reynold Xin] Bug fix. d31fcd2 [Reynold Xin] Style fix. 62608c4 [Reynold Xin] [SQL] Various DataFrame DSL update.
* [SPARK-5445][SQL] Made DataFrame dsl usable in JavaReynold Xin2015-01-2822-277/+298
| | | | | | | | | | | | | | | | | Also removed the literal implicit transformation since it is pretty scary for API design. Instead, created a new lit method for creating literals. This doesn't break anything from a compatibility perspective because Literal was added two days ago. Author: Reynold Xin <rxin@databricks.com> Closes #4241 from rxin/df-docupdate and squashes the following commits: c0f4810 [Reynold Xin] Fix Python merge conflict. 094c7d7 [Reynold Xin] Minor style fix. Reset Python tests. 3c89f4a [Reynold Xin] Package. dfe6962 [Reynold Xin] Updated Python aggregate. 5dd4265 [Reynold Xin] Made dsl Java callable. 14b3c27 [Reynold Xin] Fix literal expression for symbols. 68b31cb [Reynold Xin] Literal. 4cfeb78 [Reynold Xin] [SPARK-5097][SQL] Address DataFrame code review feedback.
* [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.Reynold Xin2015-01-2826-194/+208
| | | | | | | | | | | | | | and [SPARK-5448][SQL] Make CacheManager a concrete class and field in SQLContext Author: Reynold Xin <rxin@databricks.com> Closes #4242 from rxin/sqlCleanup and squashes the following commits: e351cb2 [Reynold Xin] Fixed toDataFrame. 6545c42 [Reynold Xin] More changes. 728c017 [Reynold Xin] [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
* [SPARK-5097][SQL] Test cases for DataFrame expressions.Reynold Xin2015-01-277-73/+315
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4235 from rxin/df-tests1 and squashes the following commits: f341db6 [Reynold Xin] [SPARK-5097][SQL] Test cases for DataFrame expressions.
* [SPARK-5097][SQL] DataFrameReynold Xin2015-01-2750-1073/+2494
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities. TODOs: With the exception of Python support, other tasks can be done in separate, follow-up PRs. - [ ] Audit of the API - [ ] Documentation - [ ] More test cases to cover the new API - [x] Python support - [ ] Type alias SchemaRDD Author: Reynold Xin <rxin@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #4173 from rxin/df1 and squashes the following commits: 0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1 23b4427 [Reynold Xin] Mima. 828f70d [Reynold Xin] Merge pull request #7 from davies/df 257b9e6 [Davies Liu] add repartition 6bf2b73 [Davies Liu] fix collect with UDT and tests e971078 [Reynold Xin] Missing quotes. b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now. a728bf2 [Reynold Xin] Example rename. e8aa3d3 [Reynold Xin] groupby -> groupBy. 9662c9e [Davies Liu] improve DataFrame Python API 4ae51ea [Davies Liu] python API for dataframe 1e5e454 [Reynold Xin] Fixed a bug with symbol conversion. 2ca74db [Reynold Xin] Couple minor fixes. ea98ea1 [Reynold Xin] Documentation & literal expressions. 2b22684 [Reynold Xin] Got rid of IntelliJ problems. 02bbfbc [Reynold Xin] Tightening imports. ffbce66 [Reynold Xin] Fixed compilation error. 59b6d8b [Reynold Xin] Style violation. b85edfb [Reynold Xin] ALS. 8c37f0a [Reynold Xin] Made MLlib and examples compile 6d53134 [Reynold Xin] Hive module. d35efd5 [Reynold Xin] Fixed compilation error. ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite. 66d5ef1 [Reynold Xin] SQLContext minor patch. c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
* [SPARK-5202] [SQL] Add hql variable substitution supportCheng Hao2015-01-212-2/+22
| | | | | | | | | | | | | https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution This is a block issue for the CLI user, it impacts the existed hql scripts from Hive. Author: Cheng Hao <hao.cheng@intel.com> Closes #4003 from chenghao-intel/substitution and squashes the following commits: bb41fd6 [Cheng Hao] revert the removed the implicit conversion af7c31a [Cheng Hao] add hql variable substitution support
* [SQL] [Minor] Remove deprecated parquet testsCheng Lian2015-01-213-1289/+212
| | | | | | | | | | | | | | | | This PR removes the deprecated `ParquetQuerySuite`, renamed `ParquetQuerySuite2` to `ParquetQuerySuite`, and refactored changes introduced in #4115 to `ParquetFilterSuite` . It is a follow-up of #3644. Notice that test cases in the old `ParquetQuerySuite` have already been well covered by other test suites introduced in #3644. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4116) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4116 from liancheng/remove-deprecated-parquet-tests and squashes the following commits: f73b8f9 [Cheng Lian] Removes deprecated Parquet test suite
* Revert "[SPARK-5244] [SQL] add coalesce() in sql parser"Josh Rosen2015-01-212-11/+0
| | | | This reverts commit 812d3679f5f97df7b667cbc3365a49866ebc02d5.
* [SPARK-5009] [SQL] Long keyword support in SQL ParsersCheng Hao2015-01-218-81/+128
| | | | | | | | | | | * The `SqlLexical.allCaseVersions` will cause `StackOverflowException` if the key word is too long, the patch will fix that by normalizing all of the keywords in `SqlLexical`. * And make a unified SparkSQLParser for sharing the common code. Author: Cheng Hao <hao.cheng@intel.com> Closes #3926 from chenghao-intel/long_keyword and squashes the following commits: 686660f [Cheng Hao] Support Long Keyword and Refactor the SQLParsers
* [SPARK-5244] [SQL] add coalesce() in sql parserDaoyuan Wang2015-01-212-0/+11
| | | | | | | | Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4040 from adrian-wang/coalesce and squashes the following commits: 0ac8e8f [Daoyuan Wang] add coalesce() in sql parser
* [SPARK-5323][SQL] Remove Row's Seq inheritance.Reynold Xin2015-01-2047-956/+1018
| | | | | | | | | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4115 from rxin/row-seq and squashes the following commits: e33abd8 [Reynold Xin] Fixed compilation error. cceb650 [Reynold Xin] Python test fixes, and removal of WrapDynamic. 0334a52 [Reynold Xin] mkString. 9cdeb7d [Reynold Xin] Hive tests. 15681c2 [Reynold Xin] Fix more test cases. ea9023a [Reynold Xin] Fixed a catalyst test. c5e2cb5 [Reynold Xin] Minor patch up. b9cab7c [Reynold Xin] [SPARK-5323][SQL] Remove Row's Seq inheritance.
* [SPARK-5287][SQL] Add defaultSizeOf to every data type.Yin Huai2015-01-205-48/+201
| | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-5287 This PR only add `defaultSizeOf` to data types and make those internal type classes `protected[sql]`. I will use another PR to cleanup the type hierarchy of data types. Author: Yin Huai <yhuai@databricks.com> Closes #4081 from yhuai/SPARK-5287 and squashes the following commits: 90cec75 [Yin Huai] Update unit test. e1c600c [Yin Huai] Make internal classes protected[sql]. 7eaba68 [Yin Huai] Add `defaultSize` method to data types. fd425e0 [Yin Huai] Add all native types to NativeType.defaultSizeOf.
* [SQL][Minor] Refactors deeply nested FP style code in BooleanSimplificationCheng Lian2015-01-202-37/+57
| | | | | | | | | | | | | | | | This is a follow-up of #4090. The original deeply nested `reduceOption` code is hard to grasp. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4091) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4091 from liancheng/refactor-boolean-simplification and squashes the following commits: cd8860b [Cheng Lian] Improves `compareConditions` to handle more subtle cases 1bf3258 [Cheng Lian] Avoids converting predicate sets to lists e833ca4 [Cheng Lian] Refactors deeply nested FP style code
* [SQL][minor] Add a log4j file for catalyst test.Reynold Xin2015-01-201-0/+28
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4117 from rxin/catalyst-test-log4j and squashes the following commits: 8ad610b [Reynold Xin] [SQL][minor] Add a log4j file for catalyst test.
* [SPARK-5286][SQL] Fail to drop an invalid table when using the data source APIYin Huai2015-01-192-0/+18
| | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-5286 Author: Yin Huai <yhuai@databricks.com> Closes #4076 from yhuai/SPARK-5286 and squashes the following commits: 6b69ed1 [Yin Huai] Catch all exception when we try to uncache a query.
* [SPARK-5284][SQL] Insert into Hive throws NPE when a inner complex type ↵Yin Huai2015-01-192-9/+54
| | | | | | | | | | | | field has a null value JIRA: https://issues.apache.org/jira/browse/SPARK-5284 Author: Yin Huai <yhuai@databricks.com> Closes #4077 from yhuai/SPARK-5284 and squashes the following commits: fceacd6 [Yin Huai] Check if a value is null when the field has a complex type.
* [SQL] fix typo in class descriptionJacky Li2015-01-181-3/+3
| | | | | | | | | | Author: Jacky Li <jacky.likun@gmail.com> Closes #4100 from jackylk/patch-9 and squashes the following commits: b13b9d6 [Jacky Li] Update SQLConf.scala 4d3f83d [Jacky Li] Update SQLConf.scala fcc8c85 [Jacky Li] [SQL] fix typo in class description
* [SQL][minor] Put DataTypes.java in java dir.Reynold Xin2015-01-181-0/+0
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4097 from rxin/javarename and squashes the following commits: c5ce96a [Reynold Xin] [SQL][minor] Put DataTypes.java in java dir.
* [SPARK-5279][SQL] Use java.math.BigDecimal as the exposed Decimal type.Reynold Xin2015-01-1827-77/+101
| | | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4092 from rxin/bigdecimal and squashes the following commits: 27b08c9 [Reynold Xin] Fixed test. 10cb496 [Reynold Xin] [SPARK-5279][SQL] Use java.math.BigDecimal as the exposed Decimal type.
* [SQL][Minor] Added comments and examples to explain BooleanSimplificationReynold Xin2015-01-171-83/+94
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4090 from rxin/booleanSimplification and squashes the following commits: 68c8986 [Reynold Xin] [SQL][Minor] Added comments and examples to explain BooleanSimplification.
* [SPARK-4937][SQL] Comment for the newly optimization rules in ↵scwf2015-01-171-2/+16
| | | | | | | | | | | | | | `BooleanSimplification` Follow up of #3778 /cc rxin Author: scwf <wangfei1@huawei.com> Closes #4086 from scwf/commentforspark-4937 and squashes the following commits: aaf89f6 [scwf] code style issue 2d3406e [scwf] added comment for spark-4937
* [SQL][minor] Improved Row documentation.Reynold Xin2015-01-171-52/+114
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #4085 from rxin/row-doc and squashes the following commits: f77cb27 [Reynold Xin] [SQL][minor] Improved Row documentation.
* [SPARK-5193][SQL] Remove Spark SQL Java-specific API.Reynold Xin2015-01-1615-1350/+46
| | | | | | | | | | | | | | | | | | | | | | | | | After the following patches, the main (Scala) API is now usable for Java users directly. https://github.com/apache/spark/pull/4056 https://github.com/apache/spark/pull/4054 https://github.com/apache/spark/pull/4049 https://github.com/apache/spark/pull/4030 https://github.com/apache/spark/pull/3965 https://github.com/apache/spark/pull/3958 Author: Reynold Xin <rxin@databricks.com> Closes #4065 from rxin/sql-java-api and squashes the following commits: b1fd860 [Reynold Xin] Fix Mima 6d86578 [Reynold Xin] Ok one more attempt in fixing Python... e8f1455 [Reynold Xin] Fix Python again... 3e53f91 [Reynold Xin] Fixed Python. 83735da [Reynold Xin] Fix BigDecimal test. e9f1de3 [Reynold Xin] Use scala BigDecimal. 500d2c4 [Reynold Xin] Fix Decimal. ba3bfa2 [Reynold Xin] Updated javadoc for RowFactory. c4ae1c5 [Reynold Xin] [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
* [SPARK-4937][SQL] Adding optimization to simplify the And, Or condition in ↵scwf2015-01-163-90/+131
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | spark sql Adding optimization to simplify the And/Or condition in spark sql. There are two kinds of Optimization 1 Numeric condition optimization, such as: a < 3 && a > 5 ---- False a < 1 || a > 0 ---- True a > 3 && a > 5 => a > 5 (a < 2 || b > 5) && a < 2 => a < 2 2 optimizing the some query from a cartesian product into equi-join, such as this sql (one of hive-testbench): ``` select sum(l_extendedprice* (1 - l_discount)) as revenue from lineitem, part where ( p_partkey = l_partkey and p_brand = 'Brand#32' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 7 and l_quantity <= 7 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#35' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 15 and l_quantity <= 15 + 10 and p_size between 1 and 10 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#24' and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >= 26 and l_quantity <= 26 + 10 and p_size between 1 and 15 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) ``` It has a repeated expression in Or, so we can optimize it by ``` (a && b) || (a && c) = a && (b || c)``` Before optimization, this sql hang in my locally test, and the physical plan is: ![image](https://cloud.githubusercontent.com/assets/7018048/5539175/31cf38e8-8af9-11e4-95e3-336f9b3da4a4.png) After optimization, this sql run successfully in 20+ seconds, and its physical plan is: ![image](https://cloud.githubusercontent.com/assets/7018048/5539176/39a558e0-8af9-11e4-912b-93de94b20075.png) This PR focus on the second optimization and some simple ones of the first. For complex Numeric condition optimization, I will make a follow up PR. Author: scwf <wangfei1@huawei.com> Author: wangfei <wangfei1@huawei.com> Closes #3778 from scwf/filter1 and squashes the following commits: 58bcbc2 [scwf] minor format fix 9570211 [scwf] conflicts fix 527e6ce [scwf] minor comment improvements 5c6f134 [scwf] remove numeric optimizations and move to BooleanSimplification 546a82b [wangfei] style fix 825fa69 [wangfei] adding more tests a001e8c [wangfei] revert pom changes 32a595b [scwf] improvement and test fix e99a26c [wangfei] refactory And/Or optimization to make it more readable and clean
* [SPARK-5274][SQL] Reconcile Java and Scala UDFRegistration.Reynold Xin2015-01-156-70/+666
| | | | | | | | | | | | | | | | | | | | As part of SPARK-5193: 1. Removed UDFRegistration as a mixin in SQLContext and made it a field ("udf"). 2. For Java UDFs, renamed dataType to returnType. 3. For Scala UDFs, added type tags. 4. Added all Java UDF registration methods to Scala's UDFRegistration. 5. Documentation Author: Reynold Xin <rxin@databricks.com> Closes #4056 from rxin/udf-registration and squashes the following commits: ae9c556 [Reynold Xin] Updated example. 675a3c9 [Reynold Xin] Style fix 47c24ff [Reynold Xin] Python fix. 5f00c45 [Reynold Xin] Restore data type position in java udf and added typetags. 032f006 [Reynold Xin] [SPARK-5193][SQL] Reconcile Java and Scala UDFRegistration.
* [SPARK-5193][SQL] Tighten up HiveContext APIReynold Xin2015-01-141-35/+13
| | | | | | | | | | | | | | 1. Removed the deprecated LocalHiveContext 2. Made private[sql] fields protected[sql] so they don't show up in javadoc. 3. Added javadoc to refreshTable. 4. Added Experimental tag to analyze command. Author: Reynold Xin <rxin@databricks.com> Closes #4054 from rxin/hivecontext-api and squashes the following commits: 25cc00a [Reynold Xin] Add implicit conversion back. cbca886 [Reynold Xin] [SPARK-5193][SQL] Tighten up HiveContext API
* [SPARK-5193][SQL] Tighten up SQLContext APIReynold Xin2015-01-1410-281/+152
| | | | | | | | | | | | | | | | 1. Removed 2 implicits (logicalPlanToSparkQuery and baseRelationToSchemaRDD) 2. Moved extraStrategies into ExperimentalMethods. 3. Made private methods protected[sql] so they don't show up in javadocs. 4. Removed createParquetFile. 5. Added Java version of applySchema to SQLContext. Author: Reynold Xin <rxin@databricks.com> Closes #4049 from rxin/sqlContext-refactor and squashes the following commits: a326a1a [Reynold Xin] Remove createParquetFile and add applySchema for Java to SQLContext. ecd6685 [Reynold Xin] Added baseRelationToSchemaRDD back. 4a38c9b [Reynold Xin] [SPARK-5193][SQL] Tighten up SQLContext API