path: root/sql
Commit message | Author | Age | Files | Lines
* [maven-release-plugin] prepare release v1.0.1-rc2 (tag: v1.0.1) | Ubuntu | 2014-07-04 | 3 | -3/+3
* [SPARK-2059][SQL] Add analysis checks | Reynold Xin | 2014-07-04 | 2 | -0/+24
This replaces #1263 with a test case.
Author: Reynold Xin <rxin@apache.org>
Author: Michael Armbrust <michael@databricks.com>
Closes #1265 from rxin/sql-analysis-error and squashes the following commits: a639e01 [Reynold Xin] Added a test case for unresolved attribute analysis. 7371e1b [Reynold Xin] Merge pull request #1263 from marmbrus/analysisChecks 448c088 [Michael Armbrust] Add analysis checks
(cherry picked from commit b3e768e154bd7175db44c3ffc3d8f783f15ab776)
Signed-off-by: Reynold Xin <rxin@apache.org>
* Update SQLConf.scala | baishuo(白硕) | 2014-07-04 | 1 | -6/+3
Use `java.util.concurrent.ConcurrentHashMap` instead of `util.Collections.synchronizedMap`.
Author: baishuo(白硕) <vc_java@hotmail.com>
Closes #1272 from baishuo/master and squashes the following commits: 51ec55d [baishuo(白硕)] Update SQLConf.scala 63da043 [baishuo(白硕)] Update SQLConf.scala 36b6dbd [baishuo(白硕)] Update SQLConf.scala 864faa0 [baishuo(白硕)] Update SQLConf.scala 593096b [baishuo(白硕)] Update SQLConf.scala 7304d9b [baishuo(白硕)] Update SQLConf.scala 843581c [baishuo(白硕)] Update SQLConf.scala 1d3e4a2 [baishuo(白硕)] Update SQLConf.scala 0740f28 [baishuo(白硕)] Update SQLConf.scala
(cherry picked from commit 0bbe61223eda3f33bbf8992d2a8f0d47813f4873)
Signed-off-by: Reynold Xin <rxin@apache.org>
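A minimal sketch of the change in spirit, assuming `SQLConf` keeps its settings in a mutable string-to-string map (names here are illustrative, not the actual field names):

```scala
import java.util.concurrent.ConcurrentHashMap

object ConfSketch {
  // Before: every read and write went through a single lock.
  // val settings = java.util.Collections.synchronizedMap(new java.util.HashMap[String, String]())

  // After: ConcurrentHashMap allows lock-free reads and finer-grained writes.
  val settings = new ConcurrentHashMap[String, String]()

  def set(key: String, value: String): Unit = settings.put(key, value)
  def get(key: String, default: String): String =
    Option(settings.get(key)).getOrElse(default)
}
```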
* [SPARK-2059][SQL] Don't throw TreeNodeException in `execution.ExplainCommand` | Cheng Lian | 2014-07-03 | 1 | -3/+6
This is a fix for the problem revealed by PR #1265. Currently `HiveComparisonSuite` ignores the output of `ExplainCommand` since the Catalyst query plan is quite different from the Hive query plan. But exceptions thrown from `CheckResolution` still break test cases. This PR catches any `TreeNodeException` and reports it as part of the query explanation. After merging this PR, PR #1265 can also be merged safely.
For a normal query:
```
scala> hql("explain select key from src").foreach(println)
...
[Physical execution plan:]
[HiveTableScan [key#9], (MetastoreRelation default, src, None), None]
```
For a wrong query with unresolved attribute(s):
```
scala> hql("explain select kay from src").foreach(println)
...
[Error occurred during query planning: ]
[Unresolved attributes: 'kay, tree:]
[Project ['kay]]
[ LowerCaseSchema ]
[  MetastoreRelation default, src, None]
```
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1294 from liancheng/safe-explain and squashes the following commits: 4318911 [Cheng Lian] Don't throw TreeNodeException in `execution.ExplainCommand`
(cherry picked from commit 544880457de556d1ad52e8cb7e1eca19da95f517)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2342] Evaluation helper's output type doesn't conform to input type | Yijie Shen | 2014-07-03 | 1 | -1/+1
The cast in the evaluation helper doesn't conform to the intention of its comment: "Those expressions are supposed to be in the same data type, and also the return type."
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #1283 from yijieshen/master and squashes the following commits: c7aaa4b [Yijie Shen] [SPARK-2342] Evaluation helper's output type doesn't conform to input type
(cherry picked from commit a9b52e5623f7fc77fca96b095f9eeaef76e35d54)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2287][SQL] Make ScalaReflection able to handle generic case classes | Takuya UESHIN | 2014-07-02 | 2 | -2/+25
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1226 from ueshin/issues/SPARK-2287 and squashes the following commits: 32ef7c3 [Takuya UESHIN] Add execution of `SHOW TABLES` before `TestHive.reset()`. 541dc8d [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2287 fac5fae [Takuya UESHIN] Remove unnecessary method receiver. d306e60 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2287 7de5706 [Takuya UESHIN] Make ScalaReflection be able to handle Generic case classes.
(cherry picked from commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2328][SQL] Add execution of `SHOW TABLES` before `TestHive.reset()` | Takuya UESHIN | 2014-07-02 | 1 | -0/+3
`PruningSuite` unfortunately runs first among the Hive tests, and `TestHive.reset()` breaks the test environment. To prevent this, we must run a query before calling reset the first time.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1268 from ueshin/issues/SPARK-2328 and squashes the following commits: 043ceac [Takuya UESHIN] Add execution of `SHOW TABLES` before `TestHive.reset()`.
(cherry picked from commit 1e2c26c83dd2e807cf0031ceca8b338a1a57cac6)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* SPARK-2186: Spark SQL DSL support for simple aggregations such as SUM and AVG | Ximo Guanter Gonzalbez | 2014-07-02 | 3 | -8/+44
**Description** This patch enables using the `.select()` function in SchemaRDD with aggregate functions such as `Sum`, `Count`, and others.
**Testing** Unit tests added.
Author: Ximo Guanter Gonzalbez <ximo@tid.es>
Closes #1211 from edrevo/add-expression-support-in-select and squashes the following commits: fe4a1e1 [Ximo Guanter Gonzalbez] Extend SQL DSL to functions e1d344a [Ximo Guanter Gonzalbez] SPARK-2186: Spark SQL DSL support for simple aggregations such as SUM and AVG
(cherry picked from commit 5c6ec94da1bacd8e65a43acb92b6721493484e7b)
Signed-off-by: Michael Armbrust <michael@databricks.com>
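A usage sketch of the extended DSL, assuming the usual `SQLContext` implicits are in scope; the case class, app name, and column names are made up for illustration:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.expressions.{Average, Count, Sum}

case class Salary(dept: String, amount: Int)

val sc = new SparkContext("local", "dsl-example")
val sqlContext = new SQLContext(sc)
import sqlContext._ // RDD-to-SchemaRDD and Symbol-to-attribute conversions

val salaries = sc.parallelize(Seq(Salary("eng", 100), Salary("eng", 120), Salary("ops", 90)))
// With this patch, aggregate expressions can be passed straight to select()/groupBy():
val total  = salaries.select(Sum('amount)).collect()
val byDept = salaries.groupBy('dept)(Average('amount), Count('amount)).collect()
```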
* Update the comments in SqlParser | CodingCat | 2014-07-01 | 1 | -1/+0
SqlParser has been case-insensitive since https://github.com/apache/spark/commit/dab5439a083b5f771d5d5b462d0d517fa8e9aaf2 was merged.
Author: CodingCat <zhunansjtu@gmail.com>
Closes #1275 from CodingCat/master and squashes the following commits: 17931cd [CodingCat] update the comments in SqlParser
(cherry picked from commit 6596392da0fc0fee89e22adfca239a3477dfcbab)
Signed-off-by: Reynold Xin <rxin@apache.org>
* Revert "[maven-release-plugin] prepare release v1.0.1-rc1"Patrick Wendell2014-06-273-3/+3
| | | | This reverts commit 7feeda3d729f9397aa15ee8750c01ef5aa601962.
* Revert "[maven-release-plugin] prepare for next development iteration"Patrick Wendell2014-06-273-3/+3
| | | | This reverts commit ea1a455a755f83f46fc8bf242410917d93d0c52c.
* [maven-release-plugin] prepare for next development iteration | Ubuntu | 2014-06-26 | 3 | -3/+3
* [maven-release-plugin] prepare release v1.0.1-rc1 | Ubuntu | 2014-06-26 | 3 | -3/+3
* [SPARK-2295][SQL] Make JavaBeans nullability stricter | Takuya UESHIN | 2014-06-26 | 1 | -19/+18
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1235 from ueshin/issues/SPARK-2295 and squashes the following commits: 201c508 [Takuya UESHIN] Make JavaBeans nullability stricter.
(cherry picked from commit 32a1ad75313472b1b098f7ec99335686d3fe4fc3)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2254][SQL] ScalaReflection should mark primitive types as non-nullable | Takuya UESHIN | 2014-06-25 | 2 | -31/+165
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1193 from ueshin/issues/SPARK-2254 and squashes the following commits: cfd6088 [Takuya UESHIN] Modify ScalaReflection.schemaFor method to return nullability of Scala Type.
(cherry picked from commit e4899a253728bfa7c78709a37a4837f74b72bd61)
Signed-off-by: Reynold Xin <rxin@apache.org>
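An illustrative sketch of the rule, not the actual `ScalaReflection` code: JVM primitives can never hold `null`, so only reference-typed fields should be reported as nullable.

```scala
// Toy schemaFor that distinguishes primitives from reference types.
case class Schema(dataType: String, nullable: Boolean)

def schemaFor(clazz: Class[_]): Schema = clazz match {
  case java.lang.Integer.TYPE    => Schema("IntegerType", nullable = false) // primitive int
  case java.lang.Double.TYPE     => Schema("DoubleType", nullable = false)  // primitive double
  case c if c == classOf[String] => Schema("StringType", nullable = true)   // reference type
  case _                         => Schema("ObjectType", nullable = true)
}

schemaFor(classOf[Int])    // Schema(IntegerType, false)
schemaFor(classOf[String]) // Schema(StringType, true)
```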
* [SPARK-2283][SQL] Reset test environment before running PruningSuite | Cheng Lian | 2014-06-25 | 1 | -0/+5
JIRA issue: [SPARK-2283](https://issues.apache.org/jira/browse/SPARK-2283)
If `PruningSuite` is run right after `HiveCompatibilitySuite`, the first test case fails because the `srcpart` table is cached in-memory by `HiveCompatibilitySuite`, but column pruning is not implemented for the `InMemoryColumnarTableScan` operator yet.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1221 from liancheng/spark-2283 and squashes the following commits: dc0b663 [Cheng Lian] SPARK-2283: reset test environment before running PruningSuite
(cherry picked from commit 7f196b009d26d4aed403b3c694f8b603601718e3)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* [BUGFIX][SQL] Should match java.math.BigDecimal when unwrapping Hive output | Cheng Lian | 2014-06-25 | 1 | -4/+4
The `BigDecimal` branch in `unwrap` matches `scala.math.BigDecimal` rather than `java.math.BigDecimal`.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1199 from liancheng/javaBigDecimal and squashes the following commits: e9bb481 [Cheng Lian] Should match java.math.BigDecimal when unwrapping Hive output
(cherry picked from commit 22036aeb1b2cac7f48cd60afea925b42a5318631)
Signed-off-by: Reynold Xin <rxin@apache.org>
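The trap is a Scala naming subtlety: an unqualified `BigDecimal` in a pattern means `scala.math.BigDecimal`, which Hive never produces. A minimal illustration of the corrected match (a sketch, not the actual `unwrap` code):

```scala
def unwrap(value: Any): Any = value match {
  // Buggy version matched scala.math.BigDecimal here, so Hive's values fell through:
  // case bd: BigDecimal => bd
  // Fixed: match the Java class Hive actually hands back, then convert.
  case bd: java.math.BigDecimal => BigDecimal(bd)
  case other => other
}

unwrap(new java.math.BigDecimal("3.14")) // now converted instead of passed through raw
```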
* [SPARK-2263][SQL] Support inserting MAP<K, V> to Hive tables | Cheng Lian | 2014-06-25 | 3 | -6/+20
JIRA issue: [SPARK-2263](https://issues.apache.org/jira/browse/SPARK-2263)
Map objects were not converted to Hive types before being inserted into Hive tables.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1205 from liancheng/spark-2263 and squashes the following commits: c7a4373 [Cheng Lian] Addressed @concretevitamin's comment 784940b [Cheng Lian] SPARK-2263: support inserting MAP<K, V> to Hive tables
(cherry picked from commit 8fade8973e5fc97f781de5344beb66b90bd6e524)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2264][SQL] Fix failing CachedTableSuite | Michael Armbrust | 2014-06-24 | 3 | -24/+25
Author: Michael Armbrust <michael@databricks.com>
Closes #1201 from marmbrus/fixCacheTests and squashes the following commits: 9d87ed1 [Michael Armbrust] Use analyzer (which runs to fixed point) instead of manually removing analysis operators.
Conflicts: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
* [SQL] Add base row updating methods for JoinedRow | Cheng Hao | 2014-06-24 | 1 | -0/+17
This will be helpful in join operators.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1187 from chenghao-intel/joinedRow and squashes the following commits: 87c19e3 [Cheng Hao] Add base row set methods for JoinedRow
(cherry picked from commit 133495d82672c3f34d40a6298cc80c31f91faf5c)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2227] Support dfs command in SQL | Reynold Xin | 2014-06-23 | 1 | -8/+6
Note that nothing gets printed to the console because we don't properly maintain session state right now. I will have a followup PR that fixes it.
Author: Reynold Xin <rxin@apache.org>
Closes #1167 from rxin/commands and squashes the following commits: 56f04f8 [Reynold Xin] [SPARK-2227] Support dfs command in SQL.
(cherry picked from commit 51c8168377a89d20d0b2d7b9a28af58593a0fe0c)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-1669][SQL] Made cacheTable idempotent | Cheng Lian | 2014-06-23 | 2 | -4/+29
JIRA issue: [SPARK-1669](https://issues.apache.org/jira/browse/SPARK-1669)
Caching the same table multiple times should end up with only one in-memory columnar representation of the table.
Before:
```
scala> loadTestTable("src")
...
scala> cacheTable("src")
...
scala> cacheTable("src")
...
scala> table("src")
...
== Query Plan ==
InMemoryColumnarTableScan [key#2,value#3], (InMemoryRelation [key#2,value#3], false, (InMemoryColumnarTableScan [key#2,value#3], (InMemoryRelation [key#2,value#3], false, (HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None))))
```
After:
```
scala> loadTestTable("src")
...
scala> cacheTable("src")
...
scala> cacheTable("src")
...
scala> table("src")
...
== Query Plan ==
InMemoryColumnarTableScan [key#2,value#3], (InMemoryRelation [key#2,value#3], false, (HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None))
```
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1183 from liancheng/spark-1669 and squashes the following commits: 68f8a20 [Cheng Lian] Removed an unused import 51bae90 [Cheng Lian] Made cacheTable idempotent
(cherry picked from commit a4bc442ca2c35444de8a33376b6f27c6c2a9003d)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL] Break hiveOperators.scala into multiple files | Reynold Xin | 2014-06-21 | 6 | -529/+610
The single file was getting very long (500+ LOC).
Author: Reynold Xin <rxin@apache.org>
Closes #1166 from rxin/hiveOperators and squashes the following commits: 5b43068 [Reynold Xin] [SQL] Break hiveOperators.scala into multiple files.
(cherry picked from commit ec935abce13b60f353236566da149c0c87bb1002)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SQL] Pass SQLContext instead of SparkContext into physical operators | Reynold Xin | 2014-06-20 | 7 | -44/+51
This makes it easier to use config options in operators.
Author: Reynold Xin <rxin@apache.org>
Closes #1164 from rxin/sqlcontext and squashes the following commits: 797b2fd [Reynold Xin] Pass SQLContext instead of SparkContext into physical operators.
(cherry picked from commit ca5d8b5904dc6dd5b691af506d3a842e508b3673)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SQL] Use hive.SessionState, not the thread-local SessionState | Aaron Davidson | 2014-06-20 | 1 | -1/+1
Note that this simply mimics lookupRelation(). I do not have a concrete notion of why this solution is necessarily more correct than SessionState.get, but SessionState.get is returning null, which is bad.
Author: Aaron Davidson <aaron@databricks.com>
Closes #1148 from aarondav/createtable and squashes the following commits: 37c3e7c [Aaron Davidson] [SQL] Use hive.SessionState, not the thread local SessionState
(cherry picked from commit 2044784915554a890ca6f8450d8403495b2ee4f3)
Signed-off-by: Reynold Xin <rxin@apache.org>
* Move ScriptTransformation into the appropriate place | Reynold Xin | 2014-06-20 | 1 | -0/+0
Author: Reynold Xin <rxin@apache.org>
Closes #1162 from rxin/script and squashes the following commits: 2c836b9 [Reynold Xin] Move ScriptTransformation into the appropriate place.
(cherry picked from commit d4c7572dba1be49e55ceb38713652e5bcf485be8)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2225] Turn HAVING without GROUP BY into WHERE | Reynold Xin | 2014-06-20 | 2 | -23/+11
@willb
Author: Reynold Xin <rxin@apache.org>
Closes #1161 from rxin/having-filter and squashes the following commits: fa8359a [Reynold Xin] [SPARK-2225] Turn HAVING without GROUP BY into WHERE.
(cherry picked from commit 0ac71d1284cd4f011d5763181cba9ecb49337b66)
Signed-off-by: Reynold Xin <rxin@apache.org>
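The rewrite in a nutshell, using the `src` test table and the `hql` shorthand that appear throughout this log (my example query, not one from the patch): a HAVING clause with no GROUP BY is just a row filter, so these two queries should now produce the same plan.

```scala
hql("SELECT key FROM src HAVING key > 100") // rewritten by the analyzer into...
hql("SELECT key FROM src WHERE key > 100")  // ...an ordinary WHERE filter
```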
* SPARK-2180: support HAVING clauses in Hive queries | William Benton | 2014-06-20 | 2 | -6/+53
This PR extends Spark's HiveQL support to handle HAVING clauses in aggregations. The HAVING test from the Hive compatibility suite doesn't appear to be runnable from within Spark, so I added a simple comparable test to `HiveQuerySuite`.
Author: William Benton <willb@redhat.com>
Closes #1136 from willb/SPARK-2180 and squashes the following commits: 3bbaf26 [William Benton] Added casts to HAVING expressions 83f1340 [William Benton] scalastyle fixes 18387f1 [William Benton] Add test for HAVING without GROUP BY b880bef [William Benton] Added semantic error for HAVING without GROUP BY 942428e [William Benton] Added test coverage for SPARK-2180. 56084cc [William Benton] Add support for HAVING clauses in Hive queries.
(cherry picked from commit 171ebb3a824a577d69443ec68a3543b27914cf6d)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2218] Rename Equals to EqualTo in Spark SQL expressions | Reynold Xin | 2014-06-20 | 11 | -40/+38
Due to the existence of scala.Equals, it is very error-prone to name the expression Equals, especially because we use a lot of partial functions and pattern matching in the optimizer. Note that this sits on top of #1144.
Author: Reynold Xin <rxin@apache.org>
Closes #1146 from rxin/equals and squashes the following commits: f8583fd [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals 326b388 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals bd19807 [Reynold Xin] Rename EqualsTo to EqualTo. 81148d1 [Reynold Xin] [SPARK-2218] rename Equals to EqualsTo in Spark SQL expressions. c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
(cherry picked from commit 2f6a835e1a039a0b1ba6e184b3350444b70f91df)
Signed-off-by: Reynold Xin <rxin@apache.org>
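A sketch of why the old name was dangerous: `scala.Equals` is always in scope (everything in the `scala` package is auto-imported), and every case class implements it, so a pattern meant to match the SQL expression could silently match almost anything if the Catalyst import was missing. The case class below is a made-up stand-in:

```scala
case class SomeExpression(name: String) // stand-in for any case class

def classify(e: Any): String = e match {
  // Without an import of Catalyst's Equals, this is scala.Equals: a trait
  // implemented by every case class, so it matches far too much.
  case _: Equals => "looks like an equality expression, but is not"
  case _         => "anything else"
}

classify(SomeExpression("oops")) // "looks like an equality expression, but is not"
```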
* [SPARK-2196][SQL] Fix nullability of CaseWhen | Takuya UESHIN | 2014-06-20 | 2 | -1/+46
`CaseWhen` should use `branches.length` to check if `elseValue` is provided or not.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1133 from ueshin/issues/SPARK-2196 and squashes the following commits: 510f12d [Takuya UESHIN] Add some tests. dc25e8d [Takuya UESHIN] Fix nullable of CaseWhen to be nullable if the elseValue is nullable. 4f049cc [Takuya UESHIN] Fix nullability of CaseWhen.
(cherry picked from commit 324952892085d1933bcf392ce8f2ced452fe741e)
Signed-off-by: Reynold Xin <rxin@apache.org>
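A sketch of the fixed rule, assuming (as the message suggests) that `CaseWhen` stores its operands as a flat sequence `Seq(cond1, val1, cond2, val2, ..., elseValue?)`; the trait here is illustrative, not Catalyst's:

```scala
trait Expr { def nullable: Boolean }

def caseWhenNullable(branches: Seq[Expr]): Boolean = {
  // Values sit at odd indices; an even-length sequence means no ELSE was given.
  val values    = branches.zipWithIndex.collect { case (e, i) if i % 2 == 1 => e }
  val elseValue = if (branches.length % 2 == 1) Some(branches.last) else None
  // Nullable if any THEN value is nullable, or the ELSE is missing or nullable.
  values.exists(_.nullable) || elseValue.forall(_.nullable)
}
```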
* [SPARK-2209][SQL] Cast shouldn't do null check twice | Reynold Xin | 2014-06-20 | 1 | -115/+159
Also took the chance to clean up Cast a little bit. Too many arrows on each line before!
Author: Reynold Xin <rxin@apache.org>
Closes #1143 from rxin/cast and squashes the following commits: dd006cb [Reynold Xin] Code review feedback. c2b88ae [Reynold Xin] [SPARK-2209][SQL] Cast shouldn't do null check twice.
(cherry picked from commit c55bbb49f7ec653f0ff635015d3bc789ca26c4eb)
Signed-off-by: Reynold Xin <rxin@apache.org>
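One way to check null exactly once, sketched under the assumption that the cleanup introduced a small helper along these lines (illustrative, not the actual code): evaluate the child once, short-circuit on null, and only then apply the conversion.

```scala
// Short-circuit on null once; the conversion itself never re-checks.
def buildCast[T](value: Any)(convert: T => Any): Any =
  if (value == null) null else convert(value.asInstanceOf[T])

buildCast[String]("42")(_.toInt) // 42
buildCast[String](null)(_.toInt) // null, with no second check inside the conversion
```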
* [SPARK-2210] Cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0) | Reynold Xin | 2014-06-19 | 2 | -2/+27
```
explain select cast(cast(key=0 as boolean) as boolean) aaa from src
```
should be
```
[Physical execution plan:]
[Project [(key#10:0 = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
```
However, it is currently
```
[Physical execution plan:]
[Project [NOT((key#10=0) = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
```
Author: Reynold Xin <rxin@apache.org>
Closes #1144 from rxin/booleancast and squashes the following commits: c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
(cherry picked from commit 61756409736a64bd42577782cb7468557fa0b642)
Signed-off-by: Reynold Xin <rxin@apache.org>
* SPARK-1293 [SQL] Parquet support for nested types | Andre Schumacher | 2014-06-19 | 14 | -384/+2102
It should be possible to import and export data stored in Parquet's columnar format that contains nested types. For example:
```java
message AddressBook {
  required binary owner;
  optional group ownerPhoneNumbers {
    repeated binary array;
  }
  optional group contacts {
    repeated group array {
      required binary name;
      optional binary phoneNumber;
    }
  }
  optional group nameToApartmentNumber {
    repeated group map {
      required binary key;
      required int32 value;
    }
  }
}
```
The example could model a type (AddressBook) that contains records made of strings (owner), lists (ownerPhoneNumbers) and a table of contacts (e.g., a list of pairs or a map that can contain null values but whose keys must not be null). The list of tasks is as follows:
Implement support for converting nested Parquet types to Spark/Catalyst types: [x] Structs, [x] Lists, [x] Maps (note: currently keys need to be Strings)
Implement import (via `parquetFile`) of nested Parquet types (first version in this PR): [x] Initial version
Implement export (via `saveAsParquetFile`): [x] Initial version
Test support for AvroParquet, etc.: [x] Initial testing of import of avro-generated Parquet data (simple + nested)
Example:
```scala
val data = TestSQLContext
  .parquetFile("input.dir")
  .toSchemaRDD
data.registerAsTable("data")
sql("SELECT owner, contacts[1].name, nameToApartmentNumber['John'] FROM data").collect()
```
Author: Andre Schumacher <andre.schumacher@iki.fi>
Author: Michael Armbrust <michael@databricks.com>
Closes #360 from AndreSchumacher/nested_parquet and squashes the following commits: 30708c8 [Andre Schumacher] Taking out AvroParquet test for now to remove Avro dependency 95c1367 [Andre Schumacher] Changes to ParquetRelation and its metadata 7eceb67 [Andre Schumacher] Review feedback 94eea3a [Andre Schumacher] Scalastyle 403061f [Andre Schumacher] Fixing some issues with tests and schema metadata b8a8b9a [Andre Schumacher] More fixes to short and byte conversion 63d1b57 [Andre Schumacher] Cleaning up and Scalastyle 88e6bdb [Andre Schumacher] Attempting to fix loss of schema 37e0a0a [Andre Schumacher] Cleaning up 14c3fd8 [Andre Schumacher] Attempting to fix Spark-Parquet schema conversion 3e1456c [Michael Armbrust] WIP: Directly serialize catalyst attributes. f7aeba3 [Michael Armbrust] [SPARK-1982] Support for ByteType and ShortType. 3104886 [Michael Armbrust] Nested Rows should be Rows, not Seqs. 3c6b25f [Andre Schumacher] Trying to reduce no-op changes wrt master 31465d6 [Andre Schumacher] Scalastyle: fixing commented out bottom de02538 [Andre Schumacher] Cleaning up ParquetTestData 2f5a805 [Andre Schumacher] Removing stripMargin from test schemas 191bc0d [Andre Schumacher] Changing to Seq for ArrayType, refactoring SQLParser for nested field extension cbb5793 [Andre Schumacher] Code review feedback 32229c7 [Andre Schumacher] Removing Row nested values and placing by generic types 0ae9376 [Andre Schumacher] Doc strings and simplifying ParquetConverter.scala a6b4f05 [Andre Schumacher] Cleaning up ArrayConverter, moving classTag to NativeType, adding NativeRow 431f00f [Andre Schumacher] Fixing problems introduced during rebase c52ff2c [Andre Schumacher] Adding native-array converter 619c397 [Andre Schumacher] Completing Map testcase 79d81d5 [Andre Schumacher] Replacing field names for array and map in WriteSupport f466ff0 [Andre Schumacher] Added ParquetAvro tests and revised Array conversion adc1258 [Andre Schumacher] Optimizing imports e99cc51 [Andre Schumacher] Fixing nested WriteSupport and adding tests 1dc5ac9 [Andre Schumacher] First version of WriteSupport for nested types d1911dc [Andre Schumacher] Simplifying ArrayType conversion f777b4b [Andre Schumacher] Scalastyle 824500c [Andre Schumacher] Adding attribute resolution for MapType b539fde [Andre Schumacher] First commit for MapType a594aed [Andre Schumacher] Scalastyle 4e25fcb [Andre Schumacher] Adding resolution of complex ArrayTypes f8f8911 [Andre Schumacher] For primitive rows fall back to more efficient converter, code reorg 6dbc9b7 [Andre Schumacher] Fixing some problems intruduced during rebase b7fcc35 [Andre Schumacher] Documenting conversions, bugfix, wrappers of Rows ee70125 [Andre Schumacher] fixing one problem with arrayconverter 98219cf [Andre Schumacher] added struct converter 5d80461 [Andre Schumacher] fixing one problem with nested structs and breaking up files 1b1b3d6 [Andre Schumacher] Fixing one problem with nested arrays ddb40d2 [Andre Schumacher] Extending tests for nested Parquet data 745a42b [Andre Schumacher] Completing testcase for nested data (Addressbook) 6125c75 [Andre Schumacher] First working nested Parquet record input 4d4892a [Andre Schumacher] First commit nested Parquet read converters aa688fe [Andre Schumacher] Adding conversion of nested Parquet schemas
(cherry picked from commit f479cf3743e416ee08e62806e1b34aff5998ac22)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2177][SQL] describe table result contains only one column | Yin Huai | 2014-06-19 | 9 | -31/+294
```
scala> hql("describe src").collect().foreach(println)
[key string None ]
[value string None ]
```
The result should contain 3 columns instead of one. This screws up JDBC or even the downstream consumer of the Scala/Java/Python APIs. I am providing a workaround. We handle a subset of describe commands in Spark SQL, which are defined by ...
```
DESCRIBE [EXTENDED] [db_name.]table_name
```
All other cases are treated as Hive native commands. Also, if we upgrade Hive to 0.13, we need to check the results of context.sessionState.isHiveServerQuery() to determine how to split the result. This method is introduced by https://issues.apache.org/jira/browse/HIVE-4545. We may want to set Hive to use JsonMetaDataFormatter for the output of a DDL statement (`set hive.ddl.output.format=json` introduced by https://issues.apache.org/jira/browse/HIVE-2822).
The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2177
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1118 from yhuai/SPARK-2177 and squashes the following commits: fd2534c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 b9b9aa5 [Yin Huai] rxin's comments. e7c4e72 [Yin Huai] Fix unit test. 656b068 [Yin Huai] 100 characters. 6387217 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 8003cf3 [Yin Huai] Generate strings with the format like Hive for unit tests. 9787fff [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 440c5af [Yin Huai] rxin's comments. f1a417e [Yin Huai] Update doc. 83adb2f [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 366f891 [Yin Huai] Add describe command. 74bd1d4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 342fdf7 [Yin Huai] Split to up to 3 parts. 725e88c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 bb8bbef [Yin Huai] Split every string in the result of a describe command.
(cherry picked from commit f397e92eb2986f4436fb9e66777fc652f91d8494)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SQL] Improve speed of InsertIntoHiveTable | Michael Armbrust | 2014-06-19 | 1 | -4/+10
Author: Michael Armbrust <michael@databricks.com>
Closes #1130 from marmbrus/noFunctional and squashes the following commits: ccdb68c [Michael Armbrust] Remove functional programming and Array allocations from fast path in InsertIntoHiveTable.
(cherry picked from commit d3b7671c1f9c1eca956fda15fa7573649fd284b3)
Signed-off-by: Reynold Xin <rxin@apache.org>
* More minor scaladoc cleanup for Spark SQL | Reynold Xin | 2014-06-19 | 3 | -23/+21
Author: Reynold Xin <rxin@apache.org>
Closes #1142 from rxin/sqlclean and squashes the following commits: 67a789e [Reynold Xin] More minor scaladoc cleanup for Spark SQL.
(cherry picked from commit 278ec8a203c7f1de2716d8284f9bdafa54eee1cb)
Signed-off-by: Reynold Xin <rxin@apache.org>
* A few minor Spark SQL Scaladoc fixes | Reynold Xin | 2014-06-19 | 6 | -61/+57
Author: Reynold Xin <rxin@apache.org>
Closes #1139 from rxin/sparksqldoc and squashes the following commits: c3049d8 [Reynold Xin] Fixed line length. 66dc72c [Reynold Xin] A few minor Spark SQL Scaladoc fixes.
(cherry picked from commit 5464e79175e2fc85e2cadf0dd7c9a45dad028326)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2191][SQL] Make sure InsertIntoHiveTable doesn't execute more than once | Michael Armbrust | 2014-06-19 | 2 | -1/+11
Author: Michael Armbrust <michael@databricks.com>
Closes #1129 from marmbrus/doubleCreateAs and squashes the following commits: 9c6d9e4 [Michael Armbrust] Fix typo. 5128fe2 [Michael Armbrust] Make sure InsertIntoHiveTable doesn't execute each time you ask for its result.
(cherry picked from commit 777c5958c4088182f9e2daba435ccb413a2f69d7)
Signed-off-by: Reynold Xin <rxin@apache.org>
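A common pattern for this guarantee, given here as an assumption about the fix rather than the actual code: memoize the side-effecting work in a `lazy val`, so asking for the result repeatedly runs the insert only once.

```scala
abstract class SideEffectingCommand {
  protected def run(): Seq[String]          // performs the actual write
  private lazy val sideEffectResult = run() // evaluated at most once, then cached
  def execute(): Seq[String] = sideEffectResult
}
```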
* [SPARK-2187] Explain should not run the optimizer twice | Reynold Xin | 2014-06-18 | 3 | -11/+15
@yhuai @marmbrus @concretevitamin
Author: Reynold Xin <rxin@apache.org>
Closes #1123 from rxin/explain and squashes the following commits: def83b0 [Reynold Xin] Update unit tests for explain. a9d3ba8 [Reynold Xin] [SPARK-2187] Explain should not run the optimizer twice.
(cherry picked from commit 640c294369f49a7602c33c7c389088aec8a316d3)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2184][SQL] AddExchange isn't idempotent | Michael Armbrust | 2014-06-18 | 3 | -6/+9
Don't bind partitioning expressions as that breaks comparison with requiredPartitioning.
Author: Michael Armbrust <michael@databricks.com>
Closes #1122 from marmbrus/fixAddExchange and squashes the following commits: 3417537 [Michael Armbrust] Don't bind partitioning expressions as that breaks comparison with requiredPartitioning.
(cherry picked from commit 5ff75c748a27bcfae71759d0e509218f0c5d0200)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2176][SQL] Extra unnecessary exchange operator in the result of an explain command | Yin Huai | 2014-06-18 | 2 | -2/+4
```
hql("explain select * from src group by key").collect().foreach(println)
[ExplainCommand [plan#27:0]]
[ Aggregate false, [key#25], [key#25,value#26]]
[  Exchange (HashPartitioning [key#25:0], 200)]
[   Exchange (HashPartitioning [key#25:0], 200)]
[    Aggregate true, [key#25], [key#25]]
[     HiveTableScan [key#25,value#26], (MetastoreRelation default, src, None), None]
```
There are two exchange operators. However, if we do not use explain...
```
hql("select * from src group by key")
res4: org.apache.spark.sql.SchemaRDD =
SchemaRDD[8] at RDD at SchemaRDD.scala:100
== Query Plan ==
Aggregate false, [key#8], [key#8,value#9]
 Exchange (HashPartitioning [key#8:0], 200)
  Aggregate true, [key#8], [key#8]
   HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
```
The plan is fine. The cause of this bug is explained below. When we create an `execution.ExplainCommand`, we use the `executedPlan` as the child of this `ExplainCommand`. But this `executedPlan` is prepared for execution again when we generate the `executedPlan` for the `ExplainCommand`. Basically, `prepareForExecution` is called twice on a physical plan. Because after `prepareForExecution` we have already bound those references (in `BoundReference`s), `AddExchange` cannot figure out we are using the same partitioning (we use `AttributeReference`s to create an `ExchangeOperator` and then those references will be changed to `BoundReference`s after `prepareForExecution` is called). So, an extra `ExchangeOperator` is inserted. I think in `CommandStrategy`, we should just use the `sparkPlan` (`sparkPlan` is the input of `prepareForExecution`) to initialize the `ExplainCommand` instead of using `executedPlan`. The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2176
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1116 from yhuai/SPARK-2176 and squashes the following commits: 197c19c [Yin Huai] Use sparkPlan to initialize a Physical Explain Command instead of using executedPlan.
(cherry picked from commit 587d32012ceeec1e80cec1878312f164cdb76ec8)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL | Yin Huai | 2014-06-17 | 18 | -35/+1262
JIRA: https://issues.apache.org/jira/browse/SPARK-2060
Programming guide: http://yhuai.github.io/site/sql-programming-guide.html
Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #999 from yhuai/newJson and squashes the following commits: 227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ce8eedd [Yin Huai] rxin's comments. bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 94ffdaa [Yin Huai] Remove "get" from method names. ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 79ea9ba [Yin Huai] Fix typos. 5428451 [Yin Huai] Newline 1f908ce [Yin Huai] Remove extra line. d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 7ea750e [Yin Huai] marmbrus's comments. 6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 83013fb [Yin Huai] Update Java Example. e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map. 6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4fbddf0 [Yin Huai] Programming guide. 9df8c5a [Yin Huai] Python API. 7027634 [Yin Huai] Java API. cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset. d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ab810b0 [Yin Huai] Make JsonRDD private. 6df0891 [Yin Huai] Apache header. 8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema. 8ffed79 [Yin Huai] Update the example. a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution. 65b87f0 [Yin Huai] Fix sampling... 8846af5 [Yin Huai] API doc. 52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 0387523 [Yin Huai] Address PR comments. 666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson a2313a6 [Yin Huai] Address PR comments. f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used. 0576406 [Yin Huai] Add Apache license header. af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD. f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
(cherry picked from commit d2f4f30b12f99358953e2781957468e2cfe3c916)
Signed-off-by: Reynold Xin <rxin@apache.org>
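Basic usage of the new API, following the programming guide linked above; the file path and field names here are made up:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local", "json-example")
val sqlContext = new SQLContext(sc)

// One JSON object per line; the schema is inferred from the data.
val people = sqlContext.jsonFile("examples/people.json") // hypothetical path
people.registerAsTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect()
```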
* [SPARK-2053][SQL] Add Catalyst expressions for CASE WHEN | Zongheng Yang | 2014-06-17 | 15 | -8/+290
JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2053
This PR adds support for the two types of CASE statements present in Hive. The first type is of the form `CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END`, with semantics like a chain of if statements. The second type is of the form `CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END`, with semantics like a switch statement on key `a`. Both forms are implemented in `CaseWhen`. [This link](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ConditionalFunctions) contains more detailed descriptions of their semantics.
Notes / open issues:
  * Please check if any implicit contracts / invariants are broken in the implementations (especially for the operators). I am not very familiar with them and I currently find them tricky to spot.
  * We should decide whether or not a non-boolean condition is allowed in a branch of `CaseWhen`. Hive throws a `SemanticException` for this situation and I think it'd be good to mimic it -- the question is where in the whole Spark SQL pipeline we should signal an exception for such a query.
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes #1055 from concretevitamin/caseWhen and squashes the following commits: 4226eb9 [Zongheng Yang] Comment. 79d26fc [Zongheng Yang] Merge branch 'master' into caseWhen caf9383 [Zongheng Yang] Update a FIXME. 9d26ab8 [Zongheng Yang] Add @transient marker. 788a0d9 [Zongheng Yang] Implement CastNulls, which fixes udf_case and udf_when. 7ef284f [Zongheng Yang] Refactors: remove redundant passes, improve toString, mark transient. f47ae7b [Zongheng Yang] Modify queries in tests to have shorter golden files. 1c1fbfc [Zongheng Yang] Cleanups per review comments. 7d2b7e2 [Zongheng Yang] Translate CaseKeyWhen to CaseWhen at parsing time. 47d406a [Zongheng Yang] Do toArray once and lazily outside of eval(). bb3d109 [Zongheng Yang] Update scaladoc of a method. aea3195 [Zongheng Yang] Fix bug that branchesArr is not used; remove unused import. 96870a8 [Zongheng Yang] Turn off scalastyle for some comments. 7392f3a [Zongheng Yang] Minor cleanup. 2cf08bb [Zongheng Yang] Merge branch 'master' into caseWhen 9f84b40 [Zongheng Yang] Add golden outputs from Hive. db51a85 [Zongheng Yang] Add allCondBooleans check; uncomment tests. 3f9ef0a [Zongheng Yang] Cleanups and bug fixes (mainly in eval() and resolved). be54bc8 [Zongheng Yang] Rewrite eval() to a low-level implementation. Separate two CASE stmts. f2bcb9d [Zongheng Yang] WIP 5906f75 [Zongheng Yang] WIP efd019b [Zongheng Yang] eval() and toString() bug fixes. 7d81e95 [Zongheng Yang] Clean up resolved. a31d782 [Zongheng Yang] Finish up Case.
(cherry picked from commit e243c5ffacd70ecadaf5c91668955dcc8141e060)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-2164][SQL] Allow Hive UDF on columns of type struct | Xi Liu | 2014-06-17 | 3 | -0/+130
Author: Xi Liu <xil@conviva.com>
Closes #796 from xiliu82/sqlbug and squashes the following commits: 328dfc4 [Xi Liu] [Spark SQL] remove a temporary function after test 354386a [Xi Liu] [Spark SQL] add test suite for UDF on struct 8fc6f51 [Xi Liu] [SparkSQL] allow UDF on struct
(cherry picked from commit f5a4049e534da3c55e1b495ce34155236dfb6dee)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* Minor fix: made "EXPLAIN" output play well with JDBC output format | Cheng Lian | 2014-06-16 | 3 | -4/+4
Fixed the broken JDBC output. Test from Shark `beeline`:
```
beeline> !connect jdbc:hive2://localhost:10000/
scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/
Enter username for jdbc:hive2://localhost:10000/: lian
Enter password for jdbc:hive2://localhost:10000/:
Connected to: Hive (version 0.12.0)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> explain select * from src;
+-------------------------------------------------------------------------------+
|                                     plan                                      |
+-------------------------------------------------------------------------------+
| ExplainCommand [plan#2:0]                                                     |
| HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None   |
+-------------------------------------------------------------------------------+
2 rows selected (1.386 seconds)
```
Before this change, the output looked something like this:
```
+-------------------------------------------------------------------------------+
|                                     plan                                      |
+-------------------------------------------------------------------------------+
| ExplainCommand [plan#2:0] HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None |
+-------------------------------------------------------------------------------+
```
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1097 from liancheng/multiLineExplain and squashes the following commits: eb37967 [Cheng Lian] Made output of "EXPLAIN" play well with JDBC output format
(cherry picked from commit 237b96bc59ab1b54c31d06a5260cd77e1eb96116)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SQL][SPARK-2094] Follow-up of PR #1071 for the Java API | Cheng Lian | 2014-06-16 | 5 | -74/+124
Updated `JavaSQLContext` and `JavaHiveContext` similarly to what we've done to `SQLContext` and `HiveContext` in PR #1071. Added a corresponding test case for the Spark SQL Java API.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1085 from liancheng/spark-2094-java and squashes the following commits: 29b8a51 [Cheng Lian] Avoided instantiating JavaSparkContext & JavaHiveContext to workaround test failure 92bb4fb [Cheng Lian] Marked test cases in JavaHiveQLSuite with "ignore" 22aec97 [Cheng Lian] Follow up of PR #1071 for Java API
(cherry picked from commit 273afcb254fb5384204c56bdcb3b9b760bcfab3f)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2010] Support for nested data in PySpark SQL | Kan Zhang | 2014-06-16 | 1 | -10/+19
JIRA issue: https://issues.apache.org/jira/browse/SPARK-2010
This PR adds support for nested collection types in PySpark SQL, including array, dict, list, set, and tuple. Example:
```
>>> from array import array
>>> from pyspark.sql import SQLContext
>>> sqlCtx = SQLContext(sc)
>>> rdd = sc.parallelize([
...     {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...     {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...     {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]
True
>>> rdd = sc.parallelize([
...     {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...     {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == \
...     [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...     {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]
True
```
Author: Kan Zhang <kzhang@apache.org>
Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits: 1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO 504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL
(cherry picked from commit 4fdb491775bb9c4afa40477dc0069ff6fcadfe25)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SQL] Support transforming TreeNodes with Option children | Michael Armbrust | 2014-06-15 | 2 | -1/+45
Thanks go to @marmbrus for his implementation.
Author: Michael Armbrust <michael@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes #1074 from concretevitamin/option-treenode and squashes the following commits: ef27b85 [Zongheng Yang] Merge pull request #1 from marmbrus/pr/1074 73133c2 [Michael Armbrust] TreeNodes can't be inner classes. ab78420 [Zongheng Yang] Add a test. 2ccb721 [Michael Armbrust] Add support for transformation of optional children.
(cherry picked from commit 269fc62b20ee5f9cd60a8f133c29f662d17071b1)
Signed-off-by: Michael Armbrust <michael@databricks.com>
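A sketch of the core idea, simplified from how Catalyst walks a node's constructor arguments via `productIterator` (the function below is illustrative, not the actual TreeNode code): when rebuilding children after a rule fires, the walk must now descend into `Option`-wrapped children too.

```scala
// Apply a rewrite to each child argument, including optional ones.
def mapChildArg(arg: Any, rule: Any => Any): Any = arg match {
  case Some(child) => Some(rule(child)) // new: transform inside Option children
  case None        => None
  case seq: Seq[_] => seq.map(rule)     // children stored in sequences
  case other       => rule(other)       // plain child arguments
}
```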
* [SPARK-2079] Support batching when serializing SchemaRDD to Python | Kan Zhang | 2014-06-14 | 1 | -7/+2
Added batching with a default batch size of 10 in SchemaRDD.javaToPython.
Author: Kan Zhang <kzhang@apache.org>
Closes #1023 from kanzhang/SPARK-2079 and squashes the following commits: 2d1915e [Kan Zhang] [SPARK-2079] Add batching in SchemaRDD.javaToPython 19b0c09 [Kan Zhang] [SPARK-2079] Removing unnecessary wrapping in SchemaRDD.javaToPython
(cherry picked from commit 2550533a28382664f8fd294b2caa494d12bfc7c1)
Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-2137][SQL] Timestamp UDFs broken | Yin Huai | 2014-06-13 | 19 | -2/+17
https://issues.apache.org/jira/browse/SPARK-2137
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1081 from yhuai/SPARK-2137 and squashes the following commits: c04f910 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2137 205f17b [Yin Huai] Make Hive UDF wrapper support Timestamp.
(cherry picked from commit 891968509105d8d8cf5a608ad9473aeeed747089)
Signed-off-by: Reynold Xin <rxin@apache.org>