path: root/sql
Commit message · Author · Age · Files · Lines
...
* [SPARK-5709] [SQL] Add EXPLAIN support in DataFrame API for debugging purpose · Cheng Hao · 2015-02-10 · 5 files · -10/+32
Author: Cheng Hao <hao.cheng@intel.com> Closes #4496 from chenghao-intel/df_explain and squashes the following commits: 552aa58 [Cheng Hao] Add explain support for DF
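A rough sketch of the added hook (assuming a `sqlContext` in scope and a hypothetical table `t`; `explain(true)` follows the 1.3-era API):

```scala
val df = sqlContext.sql("SELECT key, count(*) FROM t GROUP BY key")  // `t` is hypothetical
df.explain()      // print the physical plan without executing the query
df.explain(true)  // extended output: parsed, analyzed, optimized, and physical plans
```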
* [SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns · Davies Liu · 2015-02-10 · 8 files · -33/+111
Deprecates inferSchema() and applySchema(); use createDataFrame() instead, which takes an optional `schema` to create a DataFrame from an RDD. The `schema` can be a StructType or a list of column names. Author: Davies Liu <davies@databricks.com> Closes #4498 from davies/create and squashes the following commits: 08469c1 [Davies Liu] remove Scala/Java API for now c80a7a9 [Davies Liu] fix hive test d1bd8f2 [Davies Liu] cleanup applySchema 9526e97 [Davies Liu] createDataFrame from RDD with columns
* [SPARK-5683] [SQL] Avoid creating multiple JSON generators · Cheng Hao · 2015-02-10 · 3 files · -16/+29
Author: Cheng Hao <hao.cheng@intel.com> Closes #4468 from chenghao-intel/json and squashes the following commits: aeb7801 [Cheng Hao] avoid multiple json generator created
* [SQL] Add an exception for analysis errors. · Michael Armbrust · 2015-02-10 · 5 files · -17/+46
Also starts error reporting from the bottom of the plan, so we surface the first error instead of the topmost one. Author: Michael Armbrust <michael@databricks.com> Closes #4439 from marmbrus/analysisException and squashes the following commits: 45862a0 [Michael Armbrust] fix hive test a773bba [Michael Armbrust] Merge remote-tracking branch 'origin/master' into analysisException f88079f [Michael Armbrust] update more cases fede90a [Michael Armbrust] newline fbf4bc3 [Michael Armbrust] move to sql 6235db4 [Michael Armbrust] [SQL] Add an exception for analysis errors.
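A minimal sketch of what this enables for callers, assuming the new exception is `org.apache.spark.sql.AnalysisException` and using a hypothetical missing column:

```scala
import org.apache.spark.sql.AnalysisException

try {
  sqlContext.sql("SELECT no_such_column FROM t").collect()  // `t` is hypothetical
} catch {
  // Analysis errors can now be caught specifically rather than as generic runtime exceptions.
  case e: AnalysisException => println(s"Query is invalid: ${e.getMessage}")
}
```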
* [SPARK-5658][SQL] Finalize DDL and write support APIs · Yin Huai · 2015-02-10 · 23 files · -344/+1116
https://issues.apache.org/jira/browse/SPARK-5658 Author: Yin Huai <yhuai@databricks.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #4446 from yhuai/writeSupportFollowup and squashes the following commits: f3a96f7 [Yin Huai] davies's comments. 225ff71 [Yin Huai] Use Scala TestHiveContext to initialize the Python HiveContext in Python tests. 2306f93 [Yin Huai] Style. 2091fcd [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup 537e28f [Yin Huai] Correctly clean up temp data. ae4649e [Yin Huai] Fix Python test. 609129c [Yin Huai] Doc format. 92b6659 [Yin Huai] Python doc and other minor updates. cbc717f [Yin Huai] Rename dataSourceName to source. d1c12d3 [Yin Huai] No need to delete the duplicate rule since it has been removed in master. 22cfa70 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup d91ecb8 [Yin Huai] Fix test. 4c76d78 [Yin Huai] Simplify APIs. 3abc215 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup 0832ce4 [Yin Huai] Fix test. 98e7cdb [Yin Huai] Python style. 2bf44ef [Yin Huai] Python APIs. c204967 [Yin Huai] Format a10223d [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup 9ff97d8 [Yin Huai] Add SaveMode to saveAsTable. 9b6e570 [Yin Huai] Update doc. c2be775 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup 99950a2 [Yin Huai] Use Java enum for SaveMode. 4679665 [Yin Huai] Remove duplicate rule. 77d89dc [Yin Huai] Update doc. e04d908 [Yin Huai] Move import and add (Scala-specific) to scala APIs. cf5703d [Yin Huai] Add checkAnswer to Java tests. 7db95ff [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup 6dfd386 [Yin Huai] Add java test. f2f33ef [Yin Huai] Fix test. e702386 [Yin Huai] Apache header. b1e9b1b [Yin Huai] Format. ed4e1b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup af9e9b3 [Yin Huai] DDL and write support API followup. 2a6213a [Yin Huai] Update API names. e6a0b77 [Yin Huai] Update test. 43bae01 [Yin Huai] Remove createTable from HiveContext. 5ffc372 [Yin Huai] Add more load APIs to SQLContext. 5390743 [Yin Huai] Add more save APIs to DataFrame.
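A sketch of the finalized write API as described by the squashed commits ("Add SaveMode to saveAsTable", "Add more save APIs to DataFrame"); the table and path are hypothetical and the signatures approximate the 1.3-era API:

```scala
import org.apache.spark.sql.SaveMode

// SaveMode is a Java enum controlling behavior when the target already exists.
df.saveAsTable("events", SaveMode.Append)              // append to an existing table
df.save("/tmp/events", "parquet", SaveMode.Overwrite)  // write through the data source API
```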
* [SQL] Make Options in the data source API CREATE TABLE statements optional. · Yin Huai · 2015-02-10 · 2 files · -6/+5
Users no longer need to put an empty `Options()` clause in a CREATE TABLE statement when no options are provided. Author: Yin Huai <yhuai@databricks.com> Closes #4515 from yhuai/makeOptionsOptional and squashes the following commits: 1a898d3 [Yin Huai] Make options optional.
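A sketch of the change, using a hypothetical data source class name:

```scala
// Before: a source that needs no options still required an empty clause.
sqlContext.sql("CREATE TEMPORARY TABLE t USING com.example.MySource OPTIONS ()")
// After this change, the OPTIONS clause can simply be omitted.
sqlContext.sql("CREATE TEMPORARY TABLE t2 USING com.example.MySource")
```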
* [SPARK-5725] [SQL] Fixes ParquetRelation2.equals · Cheng Lian · 2015-02-10 · 1 file · -0/+1
Author: Cheng Lian <lian@databricks.com> Closes #4513 from liancheng/spark-5725 and squashes the following commits: bf6a087 [Cheng Lian] Fixes ParquetRelation2.equals
* [SQL][Minor] correct some comments · Sheng, Li · 2015-02-11 · 1 file · -1/+1
Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com> Author: OopsOutOfMemory <victorshengli@126.com> Closes #4508 from OopsOutOfMemory/cmt and squashes the following commits: d8a68c6 [Sheng, Li] Update ddl.scala f24aeaf [OopsOutOfMemory] correct style
* [SPARK-5686][SQL] Add show current roles command in HiveQl · OopsOutOfMemory · 2015-02-10 · 1 file · -0/+1
Adds support for `SHOW CURRENT ROLES`. Author: OopsOutOfMemory <victorshengli@126.com> Closes #4471 from OopsOutOfMemory/show_current_role and squashes the following commits: 1c6b210 [OopsOutOfMemory] add show current roles
* [SQL] Add toString to DataFrame/Column · Michael Armbrust · 2015-02-10 · 7 files · -7/+85
Author: Michael Armbrust <michael@databricks.com> Closes #4436 from marmbrus/dfToString and squashes the following commits: 8a3c35f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into dfToString b72a81b [Michael Armbrust] add toString
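A sketch of the new behavior (table and columns hypothetical): `toString` is assumed to render a schema summary rather than trigger execution, which makes DataFrames safe to log:

```scala
val people = sqlContext.table("people")  // hypothetical registered table
println(people.toString)                 // e.g. "[name: string, age: int]", no job runs
println(people("age").toString)          // a Column prints its expression, not data
```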
* [SPARK-5592][SQL] java.net.URISyntaxException when inserting data into a partitioned table · wangfei · 2015-02-10 · 2 files · -3/+25
The following SQL raises a URISyntaxException:

```
create table sc as select * from (
  select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows)
  union all
  select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows)
  union all
  select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows)
) s;
create table sc_part (key string) partitioned by (ts string) stored as rcfile;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table sc_part partition(ts) select * from sc;
```

java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26
  at org.apache.hadoop.fs.Path.initialize(Path.java:206)
  at org.apache.hadoop.fs.Path.<init>(Path.java:172)
  at org.apache.hadoop.fs.Path.<init>(Path.java:94)
  at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230)
  at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
  at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
  at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
  at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
  at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.getLocalFileWriter(hiveWriterContainers.scala:243)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:113)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26
  at java.net.URI.checkPath(URI.java:1804)
  at java.net.URI.<init>(URI.java:752)
  at org.apache.hadoop.fs.Path.initialize(Path.java:203)

Author: wangfei <wangfei1@huawei.com> Author: Fei Wang <wangfei1@huawei.com> Closes #4368 from scwf/SPARK-5592 and squashes the following commits: aa55ef4 [Fei Wang] comments addressed f8f8bb1 [wangfei] added test case f24624f [wangfei] Merge branch 'master' of https://github.com/apache/spark into SPARK-5592 9998177 [wangfei] added test case ea81daf [wangfei] fix URISyntaxException
* [SPARK-5716] [SQL] Support TOK_CHARSETLITERAL in HiveQl · Daoyuan Wang · 2015-02-10 · 7 files · -0/+8
Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4502 from adrian-wang/utf8 and squashes the following commits: 4d7b0ee [Daoyuan Wang] remove useless import 606f981 [Daoyuan Wang] support TOK_CHARSETLITERAL in HiveQl
* [SPARK-5700] [SQL] [Build] Bumps jets3t to 0.9.3 for hadoop-2.3 and hadoop-2.4 profiles · Cheng Lian · 2015-02-10 · 1 file · -3/+0
This is a follow-up PR for #4454 and #4484. JetS3t 0.9.2 contains a log4j.properties file inside the artifact and breaks our tests (see SPARK-5696). This is fixed in 0.9.3. This PR also reverts hotfix changes introduced in #4484. The reason is that asking users to configure HiveThriftServer2 logging configurations in hive-log4j.properties can be unintuitive. Author: Cheng Lian <lian@databricks.com> Closes #4499 from liancheng/spark-5700 and squashes the following commits: 4f020c7 [Cheng Lian] Bumps jets3t to 0.9.3 for hadoop-2.3 and hadoop-2.4 profiles
* SPARK-5239 [CORE] JdbcRDD throws "java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z" · Sean Owen · 2015-02-10 · 1 file · -3/+3
This is a completion of https://github.com/apache/spark/pull/4033, which was withdrawn for some reason. Author: Sean Owen <sowen@cloudera.com> Closes #4470 from srowen/SPARK-5239.2 and squashes the following commits: 2398bde [Sean Owen] Avoid use of JDBC4-only isClosed()
* [SQL] Remove the duplicated code · Cheng Hao · 2015-02-09 · 1 file · -5/+0
Author: Cheng Hao <hao.cheng@intel.com> Closes #4494 from chenghao-intel/tiny_code_change and squashes the following commits: 450dfe7 [Cheng Hao] remove the duplicated code
* [SPARK-5469] restructure pyspark.sql into multiple files · Davies Liu · 2015-02-09 · 1 file · -1/+1
All the DataTypes moved into pyspark.sql.types. The changes can be tracked by `--find-copies-harder -M25`:

```
davies@localhost:~/work/spark/python$ git diff --find-copies-harder -M25 --numstat master..
2    5     python/docs/pyspark.ml.rst
0    3     python/docs/pyspark.mllib.rst
10   2     python/docs/pyspark.sql.rst
1    1     python/pyspark/mllib/linalg.py
21   14    python/pyspark/{mllib => sql}/__init__.py
14   2108  python/pyspark/{sql.py => sql/context.py}
10   1772  python/pyspark/{sql.py => sql/dataframe.py}
7    6     python/pyspark/{sql_tests.py => sql/tests.py}
8    1465  python/pyspark/{sql.py => sql/types.py}
4    2     python/run-tests
1    1     sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
```

Also use `git blame -C -C python/pyspark/sql/context.py` to track the history. Author: Davies Liu <davies@databricks.com> Closes #4479 from davies/sql and squashes the following commits: 1b5f0a5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sql 2b2b983 [Davies Liu] restructure pyspark.sql
* [SPARK-5648][SQL] support "alter ... unset tblproperties("key")" · DoingDone9 · 2015-02-09 · 1 file · -0/+2
Makes HiveContext support `alter ... unset tblproperties("key")`, e.g. `alter view viewName unset tblproperties("k")` and `alter table tableName unset tblproperties("k")`. Author: DoingDone9 <799203320@qq.com> Closes #4424 from DoingDone9/unset and squashes the following commits: 6dd8bee [DoingDone9] support "alter ... unset tblproperties("key")"
* [SPARK-2096][SQL] support dot notation on array of struct · Wenchen Fan · 2015-02-09 · 5 files · -22/+53
~~The rule is simple: if you want `a.b` to work, then `a` must be some level of nested array of struct (level 0 means just a StructType), and the result of `a.b` is the same level of nested array of b-type. An optimization: the resolve chain looks like `Attribute -> GetItem -> GetField -> GetField ...`, so we could transmit the nested array information between `GetItem` and `GetField` to avoid repeated computation of `innerDataType` and `containsNullList` of that nested array.~~ marmbrus Could you take a look? To evaluate `a.b` where `a` is an array of struct, `a.b` means getting field `b` from each element of `a` and returning the results as an array. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #2405 from cloud-fan/nested-array-dot and squashes the following commits: 08a228a [Wenchen Fan] support dot notation on array of struct
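A sketch under a hypothetical schema, where `employees` is an `array<struct<name:string, salary:double>>` column of a table `dept`:

```scala
// With this change, `employees.name` extracts `name` from every element of the
// array, returning an array<string> per row instead of failing to resolve.
sqlContext.sql("SELECT employees.name FROM dept").collect()
```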
* [SPARK-5614][SQL] Predicate pushdown through Generate. · Lu Yan · 2015-02-09 · 2 files · -1/+87
Currently, Catalyst's rules cannot push predicates through "Generate" nodes. Furthermore, partition pruning in HiveTableScan cannot be applied to queries involving "Generate", which makes such queries very inefficient. In practice, the new rule finds patterns like

```scala
Filter(predicate, Generate(generator, _, _, _, grandChild))
```

and splits the predicate into two parts, depending on whether each conjunct references a column generated by the Generate node. A new Filter is created for the conjuncts that can be pushed beneath the Generate node; if nothing is left for the original Filter, it is removed. For example, the physical plan for the query

```sql
select len, bk from s_server lateral view explode(len_arr) len_table as len where len > 5 and day = '20150102';
```

where 'day' is a partition column in the metastore, looks like this in the current version of Spark SQL:

> Project [len, bk]
> > Filter ((len > "5") && "(day = "20150102")")
> > > Generate explode(len_arr), true, false
> > > > HiveTableScan [bk, len_arr, day], (MetastoreRelation default, s_server, None), None

But theoretically the plan should be:

> Project [len, bk]
> > Filter (len > "5")
> > > Generate explode(len_arr), true, false
> > > > HiveTableScan [bk, len_arr, day], (MetastoreRelation default, s_server, None), Some(day = "20150102")

where the partition pruning predicate is pushed down to the HiveTableScan node. Author: Lu Yan <luyan02@baidu.com> Closes #4394 from ianluyan/ppd and squashes the following commits: a67dce9 [Lu Yan] Fix English grammar. 7cea911 [Lu Yan] Revised based on @marmbrus's opinions ffc59fc [Lu Yan] [SPARK-5614][SQL] Predicate pushdown through Generate.
* [SPARK-5696] [SQL] [HOTFIX] Asks HiveThriftServer2 to re-initialize log4j using Hive configurations · Cheng Lian · 2015-02-09 · 1 file · -0/+3
In this way, log4j configurations overridden by jets3t-0.9.2.jar can in turn be overridden by Hive's default log4j configurations. This might not be the best solution for this issue, since it requires users to use `hive-log4j.properties` rather than `log4j.properties` to initialize `HiveThriftServer2` logging configurations, which can be confusing. The main purpose of this PR is to fix the Jenkins PR build. Author: Cheng Lian <lian@databricks.com> Closes #4484 from liancheng/spark-5696 and squashes the following commits: df83956 [Cheng Lian] Hot fix: asks HiveThriftServer2 to re-initialize log4j using Hive configurations
* [SQL] Code cleanup. · Yin Huai · 2015-02-09 · 1 file · -3/+0
I added an unnecessary line of code in https://github.com/apache/spark/commit/13531dd97c08563e53dacdaeaf1102bdd13ef825. My bad. Let's delete it. Author: Yin Huai <yhuai@databricks.com> Closes #4482 from yhuai/unnecessaryCode and squashes the following commits: 3645af0 [Yin Huai] Code cleanup.
* [SQL] Add some missing DataFrame functions. · Michael Armbrust · 2015-02-09 · 6 files · -21/+51
- as with a `Symbol`
- distinct
- sqlContext.emptyDataFrame
- move add/remove col out of RDDApi section

Author: Michael Armbrust <michael@databricks.com> Closes #4437 from marmbrus/dfMissingFuncs and squashes the following commits: 2004023 [Michael Armbrust] Add missing functions
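A sketch of the listed additions, assuming a `df` DataFrame in scope:

```scala
val aliased = df.as('people)             // alias a DataFrame with a Symbol
val unique  = df.distinct                // drop duplicate rows
val empty   = sqlContext.emptyDataFrame  // a DataFrame with no rows and no columns
```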
* [SPARK-5675][SQL] XyzType companion object should subclass XyzType · Reynold Xin · 2015-02-09 · 1 file · -12/+73
Otherwise, the following will always return false in Java:

```scala
dataType instanceof StringType
```

Author: Reynold Xin <rxin@databricks.com> Closes #4463 from rxin/type-companion-object and squashes the following commits: 04d5d8d [Reynold Xin] Comment. 976e11e [Reynold Xin] [SPARK-5675][SQL]StringType case object should be subclass of StringType class
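A minimal sketch of the pattern with simplified stand-ins (not Spark's actual definitions):

```scala
abstract class DataType
// The companion case object now extends the class of the same name,
// so Java's `dataType instanceof XyzType` holds for the singleton.
abstract class XyzType extends DataType
case object XyzType extends XyzType
```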
* [SPARK-5472][SQL] Fix Scala code style · Hung Lin · 2015-02-08 · 2 files · -36/+41
Fix Scala code style. Author: Hung Lin <hung@zoomdata.com> Closes #4464 from hunglin/SPARK-5472 and squashes the following commits: ef7a3b3 [Hung Lin] SPARK-5472: fix scala style
* [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in tabular format. · Reynold Xin · 2015-02-08 · 8 files · -21/+151
An example:

```
year  month  AVG('Adj Close)  MAX('Adj Close)
1980  12     0.503218         0.595103
1981  01     0.523289         0.570307
1982  02     0.436504         0.475256
1983  03     0.410516         0.442194
1984  04     0.450090         0.483521
```

Author: Reynold Xin <rxin@databricks.com> Closes #4416 from rxin/SPARK-5643 and squashes the following commits: d0e0d6e [Reynold Xin] [SQL] Minor update to data source and statistics documentation. 269da83 [Reynold Xin] Updated isLocal comment. 2cf3c27 [Reynold Xin] Moved logic into optimizer. 1a04d8b [Reynold Xin] [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in columnar format.
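A sketch of the usage; the aggregate helpers and column names are assumptions for illustration:

```scala
// Print the first rows of a DataFrame to stdout in the tabular form shown above.
df.show()
// e.g. a summary like the example output (avg/max assumed in scope):
df.groupBy("year", "month").agg(avg("adjClose"), max("adjClose")).show()
```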
* [SQL] Set sessionState in QueryExecution. · Yin Huai · 2015-02-08 · 1 file · -0/+5
This PR sets the SessionState in HiveContext's QueryExecution, so we can make sure that SessionState.get returns the SessionState every time. Author: Yin Huai <yhuai@databricks.com> Closes #4445 from yhuai/setSessionState and squashes the following commits: 769c9f1 [Yin Huai] Remove unused import. 439f329 [Yin Huai] Try again. 427a0c9 [Yin Huai] Set SessionState everytime when we create a QueryExecution in HiveContext. a3b7793 [Yin Huai] Set sessionState when dealing with CreateTableAsSelect.
* [SQL] [Minor] HiveParquetSuite was disabled by mistake, re-enable them · Cheng Lian · 2015-02-06 · 2 files · -4/+10
Author: Cheng Lian <lian@databricks.com> Closes #4440 from liancheng/parquet-oops and squashes the following commits: f21ede4 [Cheng Lian] HiveParquetSuite was disabled by mistake, re-enable them.
* [SQL] Use TestSQLContext in Java tests · Michael Armbrust · 2015-02-06 · 2 files · -7/+6
Sometimes tests were failing due to the creation of multiple `SparkContext`s in a single JVM. Author: Michael Armbrust <michael@databricks.com> Closes #4441 from marmbrus/javaTests and squashes the following commits: 657b1e0 [Michael Armbrust] [SQL] Use TestSQLContext in Java tests
* [SPARK-5278][SQL] Introduce UnresolvedGetField and complete the check of ambiguous reference to fields · Wenchen Fan · 2015-02-06 · 12 files · -88/+84
When the `GetField` chain (`a.b.c.d...`) is interrupted by `GetItem`, as in `a.b[0].c.d...`, the check of ambiguous reference to fields is broken. The reason is that for something like `a.b[0].c.d`, we first parse it to `GetField(GetField(GetItem(Unresolved("a.b"), 0), "c"), "d")`. Then in `LogicalPlan#resolve`, we resolve `"a.b"` and build a `GetField` chain from the bottom (the relation). But for the two outer `GetField`s, we have to resolve them in `Analyzer` or do it lazily in `GetField` (checking the child's data type, searching for the needed field, etc.), which is similar to what we have already done in `LogicalPlan#resolve`. So the fix in this PR simply copies the same logic from `LogicalPlan#resolve` to `Analyzer`, which is simple and quick, but I do suggest introducing `UnresolvedGetField` as I explained in https://github.com/apache/spark/pull/2405. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #4068 from cloud-fan/simple and squashes the following commits: a6857b5 [Wenchen Fan] fix import order 8411c40 [Wenchen Fan] use UnresolvedGetField
* [SQL][Minor] Remove cache keyword in SqlParser · wangfei · 2015-02-06 · 1 file · -1/+0
The cache keyword is already defined in `SparkSQLParser`, and Catalyst's `SqlParser` is a more general parser that should not cover keywords tied to the underlying compute engine, so remove the cache keyword from `SqlParser`. Author: wangfei <wangfei1@huawei.com> Closes #4393 from scwf/remove-cache-keyword and squashes the following commits: 10ade16 [wangfei] remove cache keyword in sql parser
* [SQL][HiveConsole][DOC] HiveConsole `correct hiveconsole imports` · OopsOutOfMemory · 2015-02-06 · 1 file · -1/+8
PR #4330 contained some mistakes; this corrects the HiveConsole imports so it works correctly now. Author: OopsOutOfMemory <victorshengli@126.com> Closes #4389 from OopsOutOfMemory/doc and squashes the following commits: 843eed9 [OopsOutOfMemory] correct hiveconsole imports
* [SPARK-5595][SPARK-5603][SQL] Add a rule to do PreInsert type casting and field renaming and invalidating in memory cache after INSERT · Yin Huai · 2015-02-06 · 10 files · -7/+227
This PR adds a rule to the Analyzer that adds pre-insert data type casting and field renaming to the select clause in an `INSERT INTO/OVERWRITE` statement. Also, with this change, we always invalidate our in-memory data cache after inserting into a BaseRelation. cc marmbrus liancheng Author: Yin Huai <yhuai@databricks.com> Closes #4373 from yhuai/insertFollowUp and squashes the following commits: 08237a7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertFollowUp 316542e [Yin Huai] Doc update. c9ccfeb [Yin Huai] Revert a unnecessary change. 84aecc4 [Yin Huai] Address comments. 1951fe1 [Yin Huai] Merge remote-tracking branch 'upstream/master' c18da34 [Yin Huai] Invalidate cache after insert. 727f21a [Yin Huai] Preinsert casting and renaming.
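A sketch of the effect, with hypothetical tables: `t` is a data source table with an `i INT` column, and `src` is any existing table:

```scala
// The new analyzer rule rewrites the SELECT clause to cast '1' to INT and to
// rename the output column to match t's schema before the insert runs; any
// in-memory cache of `t` is invalidated once the INSERT completes.
sqlContext.sql("INSERT OVERWRITE TABLE t SELECT '1' FROM src")
```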
* [SPARK-5324][SQL] Results of describe can't be queried · OopsOutOfMemory · 2015-02-06 · 4 files · -8/+61
Makes the code below work:

```
sql("DESCRIBE test").registerTempTable("describeTest")
sql("SELECT * FROM describeTest").collect()
```

Author: OopsOutOfMemory <victorshengli@126.com> Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com> Closes #4249 from OopsOutOfMemory/desc_query and squashes the following commits: 6fee13d [OopsOutOfMemory] up-to-date e71430a [Sheng, Li] Update HiveOperatorQueryableSuite.scala 3ba1058 [OopsOutOfMemory] change to default argument aac7226 [OopsOutOfMemory] Merge branch 'master' into desc_query 68eb6dd [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query 354ad71 [OopsOutOfMemory] query describe command d541a35 [OopsOutOfMemory] refine test suite e1da481 [OopsOutOfMemory] refine test suite a780539 [OopsOutOfMemory] Merge branch 'desc_query' of github.com:OopsOutOfMemory/spark into desc_query 0015f82 [OopsOutOfMemory] code style dd0aaef [OopsOutOfMemory] code style c7d606d [OopsOutOfMemory] rename test suite 75f2342 [OopsOutOfMemory] refine code and test suite f942c9b [OopsOutOfMemory] initial 11559ae [OopsOutOfMemory] code style c5fdecf [OopsOutOfMemory] code style aeaea5f [OopsOutOfMemory] rename test suite ac2c3bb [OopsOutOfMemory] refine code and test suite 544573e [OopsOutOfMemory] initial
* [SPARK-5619][SQL] Support 'show roles' in HiveContext · q00251598 · 2015-02-06 · 1 file · -0/+1
Author: q00251598 <qiyadong@huawei.com> Closes #4397 from watermen/SPARK-5619 and squashes the following commits: f819b6c [q00251598] Support show roles in HiveContext.
* [SPARK-5640] Synchronize ScalaReflection where necessary · Tobias Schlatter · 2015-02-06 · 1 file · -2/+3
Author: Tobias Schlatter <tobias@meisch.ch> Closes #4431 from gzm0/sync-scala-refl and squashes the following commits: c5da21e [Tobias Schlatter] [SPARK-5640] Synchronize ScalaReflection where necessary
* [SPARK-5650][SQL] Support optional 'FROM' clause · Liang-Chi Hsieh · 2015-02-06 · 3 files · -5/+18
In Hive, the 'FROM' clause is optional; this PR supports that. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4426 from viirya/optional_from and squashes the following commits: fe81f31 [Liang-Chi Hsieh] Support optional 'FROM' clause.
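A sketch of what now parses, mirroring Hive:

```scala
// No FROM clause required after this change.
sqlContext.sql("SELECT 1 + 1").collect()
```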
* [SPARK-5639][SQL] Support DataFrame.renameColumn. · Reynold Xin · 2015-02-05 · 4 files · -1/+39
Author: Reynold Xin <rxin@databricks.com> Closes #4410 from rxin/df-renameCol and squashes the following commits: a6a796e [Reynold Xin] [SPARK-5639][SQL] Support DataFrame.renameColumn.
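A sketch using the method name as added in this commit (columns hypothetical; later Spark releases expose this operation as `withColumnRenamed`):

```scala
val renamed = df.renameColumn("age", "ageInYears")  // returns a new DataFrame
```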
* [HOTFIX] [SQL] Disables Metastore Parquet table conversion for "SQLQuerySuite.CTAS with serde" · Cheng Lian · 2015-02-05 · 1 file · -27/+30
Ideally we should convert Metastore Parquet tables with our own Parquet implementation on both the read path and the write path. However, the write path is not well covered, and causes this test failure. This PR is a hotfix to bring back the Jenkins PR builder. A proper fix will be delivered in a follow-up PR. Author: Cheng Lian <lian@databricks.com> Closes #4413 from liancheng/hotfix-parquet-ctas and squashes the following commits: 5291289 [Cheng Lian] Hot fix for "SQLQuerySuite.CTAS with serde"
* [SPARK-5638][SQL] Add a config flag to disable eager analysis of DataFrames · Reynold Xin · 2015-02-05 · 3 files · -4/+23
Author: Reynold Xin <rxin@databricks.com> Closes #4408 from rxin/df-config-eager and squashes the following commits: c0204cf [Reynold Xin] [SPARK-5638][SQL] Add a config flag to disable eager analysis of DataFrames.
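A sketch, assuming the flag key is `spark.sql.eagerAnalysis` (the exact key is not shown in this log) and a hypothetical invalid query:

```scala
sqlContext.setConf("spark.sql.eagerAnalysis", "false")  // assumed key
val df = sqlContext.sql("SELECT no_such_col FROM t")    // no error raised yet
df.collect()                                            // analysis failure surfaces here instead
```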
* [SPARK-5182] [SPARK-5528] [SPARK-5509] [SPARK-3575] [SQL] Parquet data source improvements · Cheng Lian · 2015-02-05 · 22 files · -733/+1518
This PR adds three major improvements to the Parquet data source:

1. Partition discovery: while reading Parquet files residing in Hive-style partition directories, `ParquetRelation2` automatically discovers partitioning information and infers partition column types. This is also partial work for [SPARK-5182] [1], which aims to provide first-class partitioning support for the data source API. Related code in this PR can easily be extracted to the data source API level in future versions.
2. Schema merging: when enabled, the Parquet data source collects schema information from all Parquet part-files and tries to merge them. Exceptions are thrown when incompatible schemas are detected. This feature is controlled by the data source option `parquet.mergeSchema` and is enabled by default.
3. Metastore Parquet table conversion moved to the analysis phase: this greatly simplifies the conversion logic. The `ParquetConversion` strategy can be removed once the old Parquet implementation is removed in the future.

This version of the Parquet data source aims to entirely replace the old Parquet implementation. However, the old version hasn't been removed yet. Users can fall back to the old version by turning off the SQL configuration `spark.sql.parquet.useDataSourceApi`. Other JIRA tickets fixed as side effects in this PR:

- [SPARK-5509] [3]: `EqualTo` now uses a proper `Ordering` to compare binary types.
- [SPARK-3575] [4]: The Metastore schema is now preserved and passed to `ParquetRelation2` via the data source option `parquet.metastoreSchema`.

TODO:

- [ ] More test cases for partition discovery
- [x] Fix write path after data source write support (#4294) is merged. It turned out to be non-trivial to fall back to the old Parquet implementation on the write path when the Parquet data source is enabled. Since we're planning to include data source write support in 1.3.0, I simply ignored two test cases involving Parquet insertion for now.
- [ ] Fix outdated comments and documentation

PS: This PR looks big, but more than half of the changed lines are trivial changes to test cases. To test Parquet with and without the new data source, almost all Parquet test cases are moved into wrapper driver functions, which introduces hundreds of lines of changes.

[1]: https://issues.apache.org/jira/browse/SPARK-5182 [2]: https://issues.apache.org/jira/browse/SPARK-5528 [3]: https://issues.apache.org/jira/browse/SPARK-5509 [4]: https://issues.apache.org/jira/browse/SPARK-3575

Author: Cheng Lian <lian@databricks.com> Closes #4308 from liancheng/parquet-partition-discovery and squashes the following commits: b6946e6 [Cheng Lian] Fixes MiMA issues, addresses comments 8232e17 [Cheng Lian] Write support for Parquet data source a49bd28 [Cheng Lian] Fixes spelling typo in trait name "CreateableRelationProvider" 808380f [Cheng Lian] Fixes issues introduced while rebasing 50dd8d1 [Cheng Lian] Addresses @rxin's comment, fixes UDT schema merging adf2aae [Cheng Lian] Fixes compilation error introduced while rebasing 4e0175f [Cheng Lian] Fixes Python Parquet API, we need Py4J array to call varargs method 0d8ec1d [Cheng Lian] Adds more test cases b35c8c6 [Cheng Lian] Fixes some typos and outdated comments dd704fd [Cheng Lian] Fixes Python Parquet API 596c312 [Cheng Lian] Uses switch to control whether use Parquet data source or not 7d0f7a2 [Cheng Lian] Fixes Metastore Parquet table conversion a1896c7 [Cheng Lian] Fixes all existing Parquet test suites except for ParquetMetastoreSuite 5654c9d [Cheng Lian] Draft version of Parquet partition discovery and schema merging
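A sketch of partition discovery under a hypothetical Hive-style directory layout:

```scala
// Layout on disk:
//   /data/events/year=2015/month=02/part-00000.parquet
//   /data/events/year=2015/month=03/part-00000.parquet
val events = sqlContext.parquetFile("/data/events")
// `year` and `month` are discovered as partition columns with inferred types,
// so this filter can prune the month=03 directory entirely:
events.filter("year = 2015 AND month = 2").count()
```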
* [SPARK-5135][SQL] Add support for describe table to DDL in SQLContext · OopsOutOfMemory · 2015-02-05 · 10 files · -28/+190
Hi rxin, marmbrus: I considered your suggestion (in #4127) and have now rewritten it; this is up to date. Could you please review it? Author: OopsOutOfMemory <victorshengli@126.com> Closes #4227 from OopsOutOfMemory/describe and squashes the following commits: 053826f [OopsOutOfMemory] describe
* [SPARK-5617][SQL] fix test failure of SQLQuerySuite · wangfei · 2015-02-05 · 2 files · -12/+7
SQLQuerySuite test failure:
[info] - simple select (22 milliseconds)
[info] - sorting (722 milliseconds)
[info] - external sorting (728 milliseconds)
[info] - limit (95 milliseconds)
[info] - date row *** FAILED *** (35 milliseconds)
[info] Results do not match for query:
[info] 'Limit 1
[info] 'Project [CAST(2015-01-28, DateType) AS c0#3630]
[info] 'UnresolvedRelation [testData], None
[info]
[info] == Analyzed Plan ==
[info] Limit 1
[info] Project [CAST(2015-01-28, DateType) AS c0#3630]
[info] LogicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
[info]
[info] == Physical Plan ==
[info] Limit 1
[info] Project [16463 AS c0#3630]
[info] PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35
[info]
[info] == Results ==
[info] !== Correct Answer - 1 ==   == Spark Answer - 1 ==
[info] ![2015-01-28]               [2015-01-27] (QueryTest.scala:77)
[info] org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
[info]   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:77)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:95)
[info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply$mcV$sp(SQLQuerySuite.scala:300)
[info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
[info]   at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNode
Author: wangfei <wangfei1@huawei.com> Closes #4395 from scwf/SQLQuerySuite and squashes the following commits: 1431a2d [wangfei] fix conflicts c35fe5e [wangfei] minor fix 01dab3a [wangfei] fix test failure of SQLQuerySuite
* [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits. · Reynold Xin · 2015-02-04 · 18 files · -8/+38
Author: Reynold Xin <rxin@databricks.com> Closes #4386 from rxin/df-implicits and squashes the following commits: 9d96606 [Reynold Xin] style fix edd296b [Reynold Xin] ReplSuite 1c946ab [Reynold Xin] [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
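A sketch of the new import location, assuming the `toDF()` helper that shipped with the 1.3 release:

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._  // DataFrame implicit conversions now live here

case class Person(name: String, age: Int)
val df = sc.parallelize(Seq(Person("Ann", 30))).toDF()  // RDD -> DataFrame via the implicits
```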
* [SPARK-5606][SQL] Support plus sign in HiveContext · q00251598 · 2015-02-04 · 1 file · -0/+1
Spark currently only supports `SELECT -key FROM DECIMAL_UDF;` in HiveContext; this patch adds support for `SELECT +key FROM DECIMAL_UDF;` as well. Author: q00251598 <qiyadong@huawei.com> Closes #4378 from watermen/SPARK-5606 and squashes the following commits: 777f132 [q00251598] sql-case22 74dd368 [q00251598] sql-case22 1a67410 [q00251598] sql-case22 c5cd5bc [q00251598] sql-case22
* [SPARK-5602][SQL] Better support for creating DataFrame from local data collection · Reynold Xin · 2015-02-04 · 10 files · -88/+170
1. Added methods to create DataFrames from Seq[Product].
2. Added executeTake to avoid running a Spark job on LocalRelations.
Author: Reynold Xin <rxin@databricks.com> Closes #4372 from rxin/localDataFrame and squashes the following commits: f696858 [Reynold Xin] style checker. 839ef7f [Reynold Xin] [SPARK-5602][SQL] Better support for creating DataFrame from local data collection.
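A sketch of both additions (the case class is hypothetical):

```scala
case class Point(x: Int, y: Int)
val df = sqlContext.createDataFrame(Seq(Point(1, 2), Point(3, 4)))  // Seq[Product] -> DataFrame
df.take(1)  // per this change, served from the local relation without launching a Spark job
```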
* [SPARK-5538][SQL] Fix flaky CachedTableSuite · Reynold Xin · 2015-02-04 · 1 file · -4/+19
Author: Reynold Xin <rxin@databricks.com> Closes #4379 from rxin/CachedTableSuite and squashes the following commits: f2b44ce [Reynold Xin] [SQL] Fix flaky CachedTableSuite.
* [SQL][DataFrame] Minor cleanup. · Reynold Xin · 2015-02-04 · 1 file · -194/+2
1. Removed LocalHiveContext in Python.
2. Reduced DSL UDF support from 22 arguments to 10 arguments so JavaDoc/ScalaDoc look nicer.
Author: Reynold Xin <rxin@databricks.com> Closes #4374 from rxin/df-style and squashes the following commits: e493342 [Reynold Xin] [SQL][DataFrame] Minor cleanup.
* [SPARK-4520] [SQL] This PR fixes the ArrayIndexOutOfBoundsException raised in SPARK-4520 · Sadhan Sood · 2015-02-04 · 4 files · -14/+60
The exception is thrown only for Thrift-generated Parquet files. The array element schema name is assumed to be "array", as in ParquetAvro, but for Thrift-generated Parquet files it is array_name + "_tuple". This leads to a missing child of the array group type, which triggers the exception when the Parquet rows are materialized. Author: Sadhan Sood <sadhan@tellapart.com> Closes #4148 from sadhan/SPARK-4520 and squashes the following commits: c5ccde8 [Sadhan Sood] [SPARK-4520] [SQL] This pr fixes the ArrayIndexOutOfBoundsException as raised in SPARK-4520.
* [SPARK-5605][SQL][DF] Allow using String to specify column name in DSL aggregate functions · Reynold Xin · 2015-02-04 · 5 files · -11/+48
Author: Reynold Xin <rxin@databricks.com> Closes #4376 from rxin/SPARK-5605 and squashes the following commits: c55f5fa [Reynold Xin] Added a Python test. f4b8dbb [Reynold Xin] [SPARK-5605][SQL][DF] Allow using String to specify colum name in DSL aggregate functions.
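A sketch of the change (schema hypothetical; the DSL helpers' package moved during the 1.3 cycle, so the import is an assumption for this era of the API):

```scala
import org.apache.spark.sql.Dsl._  // assumed location of avg/max at this point in time

// Column names can now be plain strings instead of Column objects.
df.groupBy("department").agg(avg("salary"), max("age"))
```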
* [SPARK-5577] Python udf for DataFrame · Davies Liu · 2015-02-04 · 2 files · -2/+44
Author: Davies Liu <davies@databricks.com> Closes #4351 from davies/python_udf and squashes the following commits: d250692 [Davies Liu] fix conflict 34234d4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python_udf 440f769 [Davies Liu] address comments f0a3121 [Davies Liu] track life cycle of broadcast f99b2e1 [Davies Liu] address comments 462b334 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python_udf 7bccc3b [Davies Liu] python udf 58dee20 [Davies Liu] clean up