path: root/sql/hive
Commit message, Author, Age, Files, Lines
* [SPARK-4552][SQL] Avoid exception when reading empty parquet data through Hive (Michael Armbrust, 2014-12-03; 2 files, -44/+58)
  This is a very small fix that catches one specific exception and returns an empty table. #3441 will address this in a more principled way.
  Author: Michael Armbrust <michael@databricks.com>
  Closes #3586 from marmbrus/fixEmptyParquet and squashes the following commits:
  2781d9f [Michael Armbrust] Handle empty lists for newParquet
  04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive
* [SPARK-4695][SQL] Get result using executeCollect (wangfei, 2014-12-02; 1 file, -1/+3)
  Use `executeCollect` to collect the result, because `executeCollect` is a custom implementation of collect in Spark SQL that is better than the RDD's collect.
  Author: wangfei <wangfei1@huawei.com>
  Closes #3547 from scwf/executeCollect and squashes the following commits:
  a5ab68e [wangfei] Revert "adding debug info"
  a60d680 [wangfei] fix test failure
  0db7ce8 [wangfei] adding debug info
  184c594 [wangfei] using executeCollect instead collect
* [SPARK-4529] [SQL] support view with column alias (Daoyuan Wang, 2014-12-01; 2 files, -3/+3)
  Support view definitions such as `CREATE VIEW view3(valoo) TBLPROPERTIES ("fear" = "factor") AS SELECT upper(value) FROM src WHERE key=86;`, where `valoo` is the alias of `upper(value)`. This is the missing part of SPARK-4239, needed for full view support.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #3396 from adrian-wang/viewcolumn and squashes the following commits:
  4d001d0 [Daoyuan Wang] support view with column alias
* [SPARK-4661][Core] Minor code and docs cleanup (zsxwing, 2014-12-01; 1 file, -1/+1)
  Author: zsxwing <zsxwing@gmail.com>
  Closes #3521 from zsxwing/SPARK-4661 and squashes the following commits:
  03cbe3f [zsxwing] Minor code and docs cleanup
* [SPARK-4413][SQL] Parquet support through datasource API (Michael Armbrust, 2014-11-20; 1 file, -66/+112)
  Goals:
  - Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns)
  - Support for folder based partitioning with automatic discovery of available partitions
  - Caching of file metadata
  See scaladoc of `ParquetRelation2` for more details.
  Author: Michael Armbrust <michael@databricks.com>
  Closes #3269 from marmbrus/newParquet and squashes the following commits:
  1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once.
  645768b [Michael Armbrust] Review comments.
  abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API.
  938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions.
  e9d2641 [Michael Armbrust] logging / formatting improvements.
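The "folder based partitioning with automatic discovery" goal can be illustrated with a small sketch. It is not the `ParquetRelation2` implementation (which is Scala); it assumes a Hive-style `key=value` directory layout, and the function names and sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def parse_partition_values(base_dir, file_path):
    """Return the key=value directory components between base_dir and the file."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(base_dir))
    values = {}
    # Every parent directory of the form key=value contributes one partition column.
    for part in relative.parts[:-1]:
        key, sep, value = part.partition("=")
        if sep:
            values[key] = value
    return values


def discover_partitions(base_dir, file_paths):
    """Collect the distinct partition specs found under base_dir, in first-seen order."""
    seen = []
    for path in file_paths:
        spec = parse_partition_values(base_dir, path)
        if spec not in seen:
            seen.append(spec)
    return seen


# Hypothetical file listing, used only for illustration.
files = [
    "/data/events/year=2014/month=11/part-0.parquet",
    "/data/events/year=2014/month=11/part-1.parquet",
    "/data/events/year=2014/month=12/part-0.parquet",
]
partitions = discover_partitions("/data/events", files)
```

Here the partition columns come purely from the directory names, so no metastore lookup is needed to know which partitions exist.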
* [SPARK-4244] [SQL] Support Hive Generic UDFs with constant object inspector parameters (Cheng Hao, 2014-11-20; 4 files, -8/+17)
  The query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1` throws an exception. Some Hive generic UDFs/UDAFs require the input object inspector to be a `ConstantObjectInspector`, but we do not get one until expression optimization (constant folding) has run. This PR is a workaround for that. (Ideally, the `output` of a LogicalPlan should be identical before and after optimization.)
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #3109 from chenghao-intel/optimized and squashes the following commits:
  487ff79 [Cheng Hao] rebase to the latest master & update the unittest
* [SPARK-2918] [SQL] Support the CTAS in EXPLAIN command (Cheng Hao, 2014-11-20; 2 files, -1/+41)
  Hive supports `explain` on CTAS statements. Spark SQL supported this previously, but the support appears to have been lost in the HiveQL code refactoring.
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #3357 from chenghao-intel/explain and squashes the following commits:
  7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command
* Bumping version to 1.3.0-SNAPSHOT. (Marcelo Vanzin, 2014-11-18; 1 file, -1/+1)
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #3277 from vanzin/version-1.3 and squashes the following commits:
  7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
  5f404ff [Marcelo Vanzin] Add another exclusion.
  19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
  3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
  e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
* [SQL] Support partitioned parquet tables that have the key in both the directory and the file (Michael Armbrust, 2014-11-18; 2 files, -68/+108)
  Author: Michael Armbrust <michael@databricks.com>
  Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following commits:
  447f08c [Michael Armbrust] Support partitioned parquet tables that have the key in both the directory and the file
* [SPARK-4448] [SQL] unwrap for the ConstantObjectInspector (Cheng Hao, 2014-11-17; 1 file, -4/+32)
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #3308 from chenghao-intel/unwrap_constant_oi and squashes the following commits:
  156b500 [Cheng Hao] rebase the master
  c5b20ab [Cheng Hao] unwrap for the ConstantObjectInspector
* [SPARK-4443][SQL] Fix statistics for external table in spark sql hive (w00228970, 2014-11-17; 3 files, -3/+12)
  The `totalSize` of an external table is always zero, which skews the join strategy: a broadcast join is always chosen for external tables.
  Author: w00228970 <wangfei1@huawei.com>
  Closes #3304 from scwf/statistics and squashes the following commits:
  568f321 [w00228970] fix statistics for external table
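Why a zero `totalSize` forces a broadcast join can be seen in a toy size-based planner. This is a sketch under assumptions, not Spark's planner: the threshold value, the `"shuffle"` label, and the function name are illustrative only.

```python
# Assumed threshold for this sketch; the real value is a configuration setting.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(left_size, right_size, threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick a broadcast join when either side's reported size fits under the threshold."""
    if min(left_size, right_size) <= threshold:
        return "broadcast"
    return "shuffle"


big_table = 200 * 1024**3       # 200 GiB fact table
external_table = 50 * 1024**3   # actual size of the external table

# With the statistic stuck at zero, the external table always looks broadcastable.
misreported = choose_join(big_table, 0)
# With the real size reported, the planner falls back to the other strategy.
corrected = choose_join(big_table, external_table)
```

Any planner that compares a reported size against a threshold will behave this way when the statistic is missing and defaults to zero.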
* [SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types (Cheng Lian, 2014-11-17; 1 file, -67/+58)
  This PR is exactly the same as #3178 except it reverts the `FileStatus.isDir` to `FileStatus.isDirectory` change, since it doesn't compile with Hadoop 1.
  Author: Cheng Lian <lian@databricks.com>
  Closes #3298 from liancheng/date-for-thriftserver and squashes the following commits:
  866037e [Cheng Lian] Revers isDirectory to isDir (it breaks Hadoop 1 profile)
  6f71d0b [Cheng Lian] Makes toHiveString static
  26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
  a92882a [Cheng Lian] Updates HiveShim for 0.13.1
  73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
* Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types" (Michael Armbrust, 2014-11-16; 1 file, -59/+68)
  Author: Michael Armbrust <michael@databricks.com>
  Closes #3292 from marmbrus/revert4309 and squashes the following commits:
  808e96e [Michael Armbrust] Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types"
* [SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types (Cheng Lian, 2014-11-16; 1 file, -68/+59)
  SPARK-4407 was detected while working on SPARK-4309. Merged these two into a single PR since 1.2.0 RC is approaching.
  Author: Cheng Lian <lian@databricks.com>
  Closes #3178 from liancheng/date-for-thriftserver and squashes the following commits:
  6f71d0b [Cheng Lian] Makes toHiveString static
  26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
  a92882a [Cheng Lian] Updates HiveShim for 0.13.1
  73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
* [SPARK-4390][SQL] Handle NaN cast to decimal correctly (Michael Armbrust, 2014-11-14; 2 files, -0/+4)
  Author: Michael Armbrust <michael@databricks.com>
  Closes #3256 from marmbrus/NanDecimal and squashes the following commits:
  4c3ba46 [Michael Armbrust] fix style
  d360f83 [Michael Armbrust] Handle NaN cast to decimal
* SPARK-4375. no longer require -Pscala-2.10 (Sandy Ryza, 2014-11-14; 1 file, -3/+0)
  It seems like the winds might have moved away from this approach, but wanted to post the PR anyway because I got it working and to show what it would look like.
  Author: Sandy Ryza <sandy@cloudera.com>
  Closes #3239 from sryza/sandy-spark-4375 and squashes the following commits:
  0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt
  cd42d94 [Sandy Ryza] Update doc
  f6644c3 [Sandy Ryza] SPARK-4375 take 2
* [SPARK-4245][SQL] Fix containsNull of the result ArrayType of CreateArray expression. (Takuya UESHIN, 2014-11-14; 3 files, -1/+82)
  The `containsNull` of the result `ArrayType` of `CreateArray` should be `true` only if there are no children or at least one child is nullable.
  Author: Takuya UESHIN <ueshin@happy-camper.st>
  Closes #3110 from ueshin/issues/SPARK-4245 and squashes the following commits:
  6f64746 [Takuya UESHIN] Move equalsIgnoreNullability method into DataType.
  5a90e02 [Takuya UESHIN] Refine InsertIntoHiveType and add some comments.
  cbecba8 [Takuya UESHIN] Fix a test title.
  884ec37 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4245
  3c5274b [Takuya UESHIN] Add tests to insert data of types ArrayType / MapType / StructType with nullability is false into Hive table.
  41a94a9 [Takuya UESHIN] Replace InsertIntoTable with InsertIntoHiveTable if data types ignoring nullability are same.
  43e6ef5 [Takuya UESHIN] Fix containsNull for empty array.
  778e997 [Takuya UESHIN] Fix containsNull of the result ArrayType of CreateArray expression.
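The nullability rule stated in this commit is small enough to write down directly. The sketch below restates it in Python; the `Child` type and `contains_null` function are hypothetical stand-ins for Catalyst expressions, not Spark code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:
    """Hypothetical stand-in for a child expression of CreateArray."""
    name: str
    nullable: bool


def contains_null(children):
    """True only if there are no children or at least one child is nullable."""
    return len(children) == 0 or any(child.nullable for child in children)


empty = contains_null([])
all_non_null = contains_null([Child("a", False), Child("b", False)])
one_nullable = contains_null([Child("a", False), Child("b", True)])
```

The practical effect is that an array built only from non-nullable children can be typed as not containing nulls.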
* [SPARK-4239] [SQL] support view in HiveQl (Daoyuan Wang, 2014-11-14; 42 files, -17/+5098)
  Views such as `CREATE VIEW view3(valoo) TBLPROPERTIES ("fear" = "factor") AS SELECT upper(value) FROM src WHERE key=86;` are still not supported, because the text stored in the metastore for this view looks like
  select `_c0` as `valoo` from (select upper(`src`.`value`) from `default`.`src` where ...) `view3`
  and Catalyst cannot resolve `_c0` for this query. Views without a column-name list in parentheses work fine.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #3131 from adrian-wang/view and squashes the following commits:
  8a56fd6 [Daoyuan Wang] michael's comments
  e46c056 [Daoyuan Wang] add some golden file
  079290a [Daoyuan Wang] remove useless import
  88afcad [Daoyuan Wang] support view in HiveQl
* [SPARK-4250] [SQL] Fix bug of constant null value mapping to ConstantObjectInspector (Cheng Hao, 2014-11-10; 11 files, -86/+199)
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #3114 from chenghao-intel/constant_null_oi and squashes the following commits:
  e603bda [Cheng Hao] fix the bug of null value for primitive types
  50a13ba [Cheng Hao] fix the timezone issue
  f54f369 [Cheng Hao] fix bug of constant null value for ObjectInspector
* [SQL] support udt to hive types conversion (hive->udt is not supported) (Xiangrui Meng, 2014-11-10; 2 files, -1/+9)
  marmbrus
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #3164 from mengxr/hive-udt and squashes the following commits:
  57c7519 [Xiangrui Meng] support udt->hive types (hive->udt is not supported)
* SPARK-1209 [CORE] (Take 2) SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop (Sean Owen, 2014-11-09; 1 file, -0/+1)
  andrewor14 Another try at SPARK-1209, to address https://github.com/apache/spark/pull/2814#issuecomment-61197619
  I successfully tested with `mvn -Dhadoop.version=1.0.4 -DskipTests clean package; mvn -Dhadoop.version=1.0.4 test`. I assume that is what failed Jenkins last time. I also tried `-Dhadoop.version=1.2.1` and `-Phadoop-2.4 -Pyarn -Phive` for more coverage.
  So this is why the class was put in `org.apache.hadoop` to begin with, I assume. One option is to leave this as-is for now and move it only when Hadoop 1.0.x support goes away.
  This is the other option, which adds a call to force the constructor to be public at run-time. It's probably less surprising than putting Spark code in `org.apache.hadoop`, but it does involve reflection. A `SecurityManager` might forbid this, but it would forbid a lot of stuff Spark does. This would also only affect Hadoop 1.0.x, it seems.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #3048 from srowen/SPARK-1209 and squashes the following commits:
  0d48f4b [Sean Owen] For Hadoop 1.0.x, make certain constructors public, which were public in later versions
  466e179 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
  eb61820 [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
* [SPARK-4203][SQL] Partition directories in random order when inserting into hive table (Matthew Taylor, 2014-11-07; 2 files, -4/+43)
  When inserting into a Hive table with partitions, the folders written to the file system are in a random order instead of the order defined at table creation. It seems that the loadPartition method in Hive.java has a Map<String,String> parameter but expects to be called with a map that has a defined ordering, such as LinkedHashMap.
  Working on a test but having IntelliJ problems.
  Author: Matthew Taylor <matthew.t@tbfe.net>
  Closes #3076 from tbfenet/partition_dir_order_problem and squashes the following commits:
  f1b9a52 [Matthew Taylor] Comment format fix
  bca709f [Matthew Taylor] review changes
  0e50f6b [Matthew Taylor] test fix
  99f1a31 [Matthew Taylor] partition ordering fix
  369e618 [Matthew Taylor] partition ordering fix
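The ordering problem described here comes down to the directory path being built in the map's iteration order. The following sketch shows the idea in Python; `partition_path` and `ordered_spec` are hypothetical helpers, and `OrderedDict` plays the role that `LinkedHashMap` plays on the Java side.

```python
from collections import OrderedDict


def partition_path(spec):
    """Join key=value pairs in the map's iteration order."""
    return "/".join(f"{key}={value}" for key, value in spec.items())


def ordered_spec(partition_columns, values):
    """Build an ordered map that follows the table's declared partition column order."""
    return OrderedDict((column, values[column]) for column in partition_columns)


declared_columns = ["year", "month", "day"]              # order from table creation
values = {"day": "07", "year": "2014", "month": "11"}    # arbitrary arrival order

unordered_path = partition_path(values)
ordered_path = partition_path(ordered_spec(declared_columns, values))
```

Building the ordered map from the declared column list, rather than from whatever order the values arrived in, is what keeps the directory nesting stable.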
* [SPARK-4272] [SQL] Add more unwrapper functions for primitive type in TableReader (Cheng Hao, 2014-11-07; 2 files, -4/+15)
  Currently, the data "unwrap" supports only a couple of primitive types, not all of them. This does not cause an exception, but covering the rest may improve table-scan performance for types such as binary, date, timestamp, and decimal.
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #3136 from chenghao-intel/table_reader and squashes the following commits:
  fffb729 [Cheng Hao] fix bug for retrieving the timestamp object
  e9c97a4 [Cheng Hao] Add more unwrapper functions for primitive type in TableReader
* [SQL] More aggressive defaults (Michael Armbrust, 2014-11-03; 1 file, -1/+1)
  - Turns on compression for in-memory cached data by default
  - Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
  - Ups the batch size to 10,000 rows
  - Increases the broadcast threshold to 10mb.
  - Uses our parquet implementation instead of the hive one by default.
  - Cache parquet metadata by default.
  Author: Michael Armbrust <michael@databricks.com>
  Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:
  97ee9f8 [Michael Armbrust] parquet codec docs
  e641694 [Michael Armbrust] Remote also
  a12866a [Michael Armbrust] Cache metadata.
  2d73acc [Michael Armbrust] Update docs defaults.
  d63d2d5 [Michael Armbrust] document parquet option
  da373f9 [Michael Armbrust] More aggressive defaults
* [SPARK-4152] [SQL] Avoid data change in CTAS while table already existed (Cheng Hao, 2014-11-03; 3 files, -3/+24)
  CREATE TABLE t1 (a String);
  CREATE TABLE t1 AS SELECT key FROM src; -- throws an exception
  CREATE TABLE if not exists t1 AS SELECT key FROM src; -- expected to do nothing; currently it overwrites t1, which is incorrect.
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #3013 from chenghao-intel/ctas_unittest and squashes the following commits:
  194113e [Cheng Hao] fix bug in CTAS when table already existed
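The expected CTAS behaviour from the three statements above can be modelled with a toy catalog. This is only a sketch of the semantics the commit describes; the dictionary catalog, `TableExistsError`, and `create_table_as_select` are hypothetical names, not Spark APIs.

```python
class TableExistsError(Exception):
    """Raised when a plain CTAS targets a table that already exists."""


def create_table_as_select(catalog, name, rows, if_not_exists=False):
    """Create `name` from `rows`; never touch an existing table.

    Returns True if the table was created, False if the statement was a no-op.
    """
    if name in catalog:
        if if_not_exists:
            return False  # IF NOT EXISTS: leave the existing data untouched
        raise TableExistsError(name)
    catalog[name] = list(rows)
    return True


catalog = {"t1": ["existing row"]}

# CREATE TABLE if not exists t1 AS SELECT ...: must not overwrite t1.
created = create_table_as_select(catalog, "t1", ["new row"], if_not_exists=True)

# CREATE TABLE t1 AS SELECT ...: must fail because t1 exists.
try:
    create_table_as_select(catalog, "t1", ["new row"])
    raised = False
except TableExistsError:
    raised = True
```

In both cases the existing contents of `t1` are preserved, which is the invariant the fix restores.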
* [SPARK-3572] [SQL] Internal API for User-Defined Types (Joseph K. Bradley, 2014-11-02; 1 file, -3/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet). Author: Joseph K. Bradley <joseph@databricks.com> Author: Michael Armbrust <michael@databricks.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3063 from marmbrus/udts and squashes the following commits: 7ccfc0d [Michael Armbrust] remove println 46a3aee [Michael Armbrust] Slightly easier to read test output. 6cc434d [Michael Armbrust] Recursively convert rows. e369b91 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udts 15c10a6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into sql-udt2 f3c72fe [Joseph K. Bradley] Fixing merge e13cd8a [Joseph K. Bradley] Removed Vector UDTs 5817b2b [Joseph K. Bradley] style edits 30ce5b2 [Joseph K. Bradley] updates based on code review d063380 [Joseph K. Bradley] Cleaned up Java UDT Suite, and added warning about element ordering when creating schema from Java Bean a571bb6 [Joseph K. Bradley] Removed old UDT code (registry and Java UDTs). Cleaned up other code. Extended JavaUserDefinedTypeSuite 6fddc1c [Joseph K. Bradley] Made MyLabeledPoint into a Java Bean 20630bc [Joseph K. Bradley] fixed scalastyle fa86b20 [Joseph K. Bradley] Removed Java UserDefinedType, and made UDTs private[spark] for now 8de957c [Joseph K. Bradley] Modified UserDefinedType to store Java class of user type so that registerUDT takes only the udt argument. 8b242ea [Joseph K. Bradley] Fixed merge error after last merge. Note: Last merge commit also removed SQL UDT examples from mllib. 7f29656 [Joseph K. Bradley] Moved udt case to top of all matches. 
Small cleanups b028675 [Xiangrui Meng] allow any type in UDT 4500d8a [Xiangrui Meng] update example code 87264a5 [Xiangrui Meng] remove debug code 3143ac3 [Xiangrui Meng] remove unnecessary changes cfbc321 [Xiangrui Meng] support UDT in parquet db16139 [Joseph K. Bradley] Added more doc for UserDefinedType. Removed unused code in Suite 759af7a [Joseph K. Bradley] Added more doc to UserDefineType 63626a4 [Joseph K. Bradley] Updated ScalaReflectionsSuite per @marmbrus suggestions 51e5282 [Joseph K. Bradley] fixed 1 test f025035 [Joseph K. Bradley] Cleanups before PR. Added new tests 85872f6 [Michael Armbrust] Allow schema calculation to be lazy, but ensure its available on executors. dff99d6 [Joseph K. Bradley] Added UDTs for Vectors in MLlib, plus DatasetExample using the UDTs cd60cb4 [Joseph K. Bradley] Trying to get other SQL tests to run 34a5831 [Joseph K. Bradley] Added MLlib dependency on SQL. e1f7b9c [Joseph K. Bradley] blah 2f40c02 [Joseph K. Bradley] renamed UDT types 3579035 [Joseph K. Bradley] udt annotation now working b226b9e [Joseph K. Bradley] Changing UDT to annotation fea04af [Joseph K. Bradley] more cleanups 964b32e [Joseph K. Bradley] some cleanups 893ee4c [Joseph K. Bradley] udt finallly working 50f9726 [Joseph K. Bradley] udts 04303c9 [Joseph K. Bradley] udts 39f8707 [Joseph K. Bradley] removed old udt suite 273ac96 [Joseph K. Bradley] basic UDT is working, but deserialization has yet to be done 8bebf24 [Joseph K. Bradley] commented out convertRowToScala for debugging 53de70f [Joseph K. Bradley] more udts... 982c035 [Joseph K. Bradley] still working on UDTs 19b2f60 [Joseph K. Bradley] still working on UDTs 0eaeb81 [Joseph K. Bradley] Still working on UDTs 105c5a3 [Joseph K. Bradley] Adding UserDefinedType to SQL, not done yet.
* [SPARK-4191][SQL] move wrapperFor to HiveInspectors to reuse it (wangfei, 2014-11-02; 2 files, -43/+48)
  Move `wrapperFor` from InsertIntoHiveTable to HiveInspectors so that it can be reused. The method can be reused wherever data is written through an ObjectInspector (such as for ORC support).
  Author: wangfei <wangfei1@huawei.com>
  Author: scwf <wangfei1@huawei.com>
  Closes #3057 from scwf/reuse-wraperfor and squashes the following commits:
  7ccf932 [scwf] fix conflicts
  d44f4da [wangfei] fix imports
  9bf1b50 [wangfei] revert no related change
  9a5276a [wangfei] move wrapfor to hiveinspector to reuse them
* [SPARK-3791][SQL] Provides Spark version and Hive version in HiveThriftServer2 (Cheng Lian, 2014-11-02; 2 files, -31/+8)
  This PR overrides the `GetInfo` Hive Thrift API to provide correct version information. Another property `spark.sql.hive.version` is added to reveal the underlying Hive version. These are generally useful for Spark SQL ODBC driver providers. The Spark version information is extracted from the jar manifest. Also took the chance to remove the `SET -v` hack, which was a workaround for Simba ODBC driver connectivity.
  TODO
  - [x] Find a general way to figure out Hive (or even any dependency) version.
    This [blog post](http://blog.soebes.de/blog/2014/01/02/version-information-into-your-appas-with-maven/) suggests several methods to inspect application version. In the case of Spark, this can be tricky because the chosen method:
    1. must apply to both Maven and SBT builds. For Maven builds, we can retrieve the version information from the META-INF/maven directory within the assembly jar. But this doesn't work for SBT builds.
    2. must not rely on the original jars of dependencies to extract a specific dependency version, because Spark uses an assembly jar. This implies we can't read the Hive version from Hive jar files, since the standard Spark distribution doesn't include them.
    3. should play well with `SPARK_PREPEND_CLASSES` to ease local testing during development. `SPARK_PREPEND_CLASSES` prevents classes from being loaded from the assembly jar, thus we can't locate the jar file and read its manifest.
    Given these, maybe the only reliable method is to generate a source file containing version information at build time. pwendell Do you have any suggestions from the perspective of the build process?
  **Update** Hive version is now retrieved from the newly introduced `HiveShim` object.
  Author: Cheng Lian <lian.cs.zju@gmail.com>
  Author: Cheng Lian <lian@databricks.com>
  Closes #2843 from liancheng/get-info and squashes the following commits:
  a873d0f [Cheng Lian] Updates test case
  53f43cd [Cheng Lian] Retrieves underlying Hive verson via HiveShim
  1d282b8 [Cheng Lian] Removes the Simba ODBC "SET -v" hack
  f857fce [Cheng Lian] Overrides Hive GetInfo Thrift API and adds Hive version property
* [SPARK-3247][SQL] An API for adding data sources to Spark SQL (Michael Armbrust, 2014-11-02; 2 files, -3/+5)
  This PR introduces a new set of APIs to Spark SQL to allow other developers to add support for reading data from new sources in `org.apache.spark.sql.sources`.
  New sources must implement the interface `BaseRelation`, which is responsible for describing the schema of the data. BaseRelations have three `Scan` subclasses, which are responsible for producing an RDD containing row objects. The [various Scan interfaces](https://github.com/marmbrus/spark/blob/foreign/sql/core/src/main/scala/org/apache/spark/sql/sources/package.scala#L50) allow for optimizations such as column pruning and filter push down, when the underlying data source can handle these operations.
  By implementing a class that inherits from RelationProvider, these data sources can be accessed using pure SQL. I've used the functionality to update the JSON support so it can now be used in this way as follows:
  ```sql
  CREATE TEMPORARY TABLE jsonTableSQL
  USING org.apache.spark.sql.json
  OPTIONS (
    path '/home/michael/data.json'
  )
  ```
  Further example usage can be found in the test cases: https://github.com/marmbrus/spark/tree/foreign/sql/core/src/test/scala/org/apache/spark/sql/sources
  There is also a library that uses this new API to read avro data available here: https://github.com/marmbrus/sql-avro
  Author: Michael Armbrust <michael@databricks.com>
  Closes #2475 from marmbrus/foreign and squashes the following commits:
  1ed6010 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
  ab2c31f [Michael Armbrust] fix test
  1d41bb5 [Michael Armbrust] unify argument names
  5b47901 [Michael Armbrust] Remove sealed, more filter types
  fab154a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
  e3e690e [Michael Armbrust] Add hook for extraStrategies
  a70d602 [Michael Armbrust] Fix style, more tests, FilteredSuite => PrunedFilteredSuite
  70da6d9 [Michael Armbrust] Modify API to ease binary compatibility and interop with Java
  7d948ae [Michael Armbrust] Fix equality of AttributeReference.
  5545491 [Michael Armbrust] Address comments
  5031ac3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign
  22963ef [Michael Armbrust] package objects compile wierdly...
  b069146 [Michael Armbrust] traits => abstract classes
  34f836a [Michael Armbrust] Make @DeveloperApi
  0d74bcf [Michael Armbrust] Add documention on object life cycle
  3e06776 [Michael Armbrust] remove line wraps
  de3b68c [Michael Armbrust] Remove empty file
  360cb30 [Michael Armbrust] style and java api
  2957875 [Michael Armbrust] add override
  0fd3a07 [Michael Armbrust] Draft of data sources API
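The division of labour described in the data sources commit (a relation describes its schema, and a scan receives the columns and filters the planner wants) can be sketched without Spark. The class and method names below are hypothetical and do not reproduce the signatures in `org.apache.spark.sql.sources`; filters are modelled as plain Python predicates.

```python
class InMemoryRelation:
    """Hypothetical relation: describes its schema and serves pruned, filtered scans."""

    def __init__(self, schema, rows):
        self.schema = schema  # list of column names
        self.rows = rows      # list of dicts keyed by column name

    def build_scan(self, required_columns, filters):
        """Yield tuples holding only the requested columns of rows passing every filter."""
        for row in self.rows:
            if all(predicate(row) for predicate in filters):
                yield tuple(row[column] for column in required_columns)


relation = InMemoryRelation(
    schema=["name", "age"],
    rows=[{"name": "ann", "age": 41}, {"name": "bob", "age": 29}],
)

# The planner asks only for `name`, and pushes down the predicate age > 30.
result = list(relation.build_scan(["name"], [lambda row: row["age"] > 30]))
```

Because the source sees both the column list and the predicates, it can skip reading data the query never needs, which is the point of column pruning and filter push down.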
* [HOTFIX][SQL] hive test missing some golden files (wangfei, 2014-11-02; 15 files, -0/+8)
  cc marmbrus
  Author: wangfei <wangfei1@huawei.com>
  Closes #3055 from scwf/hotfix and squashes the following commits:
  d881bd7 [wangfei] miss golden files
* [SPARK-3930] [SPARK-3933] Support fixed-precision decimal in SQL, and some optimizations (Matei Zaharia, 2014-11-01; 7 files, -31/+95)
  - Adds optional precision and scale to Spark SQL's decimal type, which behave similarly to those in Hive 13 (https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf)
  - Replaces our internal representation of decimals with a Decimal class that can store small values in a mutable Long, saving memory in this situation and letting some operations happen directly on Longs
  This is still marked WIP because there are a few TODOs, but I'll remove that tag when done.
  Author: Matei Zaharia <matei@databricks.com>
  Closes #2983 from mateiz/decimal-1 and squashes the following commits:
  35e6b02 [Matei Zaharia] Fix issues after merge
  227f24a [Matei Zaharia] Review comments
  31f915e [Matei Zaharia] Implement Davies's suggestions in Python
  eb84820 [Matei Zaharia] Support reading/writing decimals as fixed-length binary in Parquet
  4dc6bae [Matei Zaharia] Fix decimal support in PySpark
  d1d9d68 [Matei Zaharia] Fix compile error and test issues after rebase
  b28933d [Matei Zaharia] Support decimal precision/scale in Hive metastore
  2118c0d [Matei Zaharia] Some test and bug fixes
  81db9cb [Matei Zaharia] Added mutable Decimal that will be more efficient for small precisions
  7af0c3b [Matei Zaharia] Add optional precision and scale to DecimalType, but use Unlimited for now
  ec0a947 [Matei Zaharia] Make the result of AVG on Decimals be Decimal, not Double
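The idea of keeping a small decimal as an integer plus a scale, so that arithmetic runs on integers, can be sketched as follows. This is not the Spark `Decimal` class: `CompactDecimal` is a hypothetical name, the sketch is immutable rather than mutable, and Python integers are unbounded, so the overflow fallback a 64-bit Long would need is left out.

```python
class CompactDecimal:
    """Decimal held as an unscaled integer plus a scale: value = unscaled / 10**scale."""

    def __init__(self, unscaled, scale):
        self.unscaled = unscaled
        self.scale = scale

    @classmethod
    def parse(cls, text):
        """Parse a plain decimal literal such as '12.34' or '-0.5'."""
        whole, _, fraction = text.partition(".")
        return cls(int(whole + fraction), len(fraction))

    def _rescaled(self, scale):
        """Return the unscaled value expressed at a larger (or equal) scale."""
        return self.unscaled * 10 ** (scale - self.scale)

    def add(self, other):
        """Add two decimals using integer arithmetic only."""
        scale = max(self.scale, other.scale)
        return CompactDecimal(self._rescaled(scale) + other._rescaled(scale), scale)

    def __str__(self):
        sign = "-" if self.unscaled < 0 else ""
        digits = str(abs(self.unscaled)).rjust(self.scale + 1, "0")
        if self.scale == 0:
            return sign + digits
        return f"{sign}{digits[:-self.scale]}.{digits[-self.scale:]}"


total = CompactDecimal.parse("12.34").add(CompactDecimal.parse("0.5"))
```

Addition here is one integer multiply and one integer add, with no arbitrary-precision decimal object allocated along the way.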
* [SPARK-4037][SQL] Removes the SessionState instance created in HiveThriftServer2 (Cheng Lian, 2014-11-01; 1 file, -14/+30)
  `HiveThriftServer2` creates a global singleton `SessionState` instance and overrides `HiveContext` to inject the `SessionState` object. This messes up `SessionState` initialization and causes problems.
  This PR replaces the global `SessionState` with `HiveContext.sessionState` to avoid the initialization conflict. Also `HiveContext` reuses existing started `SessionState` if any (this is required by `SparkSQLCLIDriver`, which uses specialized `CliSessionState`).
  Author: Cheng Lian <lian@databricks.com>
  Closes #2887 from liancheng/spark-4037 and squashes the following commits:
  8446675 [Cheng Lian] Removes redundant Driver initialization
  a28fef5 [Cheng Lian] Avoid starting HiveContext.sessionState multiple times
  49b1c5b [Cheng Lian] Reuses existing started SessionState if any
  3cd6fab [Cheng Lian] Fixes SPARK-4037
* [SPARK-2220][SQL] Fixes remaining Hive commands (Cheng Lian, 2014-10-31; 3 files, -3/+34)
  This PR adds support for the `ADD FILE` Hive command, and removes `ShellCommand` and `SourceCommand`. The reason is described in [this SPARK-2220 comment](https://issues.apache.org/jira/browse/SPARK-2220?focusedCommentId=14191841&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14191841).
  Author: Cheng Lian <lian.cs.zju@gmail.com>
  Closes #3038 from liancheng/hive-commands and squashes the following commits:
  6db61e0 [Cheng Lian] Fixes remaining Hive commands
* [SPARK-4154][SQL] Query does not work if it has "not between " in Spark SQL and HQL (ravipesala, 2014-10-31; 2 files, -4/+14)
  A query that contains "not between" does not work, for example:
  SELECT * FROM src where key not between 10 and 20
  Author: ravipesala <ravindra.pesala@huawei.com>
  Closes #3017 from ravipesala/SPARK-4154 and squashes the following commits:
  65fc89e [ravipesala] Handled admin comments
  32e6d42 [ravipesala] 'not between' is not working
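For reference, the semantics the fixed parser has to produce are those of standard SQL, where `BETWEEN` is inclusive on both ends and `NOT BETWEEN` is its negation. A minimal sketch, ignoring NULL handling and using hypothetical helper names:

```python
def between(value, low, high):
    """SQL BETWEEN: inclusive on both bounds."""
    return low <= value <= high


def not_between(value, low, high):
    """SQL NOT BETWEEN: the negation, i.e. value < low or value > high."""
    return not between(value, low, high)


keys = [5, 10, 15, 20, 25]
# Equivalent of: SELECT key FROM src WHERE key NOT BETWEEN 10 AND 20
kept = [key for key in keys if not_between(key, 10, 20)]
```

The boundary values 10 and 20 are excluded from the result because they satisfy the inclusive `BETWEEN`.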
* [SPARK-4077][SQL] Spark SQL return wrong values for valid string timestamp values (Venkata Ramana Gollamudi, 2014-10-31; 3 files, -0/+27)
  In org.apache.hadoop.hive.serde2.io.TimestampWritable.set, if the next entry is null, the current timestamp object is reset. Because of this, hiveinspectors:unwrap cannot use the same timestamp object without creating a copy.
  Author: Venkata Ramana G <ramana.gollamudihuawei.com>
  Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
  Closes #3019 from gvramana/spark_4077 and squashes the following commits:
  32d818f [Venkata Ramana Gollamudi] fixed check style
  fa01e71 [Venkata Ramana Gollamudi] cloned timestamp object as org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object
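The aliasing hazard behind this fix can be reproduced with any reused mutable holder. The sketch below is an analogy, not Hive's `TimestampWritable`: `ReusableTimestamp` and `unwrap_all` are hypothetical, and resetting to zero on a null entry is an assumption made to mirror the behaviour the commit describes.

```python
import copy


class ReusableTimestamp:
    """Hypothetical mutable holder that a reader reuses for every row."""

    def __init__(self):
        self.millis = 0

    def set(self, millis):
        # Assumed behaviour for this sketch: a null entry resets the holder.
        self.millis = 0 if millis is None else millis


def unwrap_all(entries, clone):
    """Read entries through one shared holder, optionally copying each value out."""
    holder = ReusableTimestamp()
    results = []
    for entry in entries:
        holder.set(entry)
        if entry is None:
            results.append(None)
        else:
            results.append(copy.copy(holder) if clone else holder)
    return [None if r is None else r.millis for r in results]


shared = unwrap_all([1000, None], clone=False)  # first value is clobbered by the reset
copied = unwrap_all([1000, None], clone=True)   # first value survives
```

Copying the holder before keeping a reference is what decouples already-returned values from later resets.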
* [SPARK-3826][SQL] enable hive-thriftserver to support hive-0.13.1 (wangfei, 2014-10-31; 1 file, -0/+4)
  In #2241 hive-thriftserver is not enabled. This patch enables hive-thriftserver to support hive-0.13.1 by using a shim layer, as in #2241.
  1. A light shim layer (code in sql/hive-thriftserver/hive-version) for each hive version to handle API compatibility
  2. New pom profiles "hive-default" and "hive-versions" (copied from #2241) to activate different hive versions
  3. SBT commands for the different versions are as follows:
     hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
     hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly
  4. Since hive-thriftserver depends on the hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver
  Author: wangfei <wangfei1@huawei.com>
  Author: scwf <wangfei1@huawei.com>
  Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits:
  f26f3be [wangfei] remove clean to save time
  f5cac74 [wangfei] remove local hivecontext test
  578234d [wangfei] use new shaded hive
  18fb1ff [wangfei] exclude kryo in hive pom
  fa21d09 [wangfei] clean package assembly/assembly
  8a4daf2 [wangfei] minor fix
  0d7f6cf [wangfei] address comments
  f7c93ae [wangfei] adding build with hive 0.13 before running tests
  bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
  c359822 [wangfei] reuse getCommandProcessor in hiveshim
  52674a4 [scwf] sql/hive included since examples depend on it
  3529e98 [scwf] move hive module to hive profile
  f51ff4e [wangfei] update and fix conflicts
  f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
  41f727b [scwf] revert pom changes
  13afde0 [scwf] fix small bug
  4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
  0bc53aa [scwf] fixed when result filed is null
  dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
  c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
  7c66b8e [scwf] update pom according spark-2706
  ae47489 [scwf] update and fix conflicts
* [SPARK-4143] [SQL] Move inner class DeferredObjectAdapter to top levelCheng Hao2014-10-301-13/+12
| | | | | | | | | | The class DeferredObjectAdapter is the inner class of HiveGenericUdf, which may cause some overhead in closure ser/de-ser. Move it to top level. Author: Cheng Hao <hao.cheng@intel.com> Closes #3007 from chenghao-intel/move_deferred and squashes the following commits: 3a139b1 [Cheng Hao] Move inner class DeferredObjectAdapter to top level
* Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use ↵Andrew Or2014-10-301-1/+0
| | | | | | package org.apache.hadoop" This reverts commit 68cb69daf3022e973422e496ccf827ca3806ff30.
* SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package ↵Sean Owen2014-10-301-0/+1
| | | | | | | | | | | | | org.apache.hadoop (This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although, it's private?) Author: Sean Owen <sowen@cloudera.com> Closes #2814 from srowen/SPARK-1209 and squashes the following commits: ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though? 2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
* [SPARK-3904] [SQL] add constant objectinspector support for udfsCheng Hao2014-10-2821-92/+307
| In HQL, we convert all data types into normal `ObjectInspector`s for UDFs. This works in most cases; however, some UDFs require their children's `ObjectInspector` to be a `ConstantObjectInspector` and throw an exception otherwise, e.g.
|
|   select named_struct("x", "str") from src limit 1;
|
| I updated the method `wrap` by adding one more parameter, an `ObjectInspector` that describes what the value should be wrapped to (for example, java.lang.Integer or IntWritable), and updated the `unwrap` method to take the input `ObjectInspector`.
|
| Author: Cheng Hao <hao.cheng@intel.com>
|
| Closes #2762 from chenghao-intel/udf_coi and squashes the following commits:
|
| bcacfd7 [Cheng Hao] Shim for both Hive 0.12 & 0.13.1
| 2416e5d [Cheng Hao] revert to hive 0.12
| 5793c01 [Cheng Hao] add space before while
| 4e56e1b [Cheng Hao] style issue
| 683d3fd [Cheng Hao] Add golden files
| fe591e4 [Cheng Hao] update HiveGenericUdf for set the ObjectInspector while constructing the DeferredObject
| f6740fe [Cheng Hao] Support Constant ObjectInspector for Map & List
| 8814c3a [Cheng Hao] Passing ContantObjectInspector(when necessary) for UDF initializing
* [SPARK-3343] [SQL] Add serde support for CTASCheng Hao2014-10-2810-89/+311
| Currently, `CTAS` (Create Table As Select) doesn't support specifying the `SerDe` in HQL. This PR passes the `ASTNode` down into the physical operator `execution.CreateTableAsSelect`, which extracts the `CreateTableDesc` object via the Hive `SemanticAnalyzer`. I also updated `HiveMetastoreCatalog.createTable` to optionally accept the `CreateTableDesc` for table creation.
|
| Author: Cheng Hao <hao.cheng@intel.com>
|
| Closes #2570 from chenghao-intel/ctas_serde and squashes the following commits:
|
| e011ef5 [Cheng Hao] shim for both 0.12 & 0.13.1
| cfb3662 [Cheng Hao] revert to hive 0.12
| c8a547d [Cheng Hao] Support SerDe properties within CTAS
* [SPARK-3814][SQL] Support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) in Spark ↵ravipesala2014-10-282-0/+28
| HQL and SQL
|
| Currently there is no support for bitwise & and | in Spark HiveQL or Spark SQL, so this PR adds it. I am closing https://github.com/apache/spark/pull/2926 because it has merge conflicts. This PR also adds support for bitwise AND(&), OR(|), XOR(^), NOT(~), and handles all review comments from that PR.
|
| Author: ravipesala <ravindra.pesala@huawei.com>
|
| Closes #2961 from ravipesala/SPARK-3814-NEW4 and squashes the following commits:
|
| a391c7a [ravipesala] Rebase with master
* [SPARK-3907][SQL] Add truncate table supportwangxiaojing2014-10-2729-1/+9
| JIRA issue: [SPARK-3907] https://issues.apache.org/jira/browse/SPARK-3907
|
| Add truncate table support:
|
|   TRUNCATE TABLE table_name [PARTITION partition_spec];
|
|   partition_spec:
|     : (partition_col = partition_col_value, partition_col = partition_col_value, ...)
|
| Removes all rows from a table or partition(s). Currently the target table must be a native/managed table, or an exception will be thrown. The user can specify a partial partition_spec to truncate multiple partitions at once; omitting partition_spec truncates all partitions in the table.
|
| Author: wangxiaojing <u9jing@gmail.com>
|
| Closes #2770 from wangxiaojing/spark-3907 and squashes the following commits:
|
| 63dbd81 [wangxiaojing] change hive scalastyle
| 7a03707 [wangxiaojing] add comment
| f6e710e [wangxiaojing] change truncate table
| a1f692c [wangxiaojing] Correct spelling mistakes
| 3b20007 [wangxiaojing] add truncate can not support column err message
| e483547 [wangxiaojing] add golden file
| 77b1f20 [wangxiaojing] add truncate table support
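The partial partition_spec behaviour described in this commit message can be illustrated with a minimal sketch. This is not Hive's or Spark's implementation; the function name `matching_partitions` and the sample partition columns `ds` and `hr` are hypothetical, chosen only to show which partitions a given spec would select.

```python
def matching_partitions(partitions, partial_spec=None):
    """Return the partitions whose column values agree with partial_spec.

    partitions:   list of dicts mapping partition column -> value (hypothetical layout).
    partial_spec: dict naming a subset of the partition columns; None or an
                  empty dict selects every partition.
    """
    spec = partial_spec or {}
    return [
        p for p in partitions
        if all(p.get(col) == val for col, val in spec.items())
    ]


# Hypothetical table partitioned by (ds, hr).
partitions = [
    {"ds": "2014-10-26", "hr": "11"},
    {"ds": "2014-10-26", "hr": "12"},
    {"ds": "2014-10-27", "hr": "11"},
]

# A partial spec naming only `ds` selects both hours of that day.
selected = matching_partitions(partitions, {"ds": "2014-10-26"})
```

Under these assumptions, a spec that names only some partition columns selects every partition sharing those values, and an omitted spec selects the whole table.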
* [SPARK-4041][SQL] Attributes names in table scan should converted to ↵wangfei2014-10-273-4/+20
| lowercase when compare with relation attributes
|
| In ```MetastoreRelation``` the attribute names are lowercase because Hive uses lowercase field names, so we should convert the attribute names in the table scan to lowercase in ```indexWhere(_.name == a.name)```. Otherwise ```neededColumnIDs``` may be incorrect.
|
| Author: wangfei <wangfei1@huawei.com>
| Author: scwf <wangfei1@huawei.com>
|
| Closes #2884 from scwf/fixColumnIds and squashes the following commits:
|
| 6174046 [scwf] use AttributeMap for this issue
| dc74a24 [wangfei] use lowerName and add a test case for this issue
| 3ff3a80 [wangfei] more safer change
| 294fcb7 [scwf] attributes names in table scan should convert lowercase in neededColumnsIDs
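The failure mode behind this fix can be sketched in a few lines. This illustrates the case-sensitivity problem only; it is not the merged change, which per the squashed commits ended up using `AttributeMap`. The function name `needed_column_ids` and the sample column names are hypothetical.

```python
def needed_column_ids(relation_attrs, scan_attrs):
    """Map each scanned attribute name to its index in the relation, ignoring case.

    Returns -1 for a name with no match, mirroring what indexWhere returns
    when no element satisfies the predicate.
    """
    lowered = [name.lower() for name in relation_attrs]
    ids = []
    for name in scan_attrs:
        key = name.lower()
        ids.append(lowered.index(key) if key in lowered else -1)
    return ids


relation = ["key", "value"]   # metastore field names are lowercase
scan = ["Value", "KEY"]       # names as written in a query (hypothetical)

# A case-sensitive comparison finds nothing for these names.
naive = [relation.index(n) if n in relation else -1 for n in scan]
fixed = needed_column_ids(relation, scan)
```

With a case-sensitive comparison every lookup misses, so the scan would receive wrong column IDs; normalising case on both sides recovers the intended positions.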
* [SPARK-3816][SQL] Add table properties from storage handler to output jobConfAlex Liu2014-10-271-0/+8
| | | | | | | | | | ...ob conf in SparkHadoopWriter class Author: Alex Liu <alex_liu68@yahoo.com> Closes #2677 from alexliu68/SPARK-SQL-3816 and squashes the following commits: 79c269b [Alex Liu] [SPARK-3816][SQL] Add table properties from storage handler to job conf
* [SPARK-3911] [SQL] HiveSimpleUdf can not be optimized in constant foldingCheng Hao2014-10-274-1/+102
| ```
| explain extended select cos(null) from src limit 1;
| ```
| outputs:
| ```
|  Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
|   MetastoreRelation default, src, None
|
| == Optimized Logical Plan ==
| Limit 1
|  Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
|   MetastoreRelation default, src, None
|
| == Physical Plan ==
| Limit 1
|  Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
|   HiveTableScan [], (MetastoreRelation default, src, None), None
| ```
| After patching this PR it outputs
| ```
| == Parsed Logical Plan ==
| Limit 1
|  Project ['cos(null) AS c_0#0]
|   UnresolvedRelation None, src, None
|
| == Analyzed Logical Plan ==
| Limit 1
|  Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#0]
|   MetastoreRelation default, src, None
|
| == Optimized Logical Plan ==
| Limit 1
|  Project [null AS c_0#0]
|   MetastoreRelation default, src, None
|
| == Physical Plan ==
| Limit 1
|  Project [null AS c_0#0]
|   HiveTableScan [], (MetastoreRelation default, src, None), None
| ```
|
| Author: Cheng Hao <hao.cheng@intel.com>
|
| Closes #2771 from chenghao-intel/hive_udf_constant_folding and squashes the following commits:
|
| 1379c73 [Cheng Hao] duplicate the PlanTest with catalyst/plans/PlanTest
| 1e52dda [Cheng Hao] add unit test for hive simple udf constant folding
| 01609ff [Cheng Hao] support constant folding for HiveSimpleUdf
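The plans above show `cos(null)` being replaced by the literal `null` in the optimized plan. A minimal sketch of that kind of constant folding follows. It is not Catalyst's implementation; the classes `Literal`, `Column`, `Call` and the helper `null_safe_cos` are hypothetical, and the sketch assumes the function is deterministic and returns NULL for NULL input.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Literal:
    value: Any  # None stands in for SQL NULL


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Call:
    name: str
    fn: Callable[..., Any]
    args: Tuple[Any, ...]


def fold(expr):
    """Replace a call whose arguments are all literals with its evaluated result."""
    if isinstance(expr, Call):
        args = tuple(fold(a) for a in expr.args)
        if all(isinstance(a, Literal) for a in args):
            # Every input is known at planning time, so evaluate once here.
            return Literal(expr.fn(*(a.value for a in args)))
        return Call(expr.name, expr.fn, args)
    return expr


def null_safe_cos(x):
    # Assumed NULL semantics: a NULL input yields NULL.
    return None if x is None else math.cos(x)


folded = fold(Call("cos", null_safe_cos, (Literal(None),)))
unfolded = fold(Call("cos", null_safe_cos, (Column("key"),)))
```

In this sketch a call over literals collapses to a `Literal`, while a call that references a column is left in place, because its value is only known per row.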
* [SPARK-4042][SQL] Append columns ids and names before broadcastscwf2014-10-261-2/+3
| | | | | | | | | | Append columns ids and names before broadcast ```hiveExtraConf``` in ```HadoopTableReader```. Author: scwf <wangfei1@huawei.com> Closes #2885 from scwf/HadoopTableReader and squashes the following commits: a8c498c [scwf] append columns ids and names before broadcast
* [SPARK-4052][SQL] Use scala.collection.Map for pattern matching instead of ↵Yin Huai2014-10-262-0/+22
| | | | | | | | | | | | | using Predef.Map (it is scala.collection.immutable.Map) Please check https://issues.apache.org/jira/browse/SPARK-4052 for cases triggering this bug. Author: Yin Huai <huai@cse.ohio-state.edu> Closes #2899 from yhuai/SPARK-4052 and squashes the following commits: 1188f70 [Yin Huai] Address liancheng's comments. b6712be [Yin Huai] Use scala.collection.Map instead of Predef.Map (scala.collection.immutable.Map).
* [SQL] Update Hive test harness for Hive 12 and 13Michael Armbrust2014-10-248160-47484/+38262
| | | | | | | | | | | | | As part of the upgrade I also copy the newest version of the query tests, and whitelist a bunch of new ones that are now passing. Author: Michael Armbrust <michael@databricks.com> Closes #2936 from marmbrus/fix13tests and squashes the following commits: d9cbdab [Michael Armbrust] Remove user specific tests 65801cd [Michael Armbrust] style and rat 8f6b09a [Michael Armbrust] Update test harness to work with both Hive 12 and 13. f044843 [Michael Armbrust] Update Hive query tests and golden files to 0.13
* [HOTFIX][SQL] Remove sleep on reset() failure.Michael Armbrust2014-10-241-4/+0
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #2934 from marmbrus/patch-2 and squashes the following commits: a96dab2 [Michael Armbrust] Remove sleep on reset() failure.