path: root/sql
Commit message    Author    Age    Files    Lines
* [SPARK-8839] [SQL] ThriftServer2 will remove session and execution no matter ↵huangzhaowei2015-07-091-2/+5
| | | | | | | | | | | | | | | | | it's finished or not. In my test, the number of `sessions` and `executions` in ThriftServer2 does not match the number of connections. For example, with 200 clients connected to the server, there can be more than 200 `sessions` and `executions`. So once the count reaches `retainedStatements`, the server has to remove objects that are not yet finished, which may cause the exception described in [Jira Address](https://issues.apache.org/jira/browse/SPARK-8839) Author: huangzhaowei <carlmartinmax@gmail.com> Closes #7239 from SaintBacchus/SPARK-8839 and squashes the following commits: cf7ef40 [huangzhaowei] Remove a meaningless function call 3e9a5a6 [huangzhaowei] Add a filter before take 9d5ceb8 [huangzhaowei] [SPARK-8839][SQL]ThriftServer2 will remove session and execution no matter it's finished or not.
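A minimal sketch of the "add a filter before take" idea, with hypothetical names rather than the real ThriftServer2 structures: only finished executions are eviction candidates when trimming to `retainedStatements`.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the server's per-execution bookkeeping.
case class ExecutionInfo(statement: String, finished: Boolean)

class ExecutionLog(retainedStatements: Int) {
  private val executions = mutable.ArrayBuffer[ExecutionInfo]()

  def trim(): Unit = synchronized {
    val excess = executions.size - retainedStatements
    if (excess > 0) {
      // Filter before take: only finished executions may be evicted,
      // so an in-flight execution is never removed from under a client.
      val victims = executions.filter(_.finished).take(excess)
      executions --= victims
    }
  }
}
```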
* [SPARK-8959] [SQL] [HOTFIX] Removes parquet-thrift and libthrift dependenciesCheng Lian2015-07-096-3480/+8
| | | | | | | | | | | | | | | | | These two dependencies were introduced in #7231 to help testing Parquet compatibility with `parquet-thrift`. However, they somehow crash the Scala compiler in Maven builds. This PR fixes this issue by: 1. Removing these two dependencies, and 2. Instead of generating the testing Parquet file programmatically, checking in an actual testing Parquet file generated by `parquet-thrift` as a test resource. This is just a quick fix to bring back Maven builds. We need to figure out the root cause, as binary Parquet files are harder to maintain. Author: Cheng Lian <lian@databricks.com> Closes #7330 from liancheng/spark-8959 and squashes the following commits: cf69512 [Cheng Lian] Brings back Maven builds
* [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of ↵Davies Liu2015-07-093-21/+118
| | | | | | | | | | | | | | | | serialization for Python DataFrame This PR fixes the long-standing issue of serialization between Python RDDs and DataFrames: it switches to a customized Pickler for InternalRow to enable customized unpickling (type conversion, especially for UDTs), so now we can support UDTs for UDFs, cc mengxr. There is no generated `Row` anymore. Author: Davies Liu <davies@databricks.com> Closes #7301 from davies/sql_ser and squashes the following commits: 81bef71 [Davies Liu] address comments e9217bd [Davies Liu] add regression tests db34167 [Davies Liu] Refactor of serialization for Python DataFrame
* [SPARK-8247] [SPARK-8249] [SPARK-8252] [SPARK-8254] [SPARK-8257] ↵Cheng Hao2015-07-095-18/+923
| | | | | | | | | | | [SPARK-8258] [SPARK-8259] [SPARK-8261] [SPARK-8262] [SPARK-8253] [SPARK-8260] [SPARK-8267] [SQL] Add String Expressions Author: Cheng Hao <hao.cheng@intel.com> Closes #6762 from chenghao-intel/str_funcs and squashes the following commits: b09a909 [Cheng Hao] update the code as feedback 7ebbf4c [Cheng Hao] Add more string expressions
* [SPARK-8938][SQL] Implement toString for Interval data typeWenchen Fan2015-07-091-6/+18
| | | | | | | | Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7315 from cloud-fan/toString and squashes the following commits: 4fc8d80 [Wenchen Fan] Implement toString for Interval data type
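A minimal sketch of what such a `toString` can look like, assuming the (months, microseconds) interval representation introduced by SPARK-8753 further down this log; this is illustrative, not the Spark implementation.

```scala
// Illustrative interval: whole months plus a sub-month part in microseconds.
case class Interval(months: Int, microseconds: Long) {
  override def toString: String = {
    val sb = new StringBuilder("interval")
    if (months != 0) sb.append(s" $months months")
    val seconds = microseconds / 1000000L
    val micros = microseconds % 1000000L
    if (seconds != 0) sb.append(s" $seconds seconds")
    if (micros != 0) sb.append(s" $micros microseconds")
    sb.toString()
  }
}

// Interval(15, 4000123L).toString == "interval 15 months 4 seconds 123 microseconds"
```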
* [SPARK-8926][SQL] Code review followup.Reynold Xin2015-07-094-6/+23
| | | | | | | | | | | I merged https://github.com/apache/spark/pull/7303 so it unblocks another PR. This addresses my own code review comment for that PR. Author: Reynold Xin <rxin@databricks.com> Closes #7313 from rxin/adt and squashes the following commits: 7ade82b [Reynold Xin] Fixed unit tests. f8d5533 [Reynold Xin] [SPARK-8926][SQL] Code review followup.
* [SPARK-8948][SQL] Remove ExtractValueWithOrdinal abstract classReynold Xin2015-07-091-20/+34
| | | | | | | | | | | | Also added more documentation for the file. Author: Reynold Xin <rxin@databricks.com> Closes #7316 from rxin/extract-value and squashes the following commits: 069cb7e [Reynold Xin] Removed ExtractValueWithOrdinal. 621b705 [Reynold Xin] Reverted a line. 11ebd6c [Reynold Xin] [Minor][SQL] Improve documentation for complex type extractors.
* [SPARK-8830] [SQL] native levenshtein distanceTarek Auel2015-07-092-5/+9
| | | | | | | | | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-8830 rxin and HuJiayin can you have a look on it. Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7236 from tarekauel/native-levenshtein-distance and squashes the following commits: ee4c4de [Tarek Auel] [SPARK-8830] implemented improvement proposals c252e71 [Tarek Auel] [SPARK-8830] removed chartAt; use unsafe method for byte array comparison ddf2222 [Tarek Auel] Merge branch 'master' into native-levenshtein-distance 179920a [Tarek Auel] [SPARK-8830] added description 5e9ed54 [Tarek Auel] [SPARK-8830] removed StringUtils import dce4308 [Tarek Auel] [SPARK-8830] native levenshtein distance
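For reference, the classic two-row dynamic-programming formulation of Levenshtein distance looks like the sketch below. The actual patch works at the byte level and avoids `charAt` for speed, so treat this only as a statement of the algorithm, not the Spark code.

```scala
// Two-row DP: row i holds edit distances between a(0..i) and all prefixes of b.
def levenshtein(a: String, b: String): Int = {
  if (a.isEmpty) return b.length
  if (b.isEmpty) return a.length
  var prev = Array.tabulate(b.length + 1)(identity) // distances for the empty prefix of a
  var curr = new Array[Int](b.length + 1)
  for (i <- 1 to a.length) {
    curr(0) = i
    for (j <- 1 to b.length) {
      val cost = if (a.charAt(i - 1) == b.charAt(j - 1)) 0 else 1
      curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
    }
    val tmp = prev; prev = curr; curr = tmp // reuse the two rows
  }
  prev(b.length)
}
```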
* [SPARK-8931] [SQL] Fallback to interpreted evaluation if failed to compile ↵Davies Liu2015-07-092-6/+58
| | | | | | | | | | | | | | | | | in codegen Exceptions will not be caught during tests. cc marmbrus rxin Author: Davies Liu <davies@databricks.com> Closes #7309 from davies/fallback and squashes the following commits: 969a612 [Davies Liu] throw exception during tests f844f77 [Davies Liu] fallback a3091bc [Davies Liu] Merge branch 'master' of github.com:apache/spark into fallback 364a0d6 [Davies Liu] fallback to interpret mode if failed to compile
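The fallback pattern itself is simple; a hedged sketch (not the actual Spark code) of compile-then-fall-back, with a test-only flag so codegen bugs are rethrown in tests rather than silently masked:

```scala
import scala.util.control.NonFatal

object ProjectionFactory {
  type Projection = Seq[Any] => Seq[Any]

  def newProjection(
      compileGenerated: () => Projection, // may throw if the generated code fails to compile
      interpreted: () => Projection,
      failFastForTests: Boolean): Projection = {
    try {
      compileGenerated()
    } catch {
      case NonFatal(e) if !failFastForTests =>
        // Production path: fall back to interpreted evaluation instead of failing the query.
        interpreted()
    }
  }
}
```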
* [SPARK-8942][SQL] use double not decimal when cast double and float to timestampWenchen Fan2015-07-091-12/+6
| | | | | | | | Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7312 from cloud-fan/minor and squashes the following commits: a4589fa [Wenchen Fan] use double not decimal when cast double and float to timestamp
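The arithmetic involved, assuming the 1µs timestamp precision adopted in SPARK-8866 below: a fractional-seconds double maps to microseconds with a single multiplication, so the Decimal round-trip is unnecessary.

```scala
// Illustrative only: seconds-since-epoch as a double -> microseconds.
def doubleToTimestampMicros(seconds: Double): Long =
  (seconds * 1000000.0).toLong

assert(doubleToTimestampMicros(1.5) == 1500000L)
```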
* [SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when ↵Weizhong Lin2015-07-082-7/+9
| | | | | | | | | | | | handling Parquet LISTs in compatible mode This PR is based on #7209 authored by Sephiroth-Lin. Author: Weizhong Lin <linweizhong@huawei.com> Closes #7314 from liancheng/spark-8928 and squashes the following commits: 75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
* Revert "[SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- ↵Cheng Lian2015-07-082-9/+7
| | | | | | when handling Parquet LISTs in compatible mode" This reverts commit 3dab0da42940a46f0c4aa4853bdb5c64c4cb2613.
* [SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when ↵Cheng Lian2015-07-082-7/+9
| | | | | | | | | | | | handling Parquet LISTs in compatible mode This PR is based on #7209 authored by Sephiroth-Lin. Author: Weizhong Lin <linweizhong@huawei.com> Closes #7304 from liancheng/spark-8928 and squashes the following commits: 75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
* [SPARK-8926][SQL] Good errors for ExpectsInputType expressionsMichael Armbrust2015-07-0813-143/+256
| | | | | | | | | | | | | For example: `cannot resolve 'testfunction(null)' due to data type mismatch: argument 1 is expected to be of type int, however, null is of type datetype.` Author: Michael Armbrust <michael@databricks.com> Closes #7303 from marmbrus/expectsTypeErrors and squashes the following commits: c654a0e [Michael Armbrust] fix udts and make errors pretty 137160d [Michael Armbrust] style 5428fda [Michael Armbrust] style 10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for ExpectsInputType expressions
* [SPARK-8910] Fix MiMa flaky due to port contention issueAndrew Or2015-07-082-7/+8
| | | | | | | | | | | | Due to the way MiMa works, we currently start a `SQLContext` pretty early on. This causes us to start a `SparkUI` that attempts to bind to port 4040. Because many tests run in parallel on the Jenkins machines, this causes port contention sometimes and fails the MiMa tests. Note that we already disabled the SparkUI for scalatests. However, the MiMa test is run before we even have a chance to load the default scalatest settings, so we need to explicitly disable the UI ourselves. Author: Andrew Or <andrew@databricks.com> Closes #7300 from andrewor14/mima-flaky and squashes the following commits: b55a547 [Andrew Or] Do not enable SparkUI during tests
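The knob in question is a real Spark configuration key; setting it explicitly means no `SparkUI` ever tries to bind port 4040 during the run.

```scala
import org.apache.spark.SparkConf

// Explicitly disable the UI so parallel test JVMs never contend for port 4040.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("mima-test")
  .set("spark.ui.enabled", "false")
```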
* [SPARK-8932] Support copy() for UnsafeRows that do not use ObjectPoolsJosh Rosen2015-07-084-19/+87
| | | | | | | | | | | | | | | | We call Row.copy() in many places throughout SQL but UnsafeRow currently throws UnsupportedOperationException when copy() is called. Supporting copying when ObjectPool is used may be difficult, since we may need to handle deep-copying of objects in the pool. In addition, this copy() method needs to produce a self-contained row object which may be passed around / buffered by downstream code which does not understand the UnsafeRow format. In the long run, we'll need to figure out how to handle the ObjectPool corner cases, but this may be unnecessary if other changes are made. Therefore, in order to unblock my sort patch (#6444) I propose that we support copy() for the cases where UnsafeRow does not use an ObjectPool and continue to throw UnsupportedOperationException when an ObjectPool is used. This patch accomplishes this by modifying UnsafeRow so that it knows the size of the row's backing data in order to be able to copy it into a byte array. Author: Josh Rosen <joshrosen@databricks.com> Closes #7306 from JoshRosen/SPARK-8932 and squashes the following commits: 338e6bf [Josh Rosen] Support copy for UnsafeRows that do not use ObjectPools.
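The copy strategy described above can be sketched as follows, assuming the row knows the size of its backing data; this is illustrative, not the real UnsafeRow.

```scala
// A self-contained copy is just a clone of the backing bytes: the fresh
// row owns its own buffer and can be passed around or buffered downstream.
class FixedWidthRow(private val bytes: Array[Byte], val numFields: Int) {
  def copy(): FixedWidthRow = {
    val fresh = new Array[Byte](bytes.length)
    System.arraycopy(bytes, 0, fresh, 0, bytes.length)
    new FixedWidthRow(fresh, numFields)
  }
}
```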
* [SPARK-8866][SQL] use 1us precision for timestamp typeYijie Shen2015-07-0810-49/+49
| | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8866 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7283 from yijieshen/micro_timestamp and squashes the following commits: dc735df [Yijie Shen] update CastSuite to avoid round error 714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision loss c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp 8d4aa6b [Yijie Shen] use 1us precision for timestamp type
* [SPARK-8450] [SQL] [PYSPARK] cleanup type converter for Python DataFrameDavies Liu2015-07-083-73/+54
| | | | | | | | | | | | | | | | | | | | This PR fixes the converter for Python DataFrame, especially for DecimalType Closes #7106 Author: Davies Liu <davies@databricks.com> Closes #7131 from davies/decimal_python and squashes the following commits: 4d3c234 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python 20531d6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python 7d73168 [Davies Liu] fix conflict 6cdd86a [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python 7104e97 [Davies Liu] improve type infer 9cd5a21 [Davies Liu] run python tests with SPARK_PREPEND_CLASSES 829a05b [Davies Liu] fix UDT in python c99e8c5 [Davies Liu] fix mima c46814a [Davies Liu] convert decimal for Python DataFrames
* [SPARK-8914][SQL] Remove RDDApiKousuke Saruta2015-07-082-87/+19
| | | | | | | | | | | As rxin suggested in #7298 , we should consider to remove `RDDApi`. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #7302 from sarutak/remove-rddapi and squashes the following commits: e495d35 [Kousuke Saruta] Fixed mima cb7ebb9 [Kousuke Saruta] Removed overriding RDDApi
* [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for ↵Cheng Lian2015-07-0824-904/+5947
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | interoperability and backwards-compatibility This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support. And this one fixes the read path. Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]). ### Major changes 1. `CatalystConverter` class hierarchy refactoring - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`. Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. Root converter doesn't have a parent and thus uses a `NoopUpdater`. This simplifies the design since converters don't need to care about details of their parent converters anymore. - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter` Specifically, now all row objects are represented by `SpecificMutableRow` during conversion. - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter` `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal. The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way. - Implements backwards-compatibility rules in `CatalystArrayConverter` When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`. 2. Requested columns handling When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns. This is not preferable when taking compatibility and interoperability into consideration. Because the actual Parquet file may have different physical structure from the converted schema. In this PR, the schema for requested columns is constructed using the following method: - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column. - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`. - Unions all single-field `MessageType`s into a full schema containing all requested fields With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-files. 
### Testing This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in. [1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1 [2]: https://issues.apache.org/jira/browse/SPARK-6774 [3]: https://issues.apache.org/jira/browse/SPARK-6123 [4]: https://issues.apache.org/jira/browse/SPARK-8848 Author: Cheng Lian <lian@databricks.com> Closes #7231 from liancheng/spark-6776 and squashes the following commits: 360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite c6fbc06 [Cheng Lian] Removes WIP file committed by mistake b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa 598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift 926af87 [Cheng Lian] Simplifies Parquet compatibility test suites 7946ee1 [Cheng Lian] Fixes Scala styling issues 3d7ab36 [Cheng Lian] Fixes .rat-excludes a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation 1d390aa [Cheng Lian] Adds parquet-thrift compatibility test 440f7b3 [Cheng Lian] Adds generated files to .rat-excludes 13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite 06cfe9d [Cheng Lian] Adds comments about TimestampType handling a099d3e [Cheng Lian] More comments 0cc1b37 [Cheng Lian] Fixes MiMa checks 884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes 802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns 38fe1e7 [Cheng Lian] Adds explicit return type 7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change 1781dff [Cheng Lian] Adds test case for SPARK-8811 6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals a74fb2c [Cheng Lian] More comments 0525346 [Cheng Lian] Removes old Parquet record converters 03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
* [SPARK-8908] [SQL] Add () to distinct definition in dataframeCheolsoo Park2015-07-081-1/+1
| | | | | | | | | | Adding `()` to the definition of `distinct` in DataFrame allows distinct to be called with parentheses, which is consistent with `dropDuplicates`. Author: Cheolsoo Park <cheolsoop@netflix.com> Closes #7298 from piaozhexiu/SPARK-8908 and squashes the following commits: 7f0d923 [Cheolsoo Park] Add () to distinct definition in dataframe
* [SPARK-8783] [SQL] CTAS with WITH clause does not workKeuntae Park2015-07-082-1/+19
| | | | | | | | | | | | | Currently, CTESubstitution only handles the case where WITH is at the top of the plan. I think it SHOULD also handle the case where WITH is a child of CTAS. This patch simply changes 'match' to 'transform' for a recursive search of WITH in the plan. Author: Keuntae Park <sirpkt@apache.org> Closes #7180 from sirpkt/SPARK-8783 and squashes the following commits: e4428f0 [Keuntae Park] Merge remote-tracking branch 'upstream/master' into CTASwithWITH 1671c77 [Keuntae Park] WITH clause can be inside CTAS
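A toy plan tree (not Catalyst itself) showing why `transform` fixes this: a top-level pattern match only sees a WITH node at the root, while a recursive transform also rewrites a WITH sitting under a CTAS node.

```scala
sealed trait Plan {
  def mapChildren(f: Plan => Plan): Plan
  // Bottom-up rewrite applied at every node of the tree.
  def transform(rule: PartialFunction[Plan, Plan]): Plan = {
    val withNewChildren = mapChildren(_.transform(rule))
    rule.applyOrElse(withNewChildren, identity[Plan])
  }
}
case class Relation(name: String) extends Plan {
  def mapChildren(f: Plan => Plan): Plan = this
}
case class With(child: Plan, cteName: String) extends Plan {
  def mapChildren(f: Plan => Plan): Plan = copy(child = f(child))
}
case class CreateTableAsSelect(table: String, query: Plan) extends Plan {
  def mapChildren(f: Plan => Plan): Plan = copy(query = f(query))
}

// 'transform' visits every node, so a WITH nested under CTAS is rewritten;
// a top-level 'plan match { case With(...) => ... }' would miss it.
def substituteCTE(plan: Plan): Plan = plan.transform {
  case With(child, _) => child // simplified: the real code inlines the CTE definitions
}
```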
* [SPARK-8888][SQL] Use java.util.HashMap in DynamicPartitionWriterContainer.Reynold Xin2015-07-081-13/+23
| | | | | | | | | | Just a baby step towards making it more efficient. Author: Reynold Xin <rxin@databricks.com> Closes #7282 from rxin/SPARK-8888 and squashes the following commits: 3da51ae [Reynold Xin] [SPARK-8888][SQL] Use java.util.HashMap in DynamicPartitionWriterContainer.
* [SPARK-8753][SQL] Create an IntervalType data typeWenchen Fan2015-07-086-20/+138
| | | | | | | | | | | | | | | | | | | | We need a new data type to represent time intervals. Because we can't determine how many days are in a month, we need two values for an interval: an int `months` and a long `microseconds`. The interval literal syntax looks like: `interval 3 years -4 month 4 weeks 3 second` Because we use the number of 100ns units as the value of `TimestampType`, it may not make sense to support a nanosecond unit. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7226 from cloud-fan/interval and squashes the following commits: 632062d [Wenchen Fan] address comments ac348c3 [Wenchen Fan] use case class 0342d2e [Wenchen Fan] use array byte df9256c [Wenchen Fan] fix style fd6f18a [Wenchen Fan] address comments 1856af3 [Wenchen Fan] support interval type
* [SPARK-5707] [SQL] fix serialization of generated projectionDavies Liu2015-07-083-4/+3
| | | | | | | | Author: Davies Liu <davies@databricks.com> Closes #7272 from davies/fix_projection and squashes the following commits: 075ef76 [Davies Liu] fix codegen with BroadcastHashJion
* [SPARK-6912] [SQL] Throw an AnalysisException when unsupported Java Map<K,V> ↵Takeshi YAMAMURO2015-07-084-0/+108
| | | | | | | | | | | | | types used in Hive UDF To help UDF developers understand, throw an exception when unsupported Map<K,V> types are used in a Hive UDF. This fix is the same as #7248. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #7257 from maropu/ThrowExceptionWhenMapUsed and squashes the following commits: 916099a [Takeshi YAMAMURO] Fix style errors 7886dcc [Takeshi YAMAMURO] Throw an exception when Map<> used in Hive UDF
* [SPARK-8785] [SQL] Improve Parquet schema mergingLiang-Chi Hsieh2015-07-081-34/+48
| | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8785 Currently, the parquet schema merging (`ParquetRelation2.readSchema`) may spend a lot of time merging duplicate schemas. We can select only the distinct schemas and merge them later. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7182 from viirya/improve_parquet_merging and squashes the following commits: 5cf934f [Liang-Chi Hsieh] Refactor it to make it faster. f3411ea [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into improve_parquet_merging a63c3ff [Liang-Chi Hsieh] Improve Parquet schema merging.
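The optimization reduces to deduplicating before folding the merge; a sketch with hypothetical types. Many part-files typically share one of a handful of schemas, so `distinct` shrinks the merge input dramatically.

```scala
// Merge only the distinct schemas instead of folding over every part-file.
// Requires a non-empty input, as reading a table with zero part-files would.
def mergeSchemas[S](partFileSchemas: Seq[S])(merge: (S, S) => S): S =
  partFileSchemas.distinct.reduceLeft(merge)

// e.g. 10,000 part-files but only 2 distinct schemas => 1 merge instead of 9,999.
```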
* [SPARK-8883][SQL]Remove the OverrideFunctionRegistryCheng Hao2015-07-084-17/+3
| | | | | | | | | | | Remove the `OverrideFunctionRegistry` from Spark SQL, as the subclasses of `FunctionRegistry` have their own way to delegate to the right underlying `FunctionRegistry`. Author: Cheng Hao <hao.cheng@intel.com> Closes #7260 from chenghao-intel/override and squashes the following commits: 164d093 [Cheng Hao] enable the function registry 2ca8459 [Cheng Hao] remove the OverrideFunctionRegistry
* [SPARK-8879][SQL] Remove EmptyRow class.Reynold Xin2015-07-073-20/+9
| | | | | | | | | | As a baby step towards no megamorphic InternalRow. Author: Reynold Xin <rxin@databricks.com> Closes #7277 from rxin/remove-empty-row and squashes the following commits: 594100e [Reynold Xin] [SPARK-8879][SQL] Remove EmptyRow class.
* [SPARK-8878][SQL] Improve unit test coverage for bitwise expressions.Reynold Xin2015-07-071-47/+61
| | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #7273 from rxin/bitwise-unittest and squashes the following commits: 60c5667 [Reynold Xin] [SPARK-8878][SQL] Improve unit test coverage for bitwise expressions.
* [SPARK-8868] SqlSerializer2 can go into infinite loop when row consists only ↵Yin Huai2015-07-072-6/+39
| | | | | | | | | | | | | | of NullType columns https://issues.apache.org/jira/browse/SPARK-8868 Author: Yin Huai <yhuai@databricks.com> Closes #7262 from yhuai/SPARK-8868 and squashes the following commits: cb58780 [Yin Huai] Andrew's comment. e456857 [Yin Huai] Josh's comments. 5122e65 [Yin Huai] If types of all columns are NullTypes, do not use serializer2.
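A sketch of the guard that the fix amounts to, with hypothetical types: a row made up only of NullType columns carries no bytes, so the special-case serializer must not be used for it.

```scala
sealed trait DataType
case object NullType extends DataType
case object IntType extends DataType // hypothetical non-null example type

// If every output column is NullType, there is nothing to read per row and
// a length-driven read loop would never advance: fall back to the default serializer.
def canUseSerializer2(schema: Seq[DataType]): Boolean =
  !schema.forall(_ == NullType)
```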
* [SPARK-7190] [SPARK-8804] [SPARK-7815] [SQL] unsafe UTF8StringDavies Liu2015-07-073-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let UTF8String work with a binary buffer. Until we have a better idea of how to manage the lifecycle of UTF8String in Row, we still do the copy when calling `UnsafeRow.get()` for StringType. cc rxin JoshRosen Author: Davies Liu <davies@databricks.com> Closes #7197 from davies/unsafe_string and squashes the following commits: 51b0ea0 [Davies Liu] fix test 50c1ebf [Davies Liu] remove optimization for upper/lower case 315d491 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_string 93fce17 [Davies Liu] address comment e9ff7ba [Davies Liu] clean up 67ec266 [Davies Liu] fix bug 7b74b1f [Davies Liu] fallback to String if locale dependent ab7857c [Davies Liu] address comments 7da92f5 [Davies Liu] handle locale in toUpperCase/toLowerCase 59dbb23 [Davies Liu] revert python change d1e0716 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_string 002e35f [Davies Liu] rollback hashCode change a87b7a8 [Davies Liu] improve toLowerCase and toUpperCase 76e794a [Davies Liu] fix test 8b2d5ce [Davies Liu] fix tests fd3f0a6 [Davies Liu] bug fix c4e9c88 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_string c45d921 [Davies Liu] address comments 175405f [Davies Liu] unsafe UTF8String
* [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.Reynold Xin2015-07-0785-48/+114
| | | | | | | | | | The type alias was there because initially when I moved Row around, I didn't want to do massive changes to the expression code. But now it should be pretty easy to just remove it. One less concept to worry about. Author: Reynold Xin <rxin@databricks.com> Closes #7270 from rxin/internalrow and squashes the following commits: 72fc842 [Reynold Xin] [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.
* [SPARK-8794] [SQL] Make PrunedScan work for SampleLiang-Chi Hsieh2015-07-072-0/+34
| | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8794 Currently `PrunedScan` works only when followed by project or filter operations. However, even if there is a `Sample` between these operations and `PrunedScan`, `PrunedScan` should work too. Author: Liang-Chi Hsieh <viirya@appier.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #7228 from viirya/sample_prunedscan and squashes the following commits: ede7cd8 [Liang-Chi Hsieh] Keep PrunedScanSuite untouched. 6f05d30 [Liang-Chi Hsieh] Move unit test to FilterPushdownSuite. 5f32473 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sample_prunedscan 7e4ba76 [Liang-Chi Hsieh] Use Optimzier for push down projection and filter. 0686830 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sample_prunedscan df82785 [Liang-Chi Hsieh] Make PrunedScan work on Sample.
* [SPARK-8759][SQL] add default eval to binary and unary expression according ↵Wenchen Fan2015-07-0611-543/+292
| | | | | | | | | | | | | to default behavior of nullable We have `nullSafeCodeGen` to provide default code generation for binary and unary expressions, and we can do the same thing for `eval`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7157 from cloud-fan/refactor and squashes the following commits: f3987c6 [Wenchen Fan] refactor Expression
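The pattern being introduced, sketched for a unary expression with simplified names (the real code works on Catalyst rows): the base class performs the null check once, and subclasses implement only the non-null case, mirroring what `nullSafeCodeGen` already does for generated code.

```scala
trait ToyUnaryExpression {
  def evalChild(input: Any): Any    // stand-in for evaluating the child expression
  def nullSafeEval(value: Any): Any // subclasses: only the non-null case

  // Default eval: null in => null out, matching the default nullable behavior.
  final def eval(input: Any): Any = {
    val v = evalChild(input)
    if (v == null) null else nullSafeEval(v)
  }
}

// Example: a toy 'Abs' only has to say what happens to a non-null Int.
class ToyAbs(child: Any => Any) extends ToyUnaryExpression {
  def evalChild(input: Any): Any = child(input)
  def nullSafeEval(value: Any): Any = math.abs(value.asInstanceOf[Int])
}
```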
* [SPARK-6747] [SQL] Throw an AnalysisException when unsupported Java list ↵Takeshi YAMAMURO2015-07-064-2/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | types used in Hive UDF The current implementation can't handle List<> as a return type in a Hive UDF and throws a meaningless MatchError. We assume a UDF below: public class UDFToListString extends UDF { public List<String> evaluate(Object o) { return Arrays.asList("xxx", "yyy", "zzz"); } } An exception of scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To give UDF developers a better idea of what went wrong, we need to throw a more suitable exception. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #7248 from maropu/FixBugInHiveInspectors and squashes the following commits: 1c3df2a [Takeshi YAMAMURO] Fix comments 56305de [Takeshi YAMAMURO] Fix conflicts 92ed7a6 [Takeshi YAMAMURO] Throw an exception when java list type used 2844a8e [Takeshi YAMAMURO] Apply comments 7114a47 [Takeshi YAMAMURO] Add TODO comments in UDFToListString of HiveUdfSuite fdb2ae4 [Takeshi YAMAMURO] Add StringToUtf8 to convert String into UTF8String af61f2e [Takeshi YAMAMURO] Remove a new type 7f812fd [Takeshi YAMAMURO] Fix code-style errors 6984bf4 [Takeshi YAMAMURO] Apply review comments 93e3d4e [Takeshi YAMAMURO] Add a blank line at the end of UDFToListString ee232db [Takeshi YAMAMURO] Support List as a return type in Hive UDF 1e82316 [Takeshi YAMAMURO] Apply comments 21e8763 [Takeshi YAMAMURO] Add TODO comments in UDFToListString of HiveUdfSuite a488712 [Takeshi YAMAMURO] Add StringToUtf8 to convert String into UTF8String 1c7b9d1 [Takeshi YAMAMURO] Remove a new type f965c34 [Takeshi YAMAMURO] Fix code-style errors 9406416 [Takeshi YAMAMURO] Apply review comments e21ce7e [Takeshi YAMAMURO] Add a blank line at the end of UDFToListString e553f10 [Takeshi YAMAMURO] Support List as a return type in Hive UDF
* [SPARK-8463][SQL] Use DriverRegistry to load jdbc driver at writing pathLiang-Chi Hsieh2015-07-061-5/+6
| | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8463 Currently, on the read path, `DriverRegistry` is used to load the needed JDBC driver on executors. However, we also need `DriverRegistry` to load the JDBC driver on the write path. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6900 from viirya/jdbc_write_driver and squashes the following commits: 16cd04b [Liang-Chi Hsieh] Use DriverRegistry to load jdbc driver at writing path.
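The symmetry the fix restores can be sketched like this (the registration call is a placeholder; `Class.forName` stands in for whatever `DriverRegistry` does): both paths must ensure the driver class is registered on the executor before asking `DriverManager` for a connection.

```scala
import java.sql.{Connection, DriverManager}

def withWriteConnection[T](url: String, driverClass: String)(body: Connection => T): T = {
  // Previously only the read path performed this registration step on executors.
  Class.forName(driverClass) // placeholder for DriverRegistry loading the driver
  val conn = DriverManager.getConnection(url)
  try body(conn) finally conn.close()
}
```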
* [SPARK-8072] [SQL] Better AnalysisException for writing DataFrame with ↵animesh2015-07-063-1/+73
| | | | | | | | | | | | | | | | | | | | | | identically named columns Adds a function checkConstraints that checks the constraints to be applied to the DataFrame / DataFrame schema. The function is called before storing the DataFrame to external storage, and is added in the corresponding data source API. cc rxin marmbrus Author: animesh <animesh@apache.spark> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #7013 from animeshbaranawal/8072 and squashes the following commits: f70dd0e [animesh] Change IO exception to Analysis Exception fd45e1b [animesh] 8072: Fix Style Issues a8a964f [animesh] 8072: Improving on previous commits 3cc4d2c [animesh] Fix Style Issues 1a89115 [animesh] Fix Style Issues 98b4399 [animesh] 8072 : Moved the exception handling to ResolvedDataSource specific to parquet format 7c3d928 [animesh] 8072: Adding check to DataFrameWriter.scala
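A sketch of the kind of check described (`checkConstraints` is the patch's name; this body is illustrative, and the case-insensitive grouping is an assumption): find column names that appear more than once and fail before any data is written.

```scala
// Fail fast on duplicate column names instead of surfacing a write-time error.
def checkDuplicateColumns(columnNames: Seq[String]): Unit = {
  val duplicates = columnNames
    .groupBy(_.toLowerCase) // assumed: names compared case-insensitively
    .collect { case (name, occurrences) if occurrences.size > 1 => name }
  if (duplicates.nonEmpty) {
    // The real patch throws AnalysisException; a stand-in exception is used here.
    throw new IllegalArgumentException(
      s"Duplicate column(s) ${duplicates.mkString(", ")} found, cannot save.")
  }
}
```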
* [SPARK-8588] [SQL] Regression testYin Huai2015-07-062-0/+37
| | | | | | | | | | | | | | This PR adds regression test for https://issues.apache.org/jira/browse/SPARK-8588 (fixed by https://github.com/apache/spark/commit/457d07eaa023b44b75344110508f629925eb6247). Author: Yin Huai <yhuai@databricks.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #7103 from yhuai/SPARK-8588-test and squashes the following commits: eb5f418 [Yin Huai] Add a query test. c61a173 [Yin Huai] Regression test for SPARK-8588.
* [MINOR] [SQL] remove unused code in ExchangeDaoyuan Wang2015-07-061-14/+0
| | | | | | | | Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #7234 from adrian-wang/exchangeclean and squashes the following commits: b093ec9 [Daoyuan Wang] remove unused code
* [SPARK-4485] [SQL] 1) Add broadcast hash outer join, (2) Fix SparkPlanTestkai2015-07-067-99/+441
| | | | | | | | | | | | | | | | | This pull request (1) extracts common functions used by hash outer joins and puts them in the interface HashOuterJoin, (2) adds ShuffledHashOuterJoin and BroadcastHashOuterJoin, (3) adds test cases for shuffled and broadcast hash outer join, and (4) makes SparkPlanTest support binary or more complex operators, fixing bugs in plan composition in SparkPlanTest. Author: kai <kaizeng@eecs.berkeley.edu> Closes #7162 from kai-zeng/outer and squashes the following commits: 3742359 [kai] Fix not-serializable exception for code-generated keys in broadcasted relations 14e4bf8 [kai] Use CanBroadcast in broadcast outer join planning dc5127e [kai] code style fixes b5a4efa [kai] (1) Add broadcast hash outer join, (2) Fix SparkPlanTest
* [SPARK-8784] [SQL] Add Python API for hex and unhexDavies Liu2015-07-064-47/+65
| | | | | | | | | | | | | | | | | | Add Python API for hex/unhex, and also clean up Hex/Unhex. Author: Davies Liu <davies@databricks.com> Closes #7223 from davies/hex and squashes the following commits: 6f1249d [Davies Liu] no explicit rule to cast string into binary 711a6ed [Davies Liu] fix test f9fe5a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex 49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex b31fc9a [Davies Liu] Update math.scala 25156b7 [Davies Liu] address comments and fix test c3af78c [Davies Liu] address commments 1a24082 [Davies Liu] Add Python API for hex and unhex
* [SPARK-8837][SPARK-7114][SQL] support using keyword in column nameWenchen Fan2015-07-062-10/+27
| | | | | | | | Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7237 from cloud-fan/parser and squashes the following commits: e7b49bb [Wenchen Fan] support using keyword in column name
* [SPARK-8841] [SQL] Fix partition pruning percentage log messageSteve Lindemann2015-07-061-1/+1
| | | | | | | | | | | | | When pruning partitions for a query plan, a message is logged indicating how many partitions were selected based on predicate criteria, and what percent were pruned. The current release erroneously uses `1 - total/selected` to compute this quantity, leading to nonsense messages like "pruned -1000% partitions". The fix is simple and obvious. Author: Steve Lindemann <steve.lindemann@engineersgatelp.com> Closes #7227 from srlindemann/master and squashes the following commits: c788061 [Steve Lindemann] fix percentPruned log message
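The one-line bug is easiest to see with numbers: with 100 of 1000 partitions selected, the old expression reports a negative percentage while the corrected one reports 90%.

```scala
val total = 1000.0
val selected = 100.0

val buggy = (1 - total / selected) * 100 // -900.0 -> "pruned -900% partitions"
val fixed = (1 - selected / total) * 100 //   90.0 -> "pruned 90% partitions"
```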
* [SPARK-8831][SQL] Support AbstractDataType in TypeCollection.Reynold Xin2015-07-053-6/+12
| | | | | | | | | | Otherwise it is impossible to declare an expression supporting DecimalType. Author: Reynold Xin <rxin@databricks.com> Closes #7232 from rxin/typecollection-adt and squashes the following commits: 934d3d1 [Reynold Xin] [SPARK-8831][SQL] Support AbstractDataType in TypeCollection.
* [SQL][Minor] Update the DataFrame API for encode/decodeCheng Hao2015-07-053-18/+25
| | | | | | | | | | This is a follow-up of #6843. Author: Cheng Hao <hao.cheng@intel.com> Closes #7230 from chenghao-intel/str_funcs2_followup and squashes the following commits: 52cc553 [Cheng Hao] update the code as comment
* [MINOR] [SQL] Minor fix for CatalystSchemaConverterLiang-Chi Hsieh2015-07-042-7/+7
| | | | | | | | | | ping liancheng Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #7224 from viirya/few_fix_catalystschema and squashes the following commits: d994330 [Liang-Chi Hsieh] Minor fix for CatalystSchemaConverter.
* [SPARK-8822][SQL] clean up type checking in math.scala.Reynold Xin2015-07-042-168/+123
| | | | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #7220 from rxin/SPARK-8822 and squashes the following commits: 0cda076 [Reynold Xin] Test cases. 22d0463 [Reynold Xin] Fixed type precedence. beb2a97 [Reynold Xin] [SPARK-8822][SQL] clean up type checking in math.scala.
* [SQL] More unit tests for implicit type cast & add simpleString to ↵Reynold Xin2015-07-047-4/+42
| | | | | | | | | | | AbstractDataType. Author: Reynold Xin <rxin@databricks.com> Closes #7221 from rxin/implicit-cast-tests and squashes the following commits: 64b13bd [Reynold Xin] Fixed a bug .. 489b732 [Reynold Xin] [SQL] More unit tests for implicit type cast & add simpleString to AbstractDataType.
* Fixed minor style issue with the previous merge.Reynold Xin2015-07-041-2/+2
|