path: root/sql
Commit message | Author | Age | Files | Lines
* [HOTFIX] [SPARK-8489] Correct JIRA number in previous commit | Andrew Or | 2015-06-19 | 4 | -8/+8
It should be SPARK-8489, not SPARK-8498.
* [SPARK-8498] [SQL] Add regression test for SPARK-8470 | Andrew Or | 2015-06-19 | 4 | -0/+76
**Summary of the problem in SPARK-8470.** When using `HiveContext` to create a data frame of a user case class, Spark throws `scala.reflect.internal.MissingRequirementError` when it tries to infer the schema using reflection. This is caused by `HiveContext` silently overwriting the context class loader containing the user classes.

**What this issue is about.** This issue adds regression tests for SPARK-8470, which is already fixed in #6891. We closed SPARK-8470 as a duplicate because it is a different manifestation of the same problem in SPARK-8368. Due to the complexity of the reproduction, this requires us to pre-package a special test jar and include it in the Spark project itself. I tested this with and without the fix in #6891 and verified that it passes only if the fix is present.

Author: Andrew Or <andrew@databricks.com>
Closes #6909 from andrewor14/SPARK-8498 and squashes the following commits:
5e9d688 [Andrew Or] Add regression test for SPARK-8470
* [SPARK-8420] [SQL] Fix comparision of timestamps/dates with strings | Michael Armbrust | 2015-06-19 | 6 | -11/+88
In earlier versions of Spark SQL we cast `TimestampType` and `DateType` to `StringType` when they were involved in a binary comparison with a `StringType`. This allowed comparing a timestamp with a partial date as a user would expect:
- `time > "2014-06-10"`
- `time > "2014"`

In 1.4.0 we tried to cast the String into a Timestamp instead. However, since a partial date is not a valid complete timestamp, the cast yields `null`, which results in the tuple being filtered out. This PR restores the earlier behavior. Note that we still special-case equality so that these comparisons are not affected by not printing zeros for subsecond precision.

Author: Michael Armbrust <michael@databricks.com>
Closes #6888 from marmbrus/timeCompareString and squashes the following commits:
bdef29c [Michael Armbrust] test partial date
1f09adf [Michael Armbrust] special handling of equality
1172c60 [Michael Armbrust] more test fixing
4dfc412 [Michael Armbrust] fix tests
aaa9508 [Michael Armbrust] newline
04d908f [Michael Armbrust] [SPARK-8420][SQL] Fix comparision of timestamps/dates with strings
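The difference between the two casting directions described in SPARK-8420 can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: the object and method names are hypothetical, `java.sql.Timestamp.valueOf` stands in for Spark SQL's string-to-timestamp cast, and `None` stands in for SQL `null`.

```scala
import java.sql.Timestamp
import scala.util.Try

// Hypothetical stand-in for the two comparison strategies; not Spark's implementation.
object TimestampStringComparison {
  // Strict parse: partial dates such as "2014" or "2014-06-10" are rejected,
  // which models the cast that produced null in 1.4.0.
  def parseStrict(s: String): Option[Timestamp] =
    Try(Timestamp.valueOf(s)).toOption

  // 1.4.0-style: cast the string to a timestamp, then compare.
  // None models SQL null, which causes the row to be filtered out.
  def greaterViaTimestampCast(time: Timestamp, s: String): Option[Boolean] =
    parseStrict(s).map(parsed => time.after(parsed))

  // Restored behavior: render the timestamp as a string and compare lexicographically.
  def greaterViaStringCast(time: Timestamp, s: String): Boolean =
    time.toString > s

  def main(args: Array[String]): Unit = {
    val time = Timestamp.valueOf("2014-06-11 08:30:00")
    println(greaterViaTimestampCast(time, "2014-06-10"))
    println(greaterViaStringCast(time, "2014-06-10"))
    println(greaterViaStringCast(time, "2014"))
  }
}
```

Lexicographic comparison works here because the `yyyy-MM-dd HH:mm:ss` rendering sorts in the same order as the underlying instants, so a partial date behaves as a prefix bound.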
* [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents | Nathan Howell | 2015-06-19 | 3 | -17/+48
Author: Nathan Howell <nhowell@godaddy.com>
Closes #6799 from NathanHowell/spark-8093 and squashes the following commits:
76ac3e8 [Nathan Howell] [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents
* [SPARK-8461] [SQL] fix codegen with REPL class loader | Davies Liu | 2015-06-19 | 5 | -34/+18
The REPL's ExecutorClassLoader causes Janino to fail to find classes in java.lang, so switch to the default class loader for Janino, which also helps performance.

cc liancheng yhuai

Author: Davies Liu <davies@databricks.com>
Closes #6898 from davies/fix_class_loader and squashes the following commits:
24276d4 [Davies Liu] add regression test
4ff0457 [Davies Liu] address comment, refactor
7f5ffbe [Davies Liu] fix REPL class loader with codegen
* [SPARK-8368] [SPARK-8058] [SQL] HiveContext may override the context class loader of the current thread | Yin Huai | 2015-06-19 | 5 | -15/+219
https://issues.apache.org/jira/browse/SPARK-8368

Also, I added tests according to https://issues.apache.org/jira/browse/SPARK-8058.

Author: Yin Huai <yhuai@databricks.com>
Closes #6891 from yhuai/SPARK-8368 and squashes the following commits:
37bb3db [Yin Huai] Update test timeout and comment.
8762eec [Yin Huai] Style.
695cd2d [Yin Huai] Correctly set the class loader in the conf of the state in client wrapper.
b3378fe [Yin Huai] Failed tests.
* [SPARK-8234][SQL] misc function: md5 | Shilei | 2015-06-19 | 5 | -0/+117
| | | | | | | | | | | | | | | | Author: Shilei <shilei.qian@intel.com> Closes #6779 from qiansl127/MD5 and squashes the following commits: 11fcdb2 [Shilei] Fix the indent 04bd27b [Shilei] Add codegen da60eb3 [Shilei] Remove checkInputDataTypes function 9509ad0 [Shilei] Format code 12c61f4 [Shilei] Accept only BinaryType for Md5 1df0b5b [Shilei] format to scala type 60ccde1 [Shilei] Add more test case b8c73b4 [Shilei] Rewrite the type check for Md5 c166167 [Shilei] Add md5 function
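For readers unfamiliar with what an `md5` expression over binary input computes, here is a minimal sketch using the JDK's `MessageDigest`. It assumes the result is rendered as a lowercase hex string; the object and method names are hypothetical and this is not the code added by the commit.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper; illustrates MD5 over a byte array (cf. "Accept only BinaryType for Md5").
object Md5Sketch {
  // Digest the bytes and render each byte as two lowercase hex characters.
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5")
      .digest(bytes)
      .map(b => "%02x".format(b & 0xff))
      .mkString

  def main(args: Array[String]): Unit =
    println(md5Hex("abc".getBytes(StandardCharsets.UTF_8)))
}
```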
* [SPARK-8207] [SQL] Add math function bin | Liang-Chi Hsieh | 2015-06-19 | 5 | -8/+88
| | | | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8207 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6721 from viirya/expr_bin and squashes the following commits: 07e1c8f [Liang-Chi Hsieh] Remove AbstractUnaryMathExpression and let BIN inherit UnaryExpression. 0677f1a [Liang-Chi Hsieh] For comments. cf62b95 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin 0cf20f2 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin dea9c12 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin d4f4774 [Liang-Chi Hsieh] Add @ignore_unicode_prefix. 7a0196f [Liang-Chi Hsieh] Fix python style. ac2bacd [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin a0a2d0f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin 4cb764d [Liang-Chi Hsieh] For comments. 0f78682 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin c0c3197 [Liang-Chi Hsieh] Add bin to FunctionRegistry. 824f761 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin 50e0c3b [Liang-Chi Hsieh] Add math function bin(a: long): string.
* [SPARK-8348][SQL] Add in operator to DataFrame Column | Yu ISHIKAWA | 2015-06-18 | 2 | -1/+17
I have added it for Scala only.

TODO: we should also support the `in` operator in Python.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #6824 from yu-iskw/SPARK-8348 and squashes the following commits:
e76d02f [Yu ISHIKAWA] Not use infix notation
6f744ac [Yu ISHIKAWA] Fit the test cases because these used the old test data set.
00077d3 [Yu ISHIKAWA] [SPARK-8348][SQL] Add in operator to DataFrame Column
* [SPARK-8458] [SQL] Don't strip scheme part of output path when writing ORC files | Cheng Lian | 2015-06-18 | 1 | -1/+1
`Path.toUri.getPath` strips the scheme part of the output path (from `file:///foo` to `/foo`), which causes the ORC data source to write only to the file system configured in the Hadoop configuration. We should use `Path.toString` instead.

Author: Cheng Lian <lian@databricks.com>
Closes #6892 from liancheng/spark-8458 and squashes the following commits:
87f8199 [Cheng Lian] Don't strip scheme of output path when writing ORC files
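The scheme-stripping effect in SPARK-8458 can be reproduced with the JDK's `java.net.URI`, used here as a stand-in for Hadoop's `Path` (an assumption made so the sketch needs no Hadoop dependency). The object and method names are hypothetical.

```scala
import java.net.URI

// Stand-in for Hadoop's Path.toUri.getPath versus Path.toString; names are illustrative only.
object SchemeStrippingSketch {
  // Keeps only the path component, dropping scheme and authority.
  def pathOnly(location: String): String = new URI(location).getPath

  // Keeps the full location, including the scheme.
  def fullLocation(location: String): String = new URI(location).toString

  def main(args: Array[String]): Unit = {
    println(pathOnly("file:///foo"))
    println(fullLocation("file:///foo"))
  }
}
```

Once the scheme is gone, a writer has no way to tell `file:///foo` from `/foo` on the default file system, which is the failure mode the commit describes.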
* [SPARK-8135] Don't load defaults when reconstituting Hadoop Configurations | Sandy Ryza | 2015-06-18 | 10 | -27/+30
Author: Sandy Ryza <sandy@cloudera.com>
Closes #6679 from sryza/sandy-spark-8135 and squashes the following commits:
c5554ff [Sandy Ryza] SPARK-8135. In SerializableWritable, don't load defaults when instantiating Configuration
* [SPARK-8218][SQL] Binary log math function update. | Reynold Xin | 2015-06-18 | 1 | -0/+4
Some minor updates after merging #6725.

Author: Reynold Xin <rxin@databricks.com>
Closes #6871 from rxin/log and squashes the following commits:
ab51542 [Reynold Xin] Use JVM log
76fc8de [Reynold Xin] Fixed arg.
a7c1522 [Reynold Xin] [SPARK-8218][SQL] Binary log math function update.
* [SPARK-8446] [SQL] Add helper functions for testing SparkPlan physical operators | Josh Rosen | 2015-06-18 | 2 | -0/+211
| | | | | | | | | | | | | | | | | | | | | This patch introduces `SparkPlanTest`, a base class for unit tests of SparkPlan physical operators. This is analogous to Spark SQL's existing `QueryTest`, which does something similar for end-to-end tests with actual queries. These helper methods provide nicer error output when tests fail and help developers to avoid writing lots of boilerplate in order to execute manually constructed physical plans. Author: Josh Rosen <joshrosen@databricks.com> Author: Josh Rosen <rosenville@gmail.com> Author: Michael Armbrust <michael@databricks.com> Closes #6885 from JoshRosen/spark-plan-test and squashes the following commits: f8ce275 [Josh Rosen] Fix some IntelliJ inspections and delete some dead code 84214be [Josh Rosen] Add an extra column which isn't part of the sort ae1896b [Josh Rosen] Provide implicits automatically a80f9b0 [Josh Rosen] Merge pull request #4 from marmbrus/pr/6885 d9ab1e4 [Michael Armbrust] Add simple resolver c60a44d [Josh Rosen] Manually bind references 996332a [Josh Rosen] Add types so that tests compile a46144a [Josh Rosen] WIP
* [SPARK-8363][SQL] Move sqrt to math and extend UnaryMathExpression | Liang-Chi Hsieh | 2015-06-18 | 8 | -51/+31
| | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8363 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6823 from viirya/move_sqrt and squashes the following commits: 8977e11 [Liang-Chi Hsieh] Remove unnecessary old tests. d23e79e [Liang-Chi Hsieh] Explicitly indicate sqrt value sequence. 699f48b [Liang-Chi Hsieh] Use correct @since tag. 8dff6d1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into move_sqrt bc2ed77 [Liang-Chi Hsieh] Remove/move arithmetic expression test and expression type checking test. Remove unnecessary Sqrt type rule. d38492f [Liang-Chi Hsieh] Now sqrt accepts boolean because type casting is handled by HiveTypeCoercion. 297cc90 [Liang-Chi Hsieh] Sqrt only accepts double input. ef4a21a [Liang-Chi Hsieh] Move sqrt to math.
* [SPARK-8283][SQL] Resolve udf_struct test failure in HiveCompatibilitySuite | Yijie Shen | 2015-06-17 | 2 | -5/+10
| | | | | | | | | | | | | | | | This PR aimed to resolve udf_struct test failure in HiveCompatibilitySuite. Currently, this is done by loosening CreateStruct's children type from NamedExpression to Expression and automatically generating StructField name for non-NamedExpression children. The naming convention for unnamed children follows the udf's counterpart in Hive: `col1, col2, col3, ...` Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #6828 from yijieshen/SPARK-8283 and squashes the following commits: 6052b73 [Yijie Shen] Doc fix 677e0b7 [Yijie Shen] Resolve udf_struct test failure by automatically generate structField name for non-NamedExpression children
* [SPARK-8218][SQL] Add binary log math function | Liang-Chi Hsieh | 2015-06-17 | 5 | -0/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8218 Because there is already `log` unary function defined, the binary log function is called `logarithm` for now. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6725 from viirya/expr_binary_log and squashes the following commits: bf96bd9 [Liang-Chi Hsieh] Compare log result in string. 102070d [Liang-Chi Hsieh] Round log result to better comparing in python test. fd01863 [Liang-Chi Hsieh] For comments. beed631 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log 6089d11 [Liang-Chi Hsieh] Remove unnecessary override. 8cf37b7 [Liang-Chi Hsieh] For comments. bc89597 [Liang-Chi Hsieh] For comments. db7dc38 [Liang-Chi Hsieh] Use ctor instead of companion object. 0634ef7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log 1750034 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log 3d75bfc [Liang-Chi Hsieh] Fix scala style. 5b39c02 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log 23c54a3 [Liang-Chi Hsieh] Fix scala style. ebc9929 [Liang-Chi Hsieh] Let Logarithm accept one parameter too. 605574d [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log 21c3bfd [Liang-Chi Hsieh] Fix scala style. c6c187f [Liang-Chi Hsieh] For comments. c795342 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log f373bac [Liang-Chi Hsieh] Add binary log expression.
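A binary logarithm function of the kind SPARK-8218 adds can be expressed through the change-of-base identity, log_b(x) = ln(x) / ln(b). The sketch below is an illustration only: the object name, the argument order (base first), and the one-argument natural-log overload are assumptions, not the signature from the commit.

```scala
// Hypothetical helper; argument order (base, x) is an assumption.
object BinaryLogSketch {
  // Two-argument form: logarithm of x in the given base, via change of base.
  def logarithm(base: Double, x: Double): Double = math.log(x) / math.log(base)

  // One-argument form: natural logarithm (cf. "Let Logarithm accept one parameter too").
  def logarithm(x: Double): Double = math.log(x)

  def main(args: Array[String]): Unit = {
    println(logarithm(2.0, 8.0))
    println(logarithm(math.E))
  }
}
```

Because the quotient of two floating-point logarithms is not always exact, results are best compared with a tolerance, which is consistent with the squashed commits that round or stringify the result in tests.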
* [SPARK-7961][SQL]Refactor SQLConf to display better error message | zsxwing | 2015-06-17 | 36 | -294/+859
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. Add `SQLConfEntry` to store the information about a configuration. For those configurations that cannot be found in `sql-programming-guide.md`, I left the doc as `<TODO>`. 2. Verify the value when setting a configuration if this is in SQLConf. 3. Use `SET -v` to display all public configurations. Author: zsxwing <zsxwing@gmail.com> Closes #6747 from zsxwing/sqlconf and squashes the following commits: 7d09bad [zsxwing] Use SQLConfEntry in HiveContext 49f6213 [zsxwing] Add getConf, setConf to SQLContext and HiveContext e014f53 [zsxwing] Merge branch 'master' into sqlconf 93dad8e [zsxwing] Fix the unit tests cf950c1 [zsxwing] Fix the code style and tests 3c5f03e [zsxwing] Add unsetConf(SQLConfEntry) and fix the code style a2f4add [zsxwing] getConf will return the default value if a config is not set 037b1db [zsxwing] Add schema to SetCommand 0520c3c [zsxwing] Merge branch 'master' into sqlconf 7afb0ec [zsxwing] Fix the configurations about HiveThriftServer 7e728e3 [zsxwing] Add doc for SQLConfEntry and fix 'toString' 5e95b10 [zsxwing] Add enumConf c6ba76d [zsxwing] setRawString => setConfString, getRawString => getConfString 4abd807 [zsxwing] Fix the test for 'set -v' 6e47e56 [zsxwing] Fix the compilation error 8973ced [zsxwing] Remove floatConf 1fc3a8b [zsxwing] Remove the 'conf' command and use 'set -v' instead 99c9c16 [zsxwing] Fix tests that use SQLConfEntry as a string 88a03cc [zsxwing] Add new lines between confs and return types ce7c6c8 [zsxwing] Remove seqConf f3c1b33 [zsxwing] Refactor SQLConf to display better error message
* [SPARK-8381][SQL]reuse typeConvert when convert Seq[Row] to catalyst type | Lianhui Wang | 2015-06-17 | 5 | -22/+12
| | | | | | | | | | | | | | reuse-typeConvert when convert Seq[Row] to CatalystType Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #6831 from lianhuiwang/reuse-typeConvert and squashes the following commits: 1fec395 [Lianhui Wang] remove CatalystTypeConverters.convertToCatalyst 714462d [Lianhui Wang] add package[sql] 9d1fbf3 [Lianhui Wang] address JoshRosen's comments 768956f [Lianhui Wang] update scala style 4498c62 [Lianhui Wang] reuse typeConvert
* [SPARK-8397] [SQL] Allow custom configuration for TestHive | Punya Biswal | 2015-06-17 | 1 | -1/+1
| | | | | | | | | | | | | | | | We encourage people to use TestHive in unit tests, because it's impossible to create more than one HiveContext within one process. The current implementation locks people into using a local[2] SparkContext underlying their HiveContext. We should make it possible to override this using a system property so that people can test against local-cluster or remote spark clusters to make their tests more realistic. Author: Punya Biswal <pbiswal@palantir.com> Closes #6844 from punya/feature/SPARK-8397 and squashes the following commits: 97ef394 [Punya Biswal] [SPARK-8397][SQL] Allow custom configuration for TestHive
* [SPARK-8306] [SQL] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state. | Yin Huai | 2015-06-17 | 4 | -3/+45
https://issues.apache.org/jira/browse/SPARK-8306

I will try to add a test later.

marmbrus aarondav

Author: Yin Huai <yhuai@databricks.com>
Closes #6758 from yhuai/SPARK-8306 and squashes the following commits:
1292346 [Yin Huai] [SPARK-8306] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.
* [SPARK-7067] [SQL] fix bug when use complex nested fields in ORDER BY | Wenchen Fan | 2015-06-17 | 5 | -66/+70
This PR is an improvement on https://github.com/apache/spark/pull/5189.

The resolution rule for ORDER BY is: first resolve based on what comes from the select clause, and fall back on its child only when this fails. There are two steps. First, try to resolve `Sort` in `ResolveReferences` based on the select clause, ignoring exceptions. Second, try to resolve `Sort` in `ResolveSortReferences` and add the missing projection.

However, the way we resolve `SortOrder` is wrong. We only resolve `UnresolvedAttribute` and use the result to decide whether `SortOrder` can be resolved. But `UnresolvedAttribute` is only part of a `GetField` chain (broken by `GetItem`), so we need to go through the whole chain to decide whether `SortOrder` can be resolved.

With this change, we can also avoid re-throwing the GetField exception in `CheckAnalysis`, which is a little ugly.

Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #5659 from cloud-fan/order-by and squashes the following commits:
cfa79f8 [Wenchen Fan] update test
3245d28 [Wenchen Fan] minor improve
465ee07 [Wenchen Fan] address comment
1fc41a2 [Wenchen Fan] fix SPARK-7067
* [SPARK-8010] [SQL] Promote types to StringType as implicit conversion in non-binary expression of HiveTypeCoercion | OopsOutOfMemory | 2015-06-17 | 2 | -3/+27
1. The query `select coalesce(null, 1, '1') from dual` causes an exception:
   java.lang.RuntimeException: Could not determine return type of Coalesce for IntegerType,StringType
2. The query `select case when true then 1 else '1' end from dual` causes an exception:
   java.lang.RuntimeException: Types in CASE WHEN must be the same or coercible to a common type: StringType != IntegerType

I checked the code; the main cause is that HiveTypeCoercion does not perform an implicit conversion when an IntegerType and a StringType are mixed. Numeric types can be promoted to string type, and Hive always performs this implicit conversion.

Author: OopsOutOfMemory <victorshengli@126.com>
Closes #6551 from OopsOutOfMemory/pnts and squashes the following commits:
7a209d7 [OopsOutOfMemory] rebase master
6018613 [OopsOutOfMemory] convert function to method
4cd5618 [OopsOutOfMemory] limit the data type to primitive type
df365d2 [OopsOutOfMemory] refine
95cbd58 [OopsOutOfMemory] fix style
403809c [OopsOutOfMemory] promote non-string to string when can not found tighestCommonTypeOfTwo
* [SPARK-8077] [SQL] Optimization for TreeNodes with large numbers of children | Michael Davies | 2015-06-17 | 2 | -12/+17
For example, large IN clauses.

Large IN clauses are parsed very slowly. For example, the SQL below (10K items in IN) takes 45-50s:

s"""SELECT * FROM Person WHERE ForeName IN ('${(1 to 10000).map("n" + _).mkString("','")}')"""

This is principally due to TreeNode, which repeatedly calls `contains` on children, where children in this case is a List that is 10K long. In effect, parsing large IN clauses is O(N squared). A lazily initialised Set based on children, used for `contains`, reduces parse time to around 2.5s.

Author: Michael Davies <Michael.BellDavies@gmail.com>
Closes #6673 from MickDavies/SPARK-8077 and squashes the following commits:
38cd425 [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
d80103b [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
e6be8be [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
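The idea behind SPARK-8077, replacing a linear `contains` over the child list with a lazily built Set, can be sketched on a toy tree node. `Node`, `childSet`, and `demo` are hypothetical names for illustration; this is not Spark's `TreeNode`.

```scala
// Toy tree node illustrating a lazily initialised Set of children; not Spark's TreeNode.
object ContainsChildSketch {
  final class Node(val name: String, val children: Seq[Node] = Nil) {
    // Built once, on first use; later membership checks are hash lookups.
    lazy val childSet: Set[Node] = children.toSet

    // O(N) scan per call: O(N^2) overall when invoked once per child.
    def containsChildSlow(n: Node): Boolean = children.contains(n)

    // Expected O(1) per call after the Set has been built.
    def containsChildFast(n: Node): Boolean = childSet.contains(n)
  }

  // Builds a node with `n` children; returns whether the last child and an
  // unrelated node are reported as children by the Set-based check.
  def demo(n: Int): (Boolean, Boolean) = {
    val leaves = (1 to n).map(i => new Node("n" + i))
    val in = new Node("IN", leaves)
    (in.containsChildFast(leaves.last), in.containsChildFast(new Node("other")))
  }

  def main(args: Array[String]): Unit = println(demo(10000))
}
```

Making the Set `lazy` means nodes that never perform a membership check pay no construction cost.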
* [SPARK-7199] [SQL] Add date and timestamp support to UnsafeRow | Liang-Chi Hsieh | 2015-06-17 | 4 | -4/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-7199 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #5984 from viirya/add_date_timestamp and squashes the following commits: 7f21ce9 [Liang-Chi Hsieh] For comment. 0b89698 [Liang-Chi Hsieh] Add timestamp to settableFieldTypes. c30d490 [Liang-Chi Hsieh] Use default IntUnsafeColumnWriter and LongUnsafeColumnWriter. 672ef17 [Liang-Chi Hsieh] Remove getter/setter for Date and Timestamp and use Int and Long for them. 9f3e577 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp 281e844 [Liang-Chi Hsieh] Fix scala style. fb532b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp 80af342 [Liang-Chi Hsieh] Fix compiling error. f4f5de6 [Liang-Chi Hsieh] Fix scala style. a463e83 [Liang-Chi Hsieh] Use Long to store timestamp for rows. 635388a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp 46946c6 [Liang-Chi Hsieh] Adapt for moved DateUtils. b16994e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp 752251f [Liang-Chi Hsieh] Support setDate. Fix failed test. fcf8db9 [Liang-Chi Hsieh] Add functions for Date and Timestamp to SpecificRow. e42a809 [Liang-Chi Hsieh] Fix style. 4c07b57 [Liang-Chi Hsieh] Add date and timestamp support to UnsafeRow.
* [SPARK-8220][SQL]Add positive identify function | dragonli | 2015-06-16 | 4 | -0/+27
chenghao-intel adrian-wang

Author: dragonli <lisurprise@gmail.com>
Author: zhichao.li <zhichao.li@intel.com>
Closes #6838 from zhichao-li/positive and squashes the following commits:
e1032a0 [dragonli] remove useless import and refactor code
624d438 [zhichao.li] add positive identify function
* [SPARK-8156] [SQL] create table to specific database by 'use dbname' | baishuo | 2015-06-16 | 3 | -5/+20
When I tested the following code:

    hiveContext.sql("""use testdb""")
    val df = (1 to 3).map(i => (i, s"val_$i", i * 2)).toDF("a", "b", "c")
    df.write
      .format("parquet")
      .mode(SaveMode.Overwrite)
      .saveAsTable("ttt3")
    hiveContext.sql("show TABLES in default")

I found that the table ttt3 is created under the database "default".

Author: baishuo <vc_java@hotmail.com>
Closes #6695 from baishuo/SPARK-8516-use-database and squashes the following commits:
9e155f9 [baishuo] remove no use comment
cb9f027 [baishuo] modify testcase
00a7a2d [baishuo] modify testcase
4df48c7 [baishuo] modify testcase
b742e69 [baishuo] modify testcase
3d19ad9 [baishuo] create table to specific database
* [SQL] [DOC] improved a comment | Radek Ostrowski | 2015-06-16 | 1 | -1/+1
[SQL][DOC] I found it a bit confusing when I came across it for the first time in the docs

Author: Radek Ostrowski <dest.hawaii@gmail.com>
Author: radek <radek@radeks-MacBook-Pro-2.local>
Closes #6332 from radek1st/master and squashes the following commits:
dae3347 [Radek Ostrowski] fixed typo
c76bb3a [radek] improved a comment
* [SPARK-7184] [SQL] enable codegen by default | Davies Liu | 2015-06-15 | 21 | -81/+95
To get better performance out of the box, this PR turns on codegen by default, so that codegen is exercised by sql/test and hive/test. This PR also fixes some corner cases for codegen.

Before the 1.5 release, we should revisit this and turn it off if it is not stable or causes regressions.

cc rxin JoshRosen

Author: Davies Liu <davies@databricks.com>
Closes #6726 from davies/enable_codegen and squashes the following commits:
f3b25a5 [Davies Liu] fix warning
73750ea [Davies Liu] fix long overflow when compare
3017a47 [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
a7d75da [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
ff5b75a [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
f4cf2c2 [Davies Liu] fix style
99fc139 [Davies Liu] Merge branch 'enable_codegen' of github.com:davies/spark into enable_codegen
91fc7a2 [Davies Liu] disable codegen for ScalaUDF
207e339 [Davies Liu] Update CodeGenerator.scala
44573a3 [Davies Liu] check thread safety of expression
f3886fa [Davies Liu] don't inline primitiveTerm for null literal
c8e7cd2 [Davies Liu] address comment
a8618c9 [Davies Liu] enable codegen by default
* SPARK-8336 Fix NullPointerException with functions.rand() | tedyu | 2015-06-15 | 2 | -1/+38
This PR fixes the problem reported by Justin Yip in the thread 'NullPointerException with functions.rand()'.

Tested using spark-shell and verified that the following works:

    sqlContext.createDataFrame(Seq((1,2), (3, 100))).withColumn("index", rand(30)).show()

Author: tedyu <yuzhihong@gmail.com>
Closes #6793 from tedyu/master and squashes the following commits:
62fd97b [tedyu] Create RandomSuite
750f92c [tedyu] Add test for Rand() with seed
a1d66c5 [tedyu] Fix NullPointerException with functions.rand()
* [SPARK-6583] [SQL] Support aggregate functions in ORDER BY | Yadong Qi | 2015-06-15 | 2 | -3/+61
Add aggregates in ORDER BY clauses to the `Aggregate` operator beneath. Project these results away after the Sort.

Based on work by watermen. Also Closes #5290.

Author: Yadong Qi <qiyadong2010@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #6816 from marmbrus/pr/5290 and squashes the following commits:
3226a97 [Michael Armbrust] consistent ordering
eb8938d [Michael Armbrust] no vars
c8b25c1 [Yadong Qi] move the test data.
7f9b736 [Yadong Qi] delete Substring case
a1e87c1 [Yadong Qi] fix conflict
f119849 [Yadong Qi] order by aggregated function
* [SPARK-8065] [SQL] Add support for Hive 0.14 metastores | Marcelo Vanzin | 2015-06-14 | 6 | -294/+414
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change has two parts. The first one gets rid of "ReflectionMagic". That worked well for the differences between 0.12 and 0.13, but breaks in 0.14, since some of the APIs that need to be used have primitive types. I could not figure out a way to make that class work with primitive types. So instead I wrote some shims (I can already hear the collective sigh) that find the appropriate methods via reflection. This should be faster since the method instances are cached, and the code is not much uglier than before, with the advantage that all the ugliness is local to one file (instead of multiple switch statements on the version being used scattered in ClientWrapper). The second part is simple: add code to handle Hive 0.14. A few new methods had to be added to the new shims. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #6627 from vanzin/SPARK-8065 and squashes the following commits: 3fa4270 [Marcelo Vanzin] Indentation style. 4b8a3d4 [Marcelo Vanzin] Fix dep exclusion. be3d0cc [Marcelo Vanzin] Merge branch 'master' into SPARK-8065 ca3fb1e [Marcelo Vanzin] Merge branch 'master' into SPARK-8065 b43f13e [Marcelo Vanzin] Since exclusions seem to work, clean up some of the code. 73bd161 [Marcelo Vanzin] Botched merge. d2ddf01 [Marcelo Vanzin] Comment about excluded dep. 0c929d1 [Marcelo Vanzin] Merge branch 'master' into SPARK-8065 2c3c02e [Marcelo Vanzin] Try to fix tests by adding support for exclusions. 0a03470 [Marcelo Vanzin] Try to fix tests by upgrading calcite dependency. 13b2dfa [Marcelo Vanzin] Fix NPE. 6439d88 [Marcelo Vanzin] Minor style thing. 69b017b [Marcelo Vanzin] Style. a21cad8 [Marcelo Vanzin] Part II: Add shims / version for Hive 0.14. ae98c87 [Marcelo Vanzin] PART I: Get rid of reflection magic.
* [SPARK-8362] [SQL] Add unit tests for +, -, *, /, % | Reynold Xin | 2015-06-14 | 2 | -105/+99
Added unit tests for all supported data types for:
- Add
- Subtract
- Multiply
- Divide
- UnaryMinus
- Remainder

Fixed bugs caught by the unit tests.

Author: Reynold Xin <rxin@databricks.com>
Closes #6813 from rxin/SPARK-8362 and squashes the following commits:
fb3fe62 [Reynold Xin] Added Remainder.
3b266ba [Reynold Xin] [SPARK-8362] Add unit tests for +, -, *, /.
* [SPARK-8358] [SQL] Wait for child resolution when resolving generators | Michael Armbrust | 2015-06-14 | 2 | -2/+12
Author: Michael Armbrust <michael@databricks.com>
Closes #6811 from marmbrus/aliasExplodeStar and squashes the following commits:
fbd2065 [Michael Armbrust] more style
806a373 [Michael Armbrust] fix style
7cbb530 [Michael Armbrust] [SPARK-8358][SQL] Wait for child resolution when resolving generatorsa
* [SPARK-8354] [SQL] Fix off-by-factor-of-8 error when allocating scratch space in UnsafeFixedWidthAggregationMap | Josh Rosen | 2015-06-14 | 2 | -16/+16
UnsafeFixedWidthAggregationMap contains an off-by-factor-of-8 error when allocating row conversion scratch space: we take a size requirement, measured in bytes, then allocate a long array of that size. This means that we end up allocating 8x too much conversion space.

This patch fixes this by allocating a `byte[]` array instead. This doesn't impose any new limitations on the maximum sizes of UnsafeRows, since UnsafeRowConverter already used integers when calculating the size requirements for rows.

Author: Josh Rosen <joshrosen@databricks.com>
Closes #6809 from JoshRosen/sql-bytes-vs-words-fix and squashes the following commits:
6520339 [Josh Rosen] Updates to reflect fact that UnsafeRow max size is constrained by max byte[] size
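The factor of 8 in SPARK-8354 follows directly from element width: a `long` occupies 8 bytes, so a `long[]` whose length equals a byte count holds 8 times the requested payload. A minimal sketch with hypothetical helper names (array object headers are ignored):

```scala
// Hypothetical helpers; payload sizes only, JVM array headers are not counted.
object ScratchSpaceSketch {
  // The bug pattern: using a size in bytes as the length of a long array.
  def payloadBytesWithLongArray(sizeInBytes: Int): Long =
    new Array[Long](sizeInBytes).length.toLong * 8L

  // The fix pattern: using the same size as the length of a byte array.
  def payloadBytesWithByteArray(sizeInBytes: Int): Long =
    new Array[Byte](sizeInBytes).length.toLong

  def main(args: Array[String]): Unit = {
    println(payloadBytesWithLongArray(1024))
    println(payloadBytesWithByteArray(1024))
  }
}
```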
* [SPARK-8342][SQL] Fix Decimal setOrNull | Liang-Chi Hsieh | 2015-06-13 | 2 | -1/+7
JIRA: https://issues.apache.org/jira/browse/SPARK-8342

Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #6797 from viirya/fix_decimal and squashes the following commits:
8a447b1 [Liang-Chi Hsieh] Add unit test.
d67a5ea [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal
ab6d8af [Liang-Chi Hsieh] Fix setOrNull.
* [SPARK-8349] [SQL] Use expression constructors (rather than apply) in FunctionRegistry | Reynold Xin | 2015-06-13 | 5 | -32/+22
Author: Reynold Xin <rxin@databricks.com>
Closes #6806 from rxin/gs and squashes the following commits:
ed1aebb [Reynold Xin] Fixed style.
c7fc3e6 [Reynold Xin] [SPARK-8349][SQL] Use expression constructors (rather than apply) in FunctionRegistry
* [SPARK-8347][SQL] Add unit tests for abs. | Reynold Xin | 2015-06-13 | 5 | -33/+31
Also addressed code review feedback from #6754.

Author: Reynold Xin <rxin@databricks.com>
Closes #6803 from rxin/abs and squashes the following commits:
d07beba [Reynold Xin] [SPARK-8347] Add unit tests for abs.
* [SPARK-8052] [SQL] Use java.math.BigDecimal for casting String to Decimal instead of using toDouble | Liang-Chi Hsieh | 2015-06-13 | 2 | -1/+6
JIRA: https://issues.apache.org/jira/browse/SPARK-8052

Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #6645 from viirya/cast_string_integraltype and squashes the following commits:
e19c6a3 [Liang-Chi Hsieh] For comment.
c3e472a [Liang-Chi Hsieh] Add test.
7ced9b0 [Liang-Chi Hsieh] Use java.math.BigDecimal for casting String to Decimal instead of using toDouble.
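Why going through `toDouble` is lossy for a String-to-Decimal cast can be seen with a value that has more significant digits than a `Double` can carry. This is a sketch with hypothetical helper names, not the code from the commit.

```scala
// Hypothetical helpers contrasting the two conversion routes.
object StringToDecimalSketch {
  // Lossy route: parse as Double first, then convert to a decimal.
  def viaDouble(s: String): java.math.BigDecimal =
    java.math.BigDecimal.valueOf(s.toDouble)

  // Exact route: parse the decimal digits directly.
  def viaBigDecimal(s: String): java.math.BigDecimal =
    new java.math.BigDecimal(s)

  def main(args: Array[String]): Unit = {
    val s = "12345678901234567890.123456789"
    println(viaDouble(s).toPlainString)
    println(viaBigDecimal(s).toPlainString)
  }
}
```

A `Double` holds roughly 15 to 17 significant decimal digits, so any longer input is rounded before it ever reaches the decimal type.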
* [SPARK-8319] [CORE] [SQL] Update logic related to key orderings in shuffle dependencies  Josh Rosen  2015-06-13  1  -8/+11
|
| This patch updates two pieces of logic that are related to handling of keyOrderings in ShuffleDependencies:
|
| - The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever the shuffle dependency specifies a key ordering, but technically we only need to fall back when an aggregator is also specified. This patch updates the fallback logic to reflect this so that the Tungsten optimizations can apply to more workloads.
|
| - The SQL Exchange operator performs defensive copying of shuffle inputs when a key ordering is specified, but this is unnecessary. The copying was added to guard against cases where ExternalSorter would buffer non-serialized records in memory. When ExternalSorter is configured without an aggregator, it uses the following logic to determine whether to buffer records in a serialized or deserialized format:
|
|   ```scala
|   private val useSerializedPairBuffer =
|     ordering.isEmpty &&
|       conf.getBoolean("spark.shuffle.sort.serializeMapOutputs", true) &&
|       ser.supportsRelocationOfSerializedObjects
|   ```
|
|   The `newOrdering.isDefined` branch in `ExternalSorter.needToCopyObjectsBeforeShuffle`, removed by this patch, is not necessary:
|
|   - It was checked even if we weren't using sort-based shuffle, but this was unnecessary because only SortShuffleManager performs map-side sorting.
|   - Map-side sorting during shuffle writing is only performed for shuffles that perform map-side aggregation as part of the shuffle (to see this, look at how SortShuffleWriter constructs ExternalSorter). Since SQL never pushes aggregation into Spark's shuffle, we can guarantee that both the aggregator and ordering will be empty and Spark SQL always uses serializers that support relocation, so sort-shuffle will use the serialized pair buffer unless the user has explicitly disabled it via the SparkConf feature-flag. Therefore, I think my optimization in Exchange should be safe.
|
| Author: Josh Rosen <joshrosen@databricks.com>
|
| Closes #6773 from JoshRosen/SPARK-8319 and squashes the following commits:
|
| 7a14129 [Josh Rosen] Revise comments; add handler to guard against future ShuffleManager implementations
| 07bb2c9 [Josh Rosen] Update comment to clarify circumstances under which shuffle operates on serialized records
| 269089a [Josh Rosen] Avoid unnecessary copy in SQL Exchange
| 34e526e [Josh Rosen] Enable Tungsten shuffle for non-agg shuffles w/ key orderings
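The reasoning in the SPARK-8319 message can be traced with a small standalone model. This is only a sketch under stated assumptions: `ShuffleBufferSketch`, `ShuffleInputs`, `sorterSeesOrdering` and `needsDefensiveCopy` are hypothetical names invented for illustration, not Spark classes or methods, and the case with map-side aggregation is answered conservatively because the quoted rule does not cover it.

```scala
object ShuffleBufferSketch {
  // Hypothetical summary of the shuffle settings discussed in the commit message.
  final case class ShuffleInputs(
      mapSideCombine: Boolean,
      hasKeyOrdering: Boolean,
      serializeMapOutputs: Boolean = true, // models spark.shuffle.sort.serializeMapOutputs
      serializerSupportsRelocation: Boolean = true)

  // Per the message, the map-side sorter is only handed the key ordering
  // when the shuffle also performs map-side aggregation.
  def sorterSeesOrdering(in: ShuffleInputs): Boolean =
    in.mapSideCombine && in.hasKeyOrdering

  // Mirrors the quoted useSerializedPairBuffer rule for a sorter without an aggregator.
  // Assumption: when map-side aggregation is requested we conservatively return false.
  def useSerializedPairBuffer(in: ShuffleInputs): Boolean =
    !in.mapSideCombine &&
      !sorterSeesOrdering(in) &&
      in.serializeMapOutputs &&
      in.serializerSupportsRelocation

  // A defensive copy is only needed if records may be buffered in deserialized form.
  def needsDefensiveCopy(in: ShuffleInputs): Boolean =
    !useSerializedPairBuffer(in)
}
```

In this model a SQL-style shuffle (no map-side aggregation, relocatable serializer) with a key ordering still uses the serialized buffer, so no copy is needed unless the serialization flag is turned off.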
* [SPARK-8346] [SQL] Use InternalRow instead of catalyst.InternalRow  Davies Liu  2015-06-13  22  -183/+176
|
| cc rxin marmbrus
|
| Author: Davies Liu <davies@databricks.com>
|
| Closes #6802 from davies/cleanup_internalrow and squashes the following commits:
|
| 769d2aa [Davies Liu] remove not needed cast
| 4acbbe4 [Davies Liu] catalyst.Internal -> InternalRow
* [SPARK-7897] Improve type for jdbc/"unsigned bigint"  Rene Treffer  2015-06-13  1  -1/+1
|
| The original fix uses DecimalType.Unlimited, which is harder to handle afterwards.
|
| There is no scale and most data should fit into a long, thus DecimalType(20,0) should be better.
|
| Author: Rene Treffer <treffer@measite.de>
|
| Closes #6789 from rtreffer/spark-7897-unsigned-bigint-as-decimal and squashes the following commits:
|
| 2006613 [Rene Treffer] Fix type for "unsigned bigint" jdbc loading.
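The choice of precision 20 and scale 0 can be checked with plain arithmetic. The sketch below uses only the Scala standard library and assumes that an unsigned bigint is a 64-bit unsigned integer; `UnsignedBigintSketch` and `fitsPrecision` are illustrative names, not part of Spark.

```scala
object UnsignedBigintSketch {
  // Largest value of a 64-bit unsigned integer: 2^64 - 1.
  val maxUnsignedBigint: BigInt = (BigInt(1) << 64) - 1

  // Number of decimal digits needed to print that value.
  val digitsNeeded: Int = maxUnsignedBigint.toString.length

  // The maximum does not fit into a signed 64-bit long.
  val maxFitsInLong: Boolean = maxUnsignedBigint <= BigInt(Long.MaxValue)

  // Hypothetical helper: does an integer value fit a decimal with the given precision and scale 0?
  def fitsPrecision(value: BigInt, precision: Int): Boolean =
    value.abs.toString.length <= precision
}
```

Twenty digits cover the whole unsigned range, while nineteen do not, which is why a fixed `DecimalType(20,0)` is sufficient without resorting to an unlimited decimal.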
* [SPARK-8329][SQL] Allow _ in DataSource options  Michael Armbrust  2015-06-12  2  -4/+18
|
| Author: Michael Armbrust <michael@databricks.com>
|
| Closes #6786 from marmbrus/optionsParser and squashes the following commits:
|
| e7d18ef [Michael Armbrust] add dots
| 99a3452 [Michael Armbrust] [SPARK-8329][SQL] Allow _ in DataSource options
* [SPARK-7186] [SQL] Decouple internal Row from external Row  Davies Liu  2015-06-12  132  -973/+1160
|
| Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what the internal needs because it is designed to facilitate end-user programming. This design has proven to be very error prone and cumbersome for internal Row implementations.
|
| As a first step, we create an InternalRow interface in the catalyst module, which is identical to the current Row interface. And we switch all internal operators/expressions to use this InternalRow instead. When we need to expose Row, we convert the InternalRow implementation into Row for users.
|
| For all public APIs, we use Row (for example, data source APIs), which will be converted into/from InternalRow by CatalystTypeConverters.
|
| For all internal data sources (Json, Parquet, JDBC, Hive), we use InternalRow for better performance, cast to Row in buildScan() (without changing the public API). When creating a PhysicalRDD, we cast them back to InternalRow.
|
| cc rxin marmbrus JoshRosen
|
| Author: Davies Liu <davies@databricks.com>
|
| Closes #6792 from davies/internal_row and squashes the following commits:
|
| f2abd13 [Davies Liu] fix scalastyle
| a7e025c [Davies Liu] move InternalRow into catalyst
| 30db8ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into internal_row
| 7cbced8 [Davies Liu] separate Row and InternalRow
* [SPARK-7993] [SQL] Improved DataFrame.show() output  akhilthatipamula  2015-06-12  2  -6/+75
|
| Closes #6633
|
| Author: akhilthatipamula <130050068@iitb.ac.in>
| Author: zsxwing <zsxwing@gmail.com>
|
| Closes #6784 from zsxwing/pr6633 and squashes the following commits:
|
| 5da1c51 [zsxwing] Address comments and add unit tests
| 17eab7b [akhilthatipamula] refactored code
| 19874b3 [akhilthatipamula] Update DataFrame.scala
| 0a76a5e [akhilthatipamula] Optimised showString()
| e3dd03f [akhilthatipamula] Modified showString() method
| a21012b [akhilthatipamula] improved the show()
| 4bb742f [akhilthatipamula] Modified dataframe.show() method
* [SQL] [MINOR] correct semanticEquals logic  Wenchen Fan  2015-06-12  4  -22/+25
|
| This is a follow-up to https://github.com/apache/spark/pull/6173. For expressions like `Coalesce` that have a `Seq[Expression]`, a semantic equality check on the expression must also run the semantic equality check on all of its children. Also, we can just use `Seq[(Expression, NamedExpression)]` instead of `Map[Expression, NamedExpression]`, as we only search it with `find`.
|
| chenghao-intel, I agree that we probably can never know `semanticEquals` in a general way, but I think we have done that in `TreeNode`, so we can use similar logic. Then we can handle something like `Coalesce(children: Seq[Expression])` correctly.
|
| Author: Wenchen Fan <cloud0fan@outlook.com>
|
| Closes #6261 from cloud-fan/tmp and squashes the following commits:
|
| 4daef88 [Wenchen Fan] address comments
| dd8fbd9 [Wenchen Fan] correct semanticEquals
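The two points in the semanticEquals message (recursing into a `Seq` of children, and searching a list of pairs with `find` instead of a `Map`) can be illustrated with a toy expression tree. This is a sketch, not Catalyst code: `Expr`, `Attr`, `Coalesce` and `lookup` are simplified stand-ins, and treating two attributes as semantically equal when their ids match is an assumption made for the example.

```scala
object SemanticEqualsSketch {
  sealed trait Expr { def children: Seq[Expr] }

  // Toy attribute: the qualifier is cosmetic, the id identifies the column (assumption).
  final case class Attr(name: String, exprId: Long, qualifier: Option[String] = None) extends Expr {
    def children: Seq[Expr] = Nil
  }

  // Toy expression whose children live in a Seq, like Coalesce in the message.
  final case class Coalesce(children: Seq[Expr]) extends Expr

  def semanticEquals(a: Expr, b: Expr): Boolean = (a, b) match {
    case (x: Attr, y: Attr) => x.exprId == y.exprId
    case (Coalesce(xs), Coalesce(ys)) =>
      // Compare every child semantically instead of relying on structural equality.
      xs.length == ys.length && xs.zip(ys).forall { case (l, r) => semanticEquals(l, r) }
    case _ => false
  }

  // A Seq of pairs is enough because the only lookup is a linear find
  // with a custom equality, which a hash-based Map could not use anyway.
  def lookup(bindings: Seq[(Expr, String)], e: Expr): Option[String] =
    bindings.find { case (k, _) => semanticEquals(k, e) }.map(_._2)
}
```

Here two `Coalesce` nodes whose children differ only in qualifier are not equal under `==`, but they are found by `lookup`, which is the behaviour the message asks for.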
* [SPARK-6566] [SQL] Related changes for newer parquet version  Yash Datta  2015-06-12  2  -185/+44
|
| This brings a major improvement: footers are no longer read on the driver. This also cleans up the code in parquetTableOperations, where we had to override getSplits to eliminate multiple listStatus calls.
|
| cc liancheng: are there any other changes we need for this?
|
| Author: Yash Datta <Yash.Datta@guavus.com>
|
| Closes #5889 from saucam/parquet_1.6 and squashes the following commits:
|
| d1bf41e [Yash Datta] SPARK-7340: Fix scalastyle and incorporate review comments
| c9aa042 [Yash Datta] SPARK-7340: Use the new user defined filter predicate for pushing down inset into parquet
| 56bc750 [Yash Datta] SPARK-7340: Change parquet version to latest release
* [SPARK-7862] [SQL] Fix the deadlock in script transformation for stderr  zhichao.li  2015-06-11  2  -2/+17
|
| [Related PR SPARK-7044] (https://github.com/apache/spark/pull/5671)
|
| Author: zhichao.li <zhichao.li@intel.com>
|
| Closes #6404 from zhichao-li/transform and squashes the following commits:
|
| 8418c97 [zhichao.li] add comments and remove useless failAfter logic
| d9677e1 [zhichao.li] redirect the error destination to be the same as the current process
* [SPARK-8317] [SQL] Do not push sort into shuffle in Exchange operator  Josh Rosen  2015-06-11  3  -54/+24
|
| In some cases, Spark SQL pushes sorting operations into the shuffle layer by specifying a key ordering as part of the shuffle dependency. I think that we should not do this:
|
| - Since we do not delegate aggregation to Spark's shuffle, specifying the keyOrdering as part of the shuffle has no effect on the shuffle map side.
| - By performing the sort ourselves (by inserting a sort operator after the shuffle instead), we can use the Exchange planner to choose specialized sorting implementations based on the types of rows being sorted.
| - We can remove some complexity from SqlSerializer2 by not requiring it to know about sort orderings, since SQL's own sort operators will already perform the necessary defensive copying.
|
| This patch removes Exchange's `canSortWithShuffle` path and the associated code in `SqlSerializer2`. Shuffles that used to go through the `canSortWithShuffle` path would always wind up using Spark's `ExternalSorter` (inside of `HashShuffleReader`); to avoid a performance regression as a result of handling these shuffles ourselves, I've changed the SQLConf defaults so that external sorting is enabled by default.
|
| Author: Josh Rosen <joshrosen@databricks.com>
|
| Closes #6772 from JoshRosen/SPARK-8317 and squashes the following commits:
|
| ebf9c0f [Josh Rosen] Do not push sort into shuffle in Exchange operator
| bf3b4c8 [Josh Rosen] Enable external sort by default
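The split described in the SPARK-8317 message, where the shuffle only moves rows and a separate operator sorts them afterwards, can be pictured with ordinary collections. This is a toy model under stated assumptions: `shuffle` and `sortPartitions` are hypothetical helpers, hash partitioning by `hashCode` is chosen only for the example, and nothing here uses Spark's API.

```scala
object SortAfterShuffleSketch {
  // Step 1: the "shuffle" only routes records to partitions and knows nothing about ordering.
  def shuffle[K, V](records: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] = {
    val grouped = records.groupBy { case (k, _) => Math.floorMod(k.hashCode, numPartitions) }
    Vector.tabulate(numPartitions)(p => grouped.getOrElse(p, Seq.empty))
  }

  // Step 2: a separate sort step runs per partition after the shuffle,
  // so the sorting strategy can be chosen independently of the shuffle.
  def sortPartitions[K: Ordering, V](parts: Vector[Seq[(K, V)]]): Vector[Seq[(K, V)]] =
    parts.map(_.sortBy(_._1))
}
```

Because the ordering only appears in the second step, the first step needs no knowledge of it, which corresponds to the simplification the message describes for the serializer.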
* [SPARK-7158] [SQL] Fix bug of cached data cannot be used in collect() after cache()  Cheng Hao  2015-06-11  2  -1/+27
|
| When the df.cache() method is called, the `withCachedData` of `QueryExecution` has already been created, which means it will not look up the cached tables when an action method is called afterward.
|
| Author: Cheng Hao <hao.cheng@intel.com>
|
| Closes #5714 from chenghao-intel/SPARK-7158 and squashes the following commits:
|
| 58ea8aa [Cheng Hao] style issue
| 2bf740f [Cheng Hao] create new QueryExecution instance for CacheManager
| a5647d9 [Cheng Hao] hide the queryExecution of DataFrame
| fbfd3c5 [Cheng Hao] make the DataFrame.queryExecution mutable for cache/persist/unpersist
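The failure mode in the SPARK-7158 message is that of a value computed once and never refreshed. The sketch below reproduces it with toy classes; they borrow the names `CacheManager`, `QueryExecution` and `withCachedData` from the message but are hypothetical simplifications, and modelling `withCachedData` as a `lazy val` is an assumption made for illustration.

```scala
object StaleLazyPlanSketch {
  import scala.collection.mutable

  // Toy cache registry: plans are plain strings here.
  final class CacheManager {
    private val cached = mutable.Set.empty[String]
    def register(plan: String): Unit = cached += plan
    def lookup(plan: String): String = if (cached(plan)) s"InMemory($plan)" else plan
  }

  // Toy query execution: the cache lookup happens once, on first access.
  final class QueryExecution(plan: String, cacheManager: CacheManager) {
    lazy val withCachedData: String = cacheManager.lookup(plan)
  }
}
```

If `withCachedData` is forced before the plan is registered, the same instance keeps returning the uncached plan, whereas a fresh instance created after registration sees the cache. That matches the squashed commit "create new QueryExecution instance for CacheManager" in spirit, though the real change may differ in detail.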
* [SQL] Miscellaneous SQL/DF expression changes.  Reynold Xin  2015-06-11  7  -27/+175
|
| SPARK-8201 conditional function: if
| SPARK-8205 conditional function: nvl
| SPARK-8208 math function: ceiling
| SPARK-8210 math function: degrees
| SPARK-8211 math function: radians
| SPARK-8219 math function: negative
| SPARK-8216 math function: rename log -> ln
| SPARK-8222 math function: alias power / pow
| SPARK-8225 math function: alias sign / signum
| SPARK-8228 conditional function: isnull
| SPARK-8229 conditional function: isnotnull
| SPARK-8250 string function: alias lower/lcase
| SPARK-8251 string function: alias upper / ucase
|
| Author: Reynold Xin <rxin@databricks.com>
|
| Closes #6754 from rxin/expressions-misc and squashes the following commits:
|
| 35fce15 [Reynold Xin] Removed println.
| 2647067 [Reynold Xin] Promote to string type.
| 3c32bbc [Reynold Xin] Fixed if.
| de827ac [Reynold Xin] Fixed style
| b201cd4 [Reynold Xin] Removed if.
| 6b21a9b [Reynold Xin] [SQL] Miscellaneous SQL/DF expression changes.