aboutsummaryrefslogtreecommitdiff
path: root/sql
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-9244] Increase some memory defaultsMatei Zaharia2015-07-221-2/+2
| | | | | | | | | | | | | | | | | | | There are a few memory limits that people hit often and that we could make higher, especially now that memory sizes have grown. - spark.akka.frameSize: This defaults at 10 but is often hit for map output statuses in large shuffles. This memory is not fully allocated up-front, so we can just make this larger and still not affect jobs that never sent a status that large. We increase it to 128. - spark.executor.memory: Defaults at 512m, which is really small. We increase it to 1g. Author: Matei Zaharia <matei@databricks.com> Closes #7586 from mateiz/configs and squashes the following commits: ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
* [SPARK-4366] [SQL] [Follow-up] Fix SqlParser compiling warning.Yin Huai2015-07-221-2/+1
| | | | | | | | Author: Yin Huai <yhuai@databricks.com> Closes #7588 from yhuai/SPARK-4366-update1 and squashes the following commits: 25f5f36 [Yin Huai] Fix SqlParser Warning.
* [SPARK-9024] Unsafe HashJoin/HashOuterJoin/HashSemiJoinDavies Liu2015-07-2219-132/+437
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR introduce unsafe version (using UnsafeRow) of HashJoin, HashOuterJoin and HashSemiJoin, including the broadcast one and shuffle one (except FullOuterJoin, which is better to be implemented using SortMergeJoin). It use HashMap to store UnsafeRow right now, will change to use BytesToBytesMap for better performance (in another PR). Author: Davies Liu <davies@databricks.com> Closes #7480 from davies/unsafe_join and squashes the following commits: 6294b1e [Davies Liu] fix projection 10583f1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join dede020 [Davies Liu] fix test 84c9807 [Davies Liu] address comments a05b4f6 [Davies Liu] support UnsafeRow in LeftSemiJoinBNL and BroadcastNestedLoopJoin 611d2ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join 9481ae8 [Davies Liu] return UnsafeRow after join() ca2b40f [Davies Liu] revert unrelated change 68f5cd9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join 0f4380d [Davies Liu] ada a comment 69e38f5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join 1a40f02 [Davies Liu] refactor ab1690f [Davies Liu] address comments 60371f2 [Davies Liu] use UnsafeRow in SemiJoin a6c0b7d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join 184b852 [Davies Liu] fix style 6acbb11 [Davies Liu] fix tests 95d0762 [Davies Liu] remove println bea4a50 [Davies Liu] Unsafe HashJoin
* [SPARK-9165] [SQL] codegen for CreateArray, CreateStruct and CreateNamedStructYijie Shen2015-07-222-5/+76
| | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-9165 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7537 from yjshen/array_struct_codegen and squashes the following commits: 3a6dce6 [Yijie Shen] use infix notion in createArray test 5e90f0a [Yijie Shen] resolve comments: classOf 39cefb8 [Yijie Shen] codegen for createArray createStruct & createNamedStruct
* [SPARK-9082] [SQL] Filter using non-deterministic expressions should not be ↵Wenchen Fan2015-07-222-11/+84
| | | | | | | | | | | | | | | pushed down Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7446 from cloud-fan/filter and squashes the following commits: 330021e [Wenchen Fan] add exists to tree node 2cab68c [Wenchen Fan] more enhance 949be07 [Wenchen Fan] push down part of predicate if possible 3912f84 [Wenchen Fan] address comments 8ce15ca [Wenchen Fan] fix bug 557158e [Wenchen Fan] Filter using non-deterministic expressions should not be pushed down
* [SPARK-4233] [SPARK-4367] [SPARK-3947] [SPARK-3056] [SQL] Aggregation ↵Yin Huai2015-07-2139-100/+3087
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improvement This is the first PR for the aggregation improvement, which is tracked by https://issues.apache.org/jira/browse/SPARK-4366 (umbrella JIRA). This PR contains work for its subtasks, SPARK-3056, SPARK-3947, SPARK-4233, and SPARK-4367. This PR introduces a new code path for evaluating aggregate functions. This code path is guarded by `spark.sql.useAggregate2` and by default the value of this flag is true. This new code path contains: * A new aggregate function interface (`AggregateFunction2`) and 7 built-int aggregate functions based on this new interface (`AVG`, `COUNT`, `FIRST`, `LAST`, `MAX`, `MIN`, `SUM`) * A UDAF interface (`UserDefinedAggregateFunction`) based on the new code path and two example UDAFs (`MyDoubleAvg` and `MyDoubleSum`). * A sort-based aggregate operator (`Aggregate2Sort`) for the new aggregate function interface . * A sort-based aggregate operator (`FinalAndCompleteAggregate2Sort`) for distinct aggregations (for distinct aggregations the query plan will use `Aggregate2Sort` and `FinalAndCompleteAggregate2Sort` together). With this change, `spark.sql.useAggregate2` is `true`, the flow of compiling an aggregation query is: 1. Our analyzer looks up functions and returns aggregate functions built based on the old aggregate function interface. 2. When our planner is compiling the physical plan, it tries try to convert all aggregate functions to the ones built based on the new interface. The planner will fallback to the old code path if any of the following two conditions is true: * code-gen is disabled. * there is any function that cannot be converted (right now, Hive UDAFs). * the schema of grouping expressions contain any complex data type. * There are multiple distinct columns. Right now, the new code path handles a single distinct column in the query (you can have multiple aggregate functions using that distinct column). For a query having a aggregate function with DISTINCT and regular aggregate functions, the generated plan will do partial aggregations for those regular aggregate function. Thanks chenghao-intel for his initial work on it. Author: Yin Huai <yhuai@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #7458 from yhuai/UDAF and squashes the following commits: 7865f5e [Yin Huai] Put the catalyst expression in the comment of the generated code for it. b04d6c8 [Yin Huai] Remove unnecessary change. f1d5901 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 35b0520 [Yin Huai] Use semanticEquals to replace grouping expressions in the output of the aggregate operator. 3b43b24 [Yin Huai] bug fix. 00eb298 [Yin Huai] Make it compile. a3ca551 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF e0afca3 [Yin Huai] Gracefully fallback to old aggregation code path. 8a8ac4a [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 88c7d4d [Yin Huai] Enable spark.sql.useAggregate2 by default for testing purpose. dc96fd1 [Yin Huai] Many updates: 85c9c4b [Yin Huai] newline. 43de3de [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF c3614d7 [Yin Huai] Handle single distinct column. 68b8ee9 [Yin Huai] Support single distinct column set. WIP 3013579 [Yin Huai] Format. d678aee [Yin Huai] Remove AggregateExpressionSuite.scala since our built-in aggregate functions will be based on AlgebraicAggregate and we need to have another way to test it. e243ca6 [Yin Huai] Add aggregation iterators. a101960 [Yin Huai] Change MyJavaUDAF to MyDoubleSum. 594cdf5 [Yin Huai] Change existing AggregateExpression to AggregateExpression1 and add an AggregateExpression as the common interface for both AggregateExpression1 and AggregateExpression2. 380880f [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 0a827b3 [Yin Huai] Add comments and doc. Move some classes to the right places. a19fea6 [Yin Huai] Add UDAF interface. 262d4c4 [Yin Huai] Make it compile. b2e358e [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 6edb5ac [Yin Huai] Format update. 70b169c [Yin Huai] Remove groupOrdering. 4721936 [Yin Huai] Add CheckAggregateFunction to extendedCheckRules. d821a34 [Yin Huai] Cleanup. 32aea9c [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 5b46d41 [Yin Huai] Bug fix. aff9534 [Yin Huai] Make Aggregate2Sort work with both algebraic AggregateFunctions and non-algebraic AggregateFunctions. 2857b55 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 4435f20 [Yin Huai] Add ConvertAggregateFunction to HiveContext's analyzer. 1b490ed [Michael Armbrust] make hive test 8cfa6a9 [Michael Armbrust] add test 1b0bb3f [Yin Huai] Do not bind references in AlgebraicAggregate and use code gen for all places. 072209f [Yin Huai] Bug fix: Handle expressions in grouping columns that are not attribute references. f7d9e54 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into UDAF 39ee975 [Yin Huai] Code cleanup: Remove unnecesary AttributeReferences. b7720ba [Yin Huai] Add an analysis rule to convert aggregate function to the new version. 5c00f3f [Michael Armbrust] First draft of codegen 6bbc6ba [Michael Armbrust] now with correct answers\! f7996d0 [Michael Armbrust] Add AlgebraicAggregate dded1c5 [Yin Huai] wip
* [SPARK-9232] [SQL] Duplicate code in JSONRelationAndrew Or2015-07-211-29/+21
| | | | | | | | Author: Andrew Or <andrew@databricks.com> Closes #7576 from andrewor14/clean-up-json-relation and squashes the following commits: ea80803 [Andrew Or] Clean up duplicate code
* [SPARK-9154][SQL] Rename formatString to format_string.Reynold Xin2015-07-215-42/+18
| | | | | | | | | | | | Also make format_string the canonical form, rather than printf. Author: Reynold Xin <rxin@databricks.com> Closes #7579 from rxin/format_strings and squashes the following commits: 53ee54f [Reynold Xin] Fixed unit tests. 52357e1 [Reynold Xin] Add format_string alias. b40a42a [Reynold Xin] [SPARK-9154][SQL] Rename formatString to format_string.
* [SPARK-9154] [SQL] codegen StringFormatTarek Auel2015-07-214-11/+70
| | | | | | | | | | | | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-9154 fixes bug of #7546 marmbrus I can't reopen the other PR, because I didn't closed it. Can you trigger Jenkins? Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7571 from tarekauel/SPARK-9154 and squashes the following commits: dcae272 [Tarek Auel] [SPARK-9154][SQL] build fix 1487602 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-9154 f512c5f [Tarek Auel] [SPARK-9154][SQL] build fix a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives 10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format 086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
* [SPARK-9206] [SQL] Fix HiveContext classloading for GCS connector.Dennis Huo2015-07-211-1/+1
| | | | | | | | | | | | | | | | | IsolatedClientLoader.isSharedClass includes all of com.google.\*, presumably for Guava, protobuf, and/or other shared Google libraries, but needs to count com.google.cloud.\* as "hive classes" when determining which ClassLoader to use. Otherwise, things like HiveContext.parquetFile will throw a ClassCastException when fs.defaultFS is set to a Google Cloud Storage (gs://) path. On StackOverflow: http://stackoverflow.com/questions/31478955 EDIT: Adding yhuai who worked on the relevant classloading isolation pieces. Author: Dennis Huo <dhuo@google.com> Closes #7549 from dennishuo/dhuo-fix-hivecontext-gcs and squashes the following commits: 1f8db07 [Dennis Huo] Fix HiveContext classloading for GCS connector.
* [SPARK-8906][SQL] Move all internal data source classes into ↵Reynold Xin2015-07-2131-62/+77
| | | | | | | | | | | | | | execution.datasources. This way, the sources package contains only public facing interfaces. Author: Reynold Xin <rxin@databricks.com> Closes #7565 from rxin/move-ds and squashes the following commits: 7661aff [Reynold Xin] Mima 9d5196a [Reynold Xin] Rearranged imports. 3dd7174 [Reynold Xin] [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
* [SPARK-8357] Fix unsafe memory leak on empty inputs in GeneratedAggregatenavis.ryu2015-07-213-1/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a managed memory leak in GeneratedAggregate. The leak occurs when the unsafe aggregation path is used to perform grouped aggregation on an empty input; in this case, GeneratedAggregate allocates an UnsafeFixedWidthAggregationMap that is never cleaned up because `next()` is never called on the aggregate result iterator. This patch fixes this by short-circuiting on empty inputs. This patch is an updated version of #6810. Closes #6810. Author: navis.ryu <navis@apache.org> Author: Josh Rosen <joshrosen@databricks.com> Closes #7560 from JoshRosen/SPARK-8357 and squashes the following commits: 3486ce4 [Josh Rosen] Some minor cleanup c649310 [Josh Rosen] Revert SparkPlan change: 3c7db0f [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-8357 adc8239 [Josh Rosen] Back out Projection changes. c5419b3 [navis.ryu] addressed comments 143e1ef [navis.ryu] fixed format & added test for CCE case 735972f [navis.ryu] used new conf apis 1a02a55 [navis.ryu] Rolled-back test-conf cleanup & fixed possible CCE & added more tests 51178e8 [navis.ryu] addressed comments 4d326b9 [navis.ryu] fixed test fails 15c5afc [navis.ryu] added a test as suggested by JoshRosen d396589 [navis.ryu] added comments 1b07556 [navis.ryu] [SPARK-8357] [SQL] Memory leakage on unsafe aggregation path with empty input
* Revert "[SPARK-9154] [SQL] codegen StringFormat"Michael Armbrust2015-07-213-59/+11
| | | | | | | | | | | | This reverts commit 7f072c3d5ec50c65d76bd9f28fac124fce96a89e. Revert #7546 Author: Michael Armbrust <michael@databricks.com> Closes #7570 from marmbrus/revert9154 and squashes the following commits: ed2c32a [Michael Armbrust] Revert "[SPARK-9154] [SQL] codegen StringFormat"
* [SPARK-9154] [SQL] codegen StringFormatTarek Auel2015-07-213-11/+59
| | | | | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-9154 Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7546 from tarekauel/SPARK-9154 and squashes the following commits: a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives 10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format 086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
* [SPARK-9081] [SPARK-9168] [SQL] nanvl & dropna/fillna supporting nan as wellYijie Shen2015-07-219-88/+222
| | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-9081 https://issues.apache.org/jira/browse/SPARK-9168 This PR target at two modifications: 1. Change `isNaN` to return `false` on `null` input 2. Make `dropna` and `fillna` to fill/drop NaN values as well 3. Implement `nanvl` Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7523 from yjshen/fillna_dropna and squashes the following commits: f0a51db [Yijie Shen] make coalesce untouched and implement nanvl 1d3e35f [Yijie Shen] make Coalesce aware of NaN in order to support fillna 2760cbc [Yijie Shen] change isNaN(null) to false as well as implement dropna
* [SPARK-9173][SQL]UnionPushDown should also support Intersect and ExceptYijie Shen2015-07-213-70/+120
| | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-9173 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7540 from yjshen/union_pushdown and squashes the following commits: 278510a [Yijie Shen] rename UnionPushDown to SetOperationPushDown 91741c1 [Yijie Shen] Add UnionPushDown support for intersect and except
* [SPARK-8230][SQL] Add array/map size methodPedro Rodriguez2015-07-215-1/+137
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230 Primary issue resolved is to implement array/map size for Spark SQL. Code is ready for review by a committer. Chen Hao is on the JIRA ticket, but I don't know his username on github, rxin is also on JIRA ticket. Things to review: 1. Where to put added functions namespace wise, they seem to be part of a few operations on collections which includes `sort_array` and `array_contains`. Hence the name given `collectionOperations.scala` and `_collection_functions` in python. 2. In Python code, should it be in a `1.5.0` function array or in a collections array? 3. Are there any missing methods on the `Size` case class? Looks like many of these functions have generated Java code, is that also needed in this case? 4. Something else? Author: Pedro Rodriguez <ski.rodriguez@gmail.com> Author: Pedro Rodriguez <prodriguez@trulia.com> Closes #7462 from EntilZha/SPARK-8230 and squashes the following commits: 9a442ae [Pedro Rodriguez] fixed functions and sorted __all__ 9aea3bb [Pedro Rodriguez] removed imports from python docs 15d4bf1 [Pedro Rodriguez] Added null test case and changed to nullSafeCodeGen d88247c [Pedro Rodriguez] removed python code bd5f0e4 [Pedro Rodriguez] removed duplicate function from rebase/merge 59931b4 [Pedro Rodriguez] fixed compile bug instroduced when merging c187175 [Pedro Rodriguez] updated code to add size to __all__ directly and removed redundent pretty print 130839f [Pedro Rodriguez] fixed failing test aa9bade [Pedro Rodriguez] fix style e093473 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests 0449377 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations 9a1a2ff [Pedro Rodriguez] added unit tests for map size 2bfbcb6 [Pedro Rodriguez] added unit test for size 20df2b4 [Pedro Rodriguez] Finished working version of size function and added it to python b503e75 [Pedro Rodriguez] First attempt at implementing size for maps and arrays 99a6a5c [Pedro Rodriguez] fixed failing test cac75ac [Pedro Rodriguez] fix style 933d843 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests 42bb7d4 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations f9c3b8a [Pedro Rodriguez] added unit tests for map size 2515d9f [Pedro Rodriguez] added documentation 0e60541 [Pedro Rodriguez] added unit test for size acf9853 [Pedro Rodriguez] Finished working version of size function and added it to python 84a5d38 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
* [SPARK-8255] [SPARK-8256] [SQL] Add regex_extract/regex_replaceCheng Hao2015-07-217-4/+293
| | | | | | | | | | | | | Add expressions `regex_extract` & `regex_replace` Author: Cheng Hao <hao.cheng@intel.com> Closes #7468 from chenghao-intel/regexp and squashes the following commits: e5ea476 [Cheng Hao] minor update for documentation ef96fd6 [Cheng Hao] update the code gen 72cf28f [Cheng Hao] Add more log for compilation error 4e11381 [Cheng Hao] Add regexp_replace / regexp_extract support
* [SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORCCheng Lian2015-07-216-20/+38
| | | | | | | | | | | This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python. Author: Cheng Lian <lian@databricks.com> Closes #7444 from liancheng/spark-9100 and squashes the following commits: 284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
* [SPARK-9161][SQL] codegen FormatNumberTarek Auel2015-07-201-14/+54
| | | | | | | | | | Jira https://issues.apache.org/jira/browse/SPARK-9161 Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7545 from tarekauel/SPARK-9161 and squashes the following commits: 21425c8 [Tarek Auel] [SPARK-9161][SQL] codegen FormatNumber
* [SPARK-9023] [SQL] Followup for #7456 (Efficiency improvements for ↵Josh Rosen2015-07-203-23/+161
| | | | | | | | | | | | | UnsafeRows in Exchange) This patch addresses code review feedback from #7456. Author: Josh Rosen <joshrosen@databricks.com> Closes #7551 from JoshRosen/unsafe-exchange-followup and squashes the following commits: 76dbdf8 [Josh Rosen] Add comments + more methods to UnsafeRowSerializer 3d7a1f2 [Josh Rosen] Add writeToStream() method to UnsafeRow
* [SPARK-9208][SQL] Remove variant of DataFrame string functions that accept ↵Reynold Xin2015-07-205-455/+74
| | | | | | | | | | | | | | column names. It can be ambiguous whether that is a string literal or a column name. cc marmbrus Author: Reynold Xin <rxin@databricks.com> Closes #7556 from rxin/str-exprs and squashes the following commits: 92afa83 [Reynold Xin] [SPARK-9208][SQL] Remove variant of DataFrame string functions that accept column names.
* [SPARK-9157] [SQL] codegen substringTarek Auel2015-07-201-43/+44
| | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-9157 Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7534 from tarekauel/SPARK-9157 and squashes the following commits: e65e3e9 [Tarek Auel] [SPARK-9157] indent fix 44e89f8 [Tarek Auel] [SPARK-9157] use EMPTY_UTF8 37d54c4 [Tarek Auel] Merge branch 'master' into SPARK-9157 60732ea [Tarek Auel] [SPARK-9157] created substringSQL in UTF8String 18c3576 [Tarek Auel] [SPARK-9157][SQL] remove slice pos 1a2e611 [Tarek Auel] [SPARK-9157][SQL] codegen substring
* [SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and ↵Josh Rosen2015-07-2012-24/+156
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | equality comparisons in Spark SQL This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort. The root of this problem is that `NaN > anything`, `NaN == anything`, and `NaN < anything` all return `false`. Per the design specified in SPARK-9079, we have decided that `NaN = NaN` should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value). In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147). Author: Josh Rosen <joshrosen@databricks.com> Closes #7194 from JoshRosen/nan and squashes the following commits: 983d4fc [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan 88bd73c [Josh Rosen] Fix Row.equals() a702e2e [Josh Rosen] normalization -> canonicalization a7267cf [Josh Rosen] Normalize NaNs in UnsafeRow fe629ae [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan fbb2a29 [Josh Rosen] Fix NaN comparisons in BinaryComparison expressions c1fd4fe [Josh Rosen] Fold NaN test into existing test framework b31eb19 [Josh Rosen] Uncomment failing tests 7fe67af [Josh Rosen] Support NaN == NaN (SPARK-9145) 58bad2c [Josh Rosen] Revert "Compare rows' string representations to work around NaN incomparability." fc6b4d2 [Josh Rosen] Update CodeGenerator 3998ef2 [Josh Rosen] Remove unused code a2ba2e7 [Josh Rosen] Fix prefix comparision for NaNs a30d371 [Josh Rosen] Compare rows' string representations to work around NaN incomparability. 6f03f85 [Josh Rosen] Fix bug in Double / Float ordering 42a1ad5 [Josh Rosen] Stop filtering NaNs in UnsafeExternalSortSuite bfca524 [Josh Rosen] Change ordering so that NaN is maximum value. 8d7be61 [Josh Rosen] Update randomized test to use ScalaTest's assume() b20837b [Josh Rosen] Add failing test for new NaN comparision ordering 5b88b2b [Josh Rosen] Fix compilation of CodeGenerationSuite d907b5b [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan 630ebc5 [Josh Rosen] Specify an ordering for NaN values. 9bf195a [Josh Rosen] Re-enable NaNs in CodeGenerationSuite to produce more regression tests 13fc06a [Josh Rosen] Add regression test for NaN sorting issue f9efbb5 [Josh Rosen] Fix ORDER BY NULL e7dc4fb [Josh Rosen] Add very generic test for ordering 7d5c13e [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL) b55875a [Josh Rosen] Generate doubles and floats over entire possible range. 5acdd5c [Josh Rosen] Infinity and NaN are interesting. ab76cbd [Josh Rosen] Move code to Catalyst package. d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.
* [SPARK-9132][SPARK-9163][SQL] codegen convTarek Auel2015-07-204-161/+263
| | | | | | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-9132 https://issues.apache.org/jira/browse/SPARK-9163 rxin as you proposed in the Jira ticket, I just moved the logic to a separate object. I haven't changed anything of the logic of `NumberConverter`. Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7552 from tarekauel/SPARK-9163 and squashes the following commits: 40dcde9 [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] style fix fa985bd [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] codegen conv
* [SPARK-9164] [SQL] codegen hex/unhexTarek Auel2015-07-201-39/+57
| | | | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-9164 The diff looks heavy, but I just moved the `hex` and `unhex` methods to `object Hex`. This allows me to call them from `eval` and `codeGen` Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7548 from tarekauel/SPARK-9164 and squashes the following commits: dd91c57 [Tarek Auel] [SPARK-9164][SQL] codegen hex/unhex
* [SPARK-9142][SQL] Removing unnecessary self types in expressions.Reynold Xin2015-07-203-17/+22
| | | | | | | | | | Also added documentation to expressions to explain the important traits and abstract classes. Author: Reynold Xin <rxin@databricks.com> Closes #7550 from rxin/remove-self-types and squashes the following commits: b2a3ec1 [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in expressions.
* [SPARK-9156][SQL] codegen StringSplitTarek Auel2015-07-201-4/+8
| | | | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-9156 Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7547 from tarekauel/SPARK-9156 and squashes the following commits: 0be2700 [Tarek Auel] [SPARK-9156][SQL] indention fix b860eaf [Tarek Auel] [SPARK-9156][SQL] codegen StringSplit 5ad6a1f [Tarek Auel] [SPARK-9156] codegen StringSplit
* [SPARK-8125] [SQL] Accelerates Parquet schema merging and partition discoveryCheng Lian2015-07-207-90/+250
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation` partition discovery. The acceleration is done by the following means: - Turning off schema merging by default Schema merging is not the most common case, but requires reading footers of all Parquet part-files and can be very slow. - Avoiding `FileSystem.globStatus()` call when possible `FileSystem.globStatus()` may issue multiple synchronous RPC calls, and can be very slow (esp. on S3). This PR adds `SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the path contain glob-pattern specific character(s) (`{}[]*?\`). This is especially useful when converting a metastore Parquet table with lots of partitions, since Spark SQL adds all partition directories as the input paths, and currently we do a `globStatus` call on each input path sequentially. - Listing leaf files in parallel when the number of input paths exceeds a threshold Listing leaf files is required by partition discovery. Currently it is done on driver side, and can be slow when there are lots of (nested) directories, since each `FileSystem.listStatus()` call issues an RPC. In this PR, we list leaf files in a BFS style, and resort to a Spark job once we found that the number of directories need to be listed exceed a threshold. The threshold is controlled by `SQLConf` option `spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32. - Discovering Parquet schema in parallel Currently, schema merging is also done on driver side, and needs to read footers of all part-files. This PR uses a Spark job to do schema merging. Together with task side metadata reading in Parquet 1.7.0, we never read any footers on driver side now. Author: Cheng Lian <lian@databricks.com> Closes #7396 from liancheng/accel-parquet and squashes the following commits: 5598efc [Cheng Lian] Uses ParquetInputFormat[InternalRow] instead of ParquetInputFormat[Row] ff32cd0 [Cheng Lian] Excludes directories while listing leaf files 3c580f1 [Cheng Lian] Fixes test failure caused by making "mergeSchema" default to "false" b1646aa [Cheng Lian] Should allow empty input paths 32e5f0d [Cheng Lian] Moves schema merging to executor side
* [SPARK-9160][SQL] codegen encode, decodeTarek Auel2015-07-201-2/+23
| | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-9160 Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7543 from tarekauel/SPARK-9160 and squashes the following commits: 7528f0e [Tarek Auel] [SPARK-9160][SQL] codegen encode, decode
* [SPARK-9159][SQL] codegen ascii, base64, unbase64Tarek Auel2015-07-202-7/+32
| | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-9159 Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7542 from tarekauel/SPARK-9159 and squashes the following commits: 772e6bc [Tarek Auel] [SPARK-9159][SQL] codegen ascii, base64, unbase64
* [SPARK-9155][SQL] codegen StringSpaceTarek Auel2015-07-201-5/+7
| | | | | | | | | | | | Jira https://issues.apache.org/jira/browse/SPARK-9155 Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7531 from tarekauel/SPARK-9155 and squashes the following commits: 423c426 [Tarek Auel] [SPARK-9155] language typo fix e34bd1b [Tarek Auel] [SPARK-9155] moved creation of blank string to UTF8String 4bc33e6 [Tarek Auel] [SPARK-9155] codegen StringSpace
* [SPARK-6910] [SQL] Support for pushing predicates down to metastore for ↵Cheng Lian2015-07-2011-45/+221
| | | | | | | | | | | | | | | | | | | partition pruning This PR forks PR #7421 authored by piaozhexiu and adds [a workaround] [1] for fixing the occasional test failures occurred in PR #7421. Please refer to these [two] [2] [comments] [3] for details. [1]: https://github.com/liancheng/spark/commit/536ac41a7e6b2abeb1f6ec1a6491bbf09ed3e591 [2]: https://github.com/apache/spark/pull/7421#issuecomment-122527391 [3]: https://github.com/apache/spark/pull/7421#issuecomment-122528059 Author: Cheolsoo Park <cheolsoop@netflix.com> Author: Cheng Lian <lian@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #7492 from liancheng/pr-7421-workaround and squashes the following commits: 5599cc4 [Cheolsoo Park] Predicate pushdown to hive metastore 536ac41 [Cheng Lian] Sets hive.metastore.integral.jdo.pushdown to true to workaround test failures caused by in #7421
* [SPARK-9114] [SQL] [PySpark] convert returned object from UDF into internal typeDavies Liu2015-07-202-39/+15
| | | | | | | | | | | | | | This PR also remove the duplicated code between registerFunction and UserDefinedFunction. cc JoshRosen Author: Davies Liu <davies@databricks.com> Closes #7450 from davies/fix_return_type and squashes the following commits: e80bf9f [Davies Liu] remove debugging code f94b1f6 [Davies Liu] fix mima 8f9c58b [Davies Liu] convert returned object from UDF into internal type
* [SQL] Remove space from DataFrame Scala/Java API.Reynold Xin2015-07-202-24/+0
| | | | | | | | | | I don't think this function is useful at all in Scala/Java, since users can easily compute n * space easily. Author: Reynold Xin <rxin@databricks.com> Closes #7530 from rxin/remove-space and squashes the following commits: c147873 [Reynold Xin] [SQL] Remove space from DataFrame Scala/Java API.
* [SPARK-9186][SQL] make deterministic describing the tree rather than the ↵Wenchen Fan2015-07-205-16/+22
| | | | | | | | | | expression Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7525 from cloud-fan/deterministic and squashes the following commits: 4189bfa [Wenchen Fan] make deterministic describing the tree rather than the expression
* [SPARK-9177][SQL] Reuse of calendar object in WeekOfYearTarek Auel2015-07-201-4/+11
| | | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-9177 rxin Are we sure that this is thread safe? chenghao-intel explained in another PR that every partition (if I remember correctly) uses one expression instance. This instance isn't used by multiple threads, is it? If not, we are fine. Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7516 from tarekauel/SPARK-9177 and squashes the following commits: 0c1313a [Tarek Auel] [SPARK-9177] utilize more powerful addMutableState 6e2f03f [Tarek Auel] Merge branch 'master' into SPARK-9177 a69ec92 [Tarek Auel] [SPARK-9177] address comment 6cfb180 [Tarek Auel] [SPARK-9177] calendar as lazy transient val ff97b09 [Tarek Auel] [SPARK-9177] Reuse calendar object in interpreted code and codegen
* [SPARK-9153][SQL] codegen StringLPad/StringRPadTarek Auel2015-07-202-2/+58
| | | | | | | | | | | Jira: https://issues.apache.org/jira/browse/SPARK-9153 Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7527 from tarekauel/SPARK-9153 and squashes the following commits: 3840c6b [Tarek Auel] [SPARK-9153] removed codegen fallback 92b6a5d [Tarek Auel] [SPARK-9153] codegen lpad/rpad
* [SPARK-9023] [SQL] Efficiency improvements for UnsafeRows in ExchangeJosh Rosen2015-07-198-116/+398
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This pull request aims to improve the performance of SQL's Exchange operator when shuffling UnsafeRows. It also makes several general efficiency improvements to Exchange. Key changes: - When performing hash partitioning, the old Exchange projected the partitioning columns into a new row then passed a `(partitioningColumRow: InternalRow, row: InternalRow)` pair into the shuffle. This is very inefficient because it ends up redundantly serializing the partitioning columns only to immediately discard them after the shuffle. After this patch's changes, Exchange now shuffles `(partitionId: Int, row: InternalRow)` pairs. This still isn't optimal, since we're still shuffling extra data that we don't need, but it's significantly more efficient than the old implementation; in the future, we may be able to further optimize this once we implement a new shuffle write interface that accepts non-key-value-pair inputs. - Exchange's `compute()` method has been significantly simplified; the new code has less duplication and thus is easier to understand. - When the Exchange's input operator produces UnsafeRows, Exchange will use a specialized `UnsafeRowSerializer` to serialize these rows. This serializer is significantly more efficient since it simply copies the UnsafeRow's underlying bytes. Note that this approach does not work for UnsafeRows that use the ObjectPool mechanism; I did not add support for this because we are planning to remove ObjectPool in the next few weeks. Author: Josh Rosen <joshrosen@databricks.com> Closes #7456 from JoshRosen/unsafe-exchange and squashes the following commits: 7e75259 [Josh Rosen] Fix cast in SparkSqlSerializer2Suite 0082515 [Josh Rosen] Some additional comments + small cleanup to remove an unused parameter a27cfc1 [Josh Rosen] Add missing newline 741973c [Josh Rosen] Add simple test of UnsafeRow shuffling in Exchange. 359c6a4 [Josh Rosen] Remove println() and add comments 93904e7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-exchange 8dd3ff2 [Josh Rosen] Exchange outputs UnsafeRows when its child outputs them dd9c66d [Josh Rosen] Fix for copying logic 035af21 [Josh Rosen] Add logic for choosing when to use UnsafeRowSerializer 7876f31 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-shuffle cbea80b [Josh Rosen] Add UnsafeRowSerializer 0f2ac86 [Josh Rosen] Import ordering 3ca8515 [Josh Rosen] Big code simplification in Exchange 3526868 [Josh Rosen] Iniitial cut at removing shuffle on KV pairs
* [SQL][DOC] Minor document fix in HadoopFsRelationProviderJacky Li2015-07-191-2/+2
| | | | | | | | | | | | Catch this while reading the code Author: Jacky Li <lee.unreal@gmail.com> Author: Jacky Li <jackylk@users.noreply.github.com> Closes #7524 from jackylk/patch-11 and squashes the following commits: b679011 [Jacky Li] fix doc e10e211 [Jacky Li] [SQL] Minor document fix in HadoopFsRelationProvider
* Code review feedback for the previous patch.Reynold Xin2015-07-191-2/+6
|
* [SPARK-9185][SQL] improve code gen for mutable states to support complex ↵Wenchen Fan2015-07-199-38/+42
| | | | | | | | | | | | initialization Sometimes we need more than one step to initialize the mutable states in code gen like https://github.com/apache/spark/pull/7516 Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7521 from cloud-fan/init and squashes the following commits: 2106445 [Wenchen Fan] improve code gen for mutable states
* [SPARK-9172][SQL] Make DecimalPrecision support for Intersect and ExceptLiang-Chi Hsieh2015-07-192-37/+104
| | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-9172 Simply make `DecimalPrecision` support for `Intersect` and `Except` in addition to `Union`. Besides, add unit test for `DecimalPrecision` as well. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7511 from viirya/more_decimalprecieion and squashes the following commits: 4d29d10 [Liang-Chi Hsieh] Fix code comment. 9fb0d49 [Liang-Chi Hsieh] Make DecimalPrecision support for Intersect and Except.
* [SPARK-8241][SQL] string function: concat_ws.Reynold Xin2015-07-198-20/+154
| | | | | | | | | | | | | | | I also changed the semantics of concat w.r.t. null back to the same behavior as Hive. That is to say, concat now returns null if any input is null. Author: Reynold Xin <rxin@databricks.com> Closes #7504 from rxin/concat_ws and squashes the following commits: 83fd950 [Reynold Xin] Fixed type casting. 3ae85f7 [Reynold Xin] Write null better. cdc7be6 [Reynold Xin] Added code generation for pure string mode. a61c4e4 [Reynold Xin] Updated comments. 2d51406 [Reynold Xin] [SPARK-8241][SQL] string function: concat_ws.
* [SPARK-8638] [SQL] Window Function Performance Improvements - CleanupHerman van Hovell2015-07-193-85/+51
| | | | | | | | | | | | This PR contains a few clean-ups that are a part of SPARK-8638: a few style issues got fixed, and a few tests were moved. Git commit message is wrong BTW :(... Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #7513 from hvanhovell/SPARK-8638-cleanup and squashes the following commits: 4e69d08 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
* [HOTFIX] [SQL] Fixes compilation error introduced by PR #7506Cheng Lian2015-07-191-2/+2
| | | | | | | | | | PR #7506 breaks master build because of compilation error. Note that #7506 itself looks good, but it seems that `git merge` did something stupid. Author: Cheng Lian <lian@databricks.com> Closes #7510 from liancheng/hotfix-for-pr-7506 and squashes the following commits: 7ea7e89 [Cheng Lian] Fixes compilation error
* [SQL] Make date/time functions more consistent with other database systems.Reynold Xin2015-07-197-281/+225
| | | | | | | | | | | | | | | | | | | | | This pull request fixes some of the problems in #6981. - Added date functions to `__all__` so they get exposed - Rename day_of_month -> dayofmonth - Rename day_in_year -> dayofyear - Rename week_of_year -> weekofyear - Removed "day" from Scala/Python API since it is ambiguous. Only leaving the alias in SQL. Author: Reynold Xin <rxin@databricks.com> This patch had conflicts when merged, resolved by Committer: Reynold Xin <rxin@databricks.com> Closes #7506 from rxin/datetime and squashes the following commits: 0cb24d9 [Reynold Xin] Export all functions in Python. e44a4a0 [Reynold Xin] Removed day function from Scala and Python. 9c08fdc [Reynold Xin] [SQL] Make date/time functions more consistent with other database systems.
* [SPARK-8199][SQL] follow up; revert change in testTarek Auel2015-07-191-2/+2
| | | | | | | | | | | | | | rxin / davies Sorry for that unnecessary change. And thanks again for all your support! Author: Tarek Auel <tarek.auel@googlemail.com> Closes #7505 from tarekauel/SPARK-8199-FollowUp and squashes the following commits: d09321c [Tarek Auel] [SPARK-8199] follow up; revert change in test c17397f [Tarek Auel] [SPARK-8199] follow up; revert change in test 67acfe6 [Tarek Auel] [SPARK-8199] follow up; revert change in test
* [SPARK-8638] [SQL] Window Function Performance ImprovementsHerman van Hovell2015-07-184-404/+765
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## Description Performance improvements for Spark Window functions. This PR will also serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA tickets SPARK-8638 and SPARK-7712 for more information. ## Improvements * Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current implementation in spark uses a sliding window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). This also means that all these aggregates all need to be updated separately, this takes N*(N-1)/2 updates. The running case differs from the Sliding case because we are only adding data to an aggregate function (no reset is required), we only need to maintain one aggregate (like in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach only uses 1 buffer, and only requires N updates; I am currently working on data with window sizes of 500-1000 doing running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are communitative, there is one twist though it will process the input buffer in reverse. * Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. * A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. This will be enabled in a follow-up PR. * A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. ## Benchmarking I have done a small benchmark using [on time performance](http://www.transtats.bts.gov) data of the month april. I have used the origin as a partioning key, as a result there is quite some variation in window sizes. The code for the benchmark can be found in the JIRA ticket. These are the results per Frame type: Frame | Master | SPARK-8638 ----- | ------ | ---------- Entire Frame | 2 s | 1 s Sliding | 18 s | 1 s Growing | 14 s | 0.9 s Shrinking | 13 s | 1 s Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #7057 from hvanhovell/SPARK-8638 and squashes the following commits: 3bfdc49 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase) 2eb3b33 [Herman van Hovell] Corrected reverse range frame processing. 2cd2d5b [Herman van Hovell] Corrected reverse range frame processing. b0654d7 [Herman van Hovell] Tests for exotic frame specifications. e75b76e [Herman van Hovell] More docs, added support for reverse sliding range frames, and some reorganization of code. 1fdb558 [Herman van Hovell] Changed Data In HiveDataFrameWindowSuite. ac2f682 [Herman van Hovell] Added a few more comments. 1938312 [Herman van Hovell] Added Documentation to the createBoundOrdering methods. bb020e6 [Herman van Hovell] Major overhaul of Window operator.
* Fixed test cases.Reynold Xin2015-07-181-2/+2
|