| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
| |
|
|
|
|
| |
This reverts commit 5247dd859b95a440baa562b9827bdeb26aa6530e.
|
|
|
|
| |
This reverts commit 79df6b43ae762263a8120f423ddb4a0811dd4b6f.
|
| |
|
| |
|
|
|
|
| |
This reverts commit db7f4a898af22a02b36428507f8ef2b429d78dc1.
|
|
|
|
| |
This reverts commit d7b1ecb25676d228deb6fe05efdb4e2ab9c3e30b.
|
| |
|
| |
|
|
|
|
| |
This reverts commit 38c1fbd9694430cefd962c90bc36b0d108c6124b.
|
|
|
|
| |
This reverts commit d7ac6013483e83caff8ea54c228f37aeca159db8.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
```timeTaken``` should not count the time of printing result.
Author: w00228970 <wangfei1@huawei.com>
Closes #3423 from scwf/time-taken-bug and squashes the following commits:
da7e102 [w00228970] compute time taken correctly
(cherry picked from commit 723be60e233d0f85944d948efd06845ef546c9f5)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Re-implement the Python broadcast using file:
1) serialize the python object using cPickle, write into disks.
2) Create a wrapper in JVM (for the dumped file), it read data from during serialization
3) Using TorrentBroadcast or HttpBroadcast to transfer the data (compressed) into executors
4) During deserialization, writing the data into disk.
5) Passing the path into Python worker, read data from disk and unpickle it into python object, until the first access.
It fixes the performance regression introduced in #2659, has similar performance as 1.1, but support object larger than 2G, also improve the memory efficiency (only one compressed copy in driver and executor).
Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):
name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 25.20 | 9.33 | 170.13% |
python-broadcast-w-set | 4.13 | 4.50 | -8.35% |
Testing with 100 tasks (16 CPUs):
name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 38.16 | 8.40 | 353.98%
python-broadcast-w-set | 23.29 | 9.59 | 142.80%
Author: Davies Liu <davies@databricks.com>
Closes #3417 from davies/pybroadcast and squashes the following commits:
50a58e0 [Davies Liu] address comments
b98de1d [Davies Liu] disable gc while unpickle
e5ee6b9 [Davies Liu] support large string
09303b8 [Davies Liu] read all data into memory
dde02dd [Davies Liu] improve performance of python broadcast
(cherry picked from commit 6cf507685efd01df77d663145ae08e48c7f92948)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we use ORDER BY clause, at first, attributes referenced by projection are resolved (1).
And then, attributes referenced at ORDER BY clause are resolved (2).
But when resolving attributes referenced at ORDER BY clause, the resolution result generated in (1) is discarded so for example, following query fails.
SELECT c1 + c2 FROM mytable ORDER BY c1;
The query above fails because when resolving the attribute reference 'c1', the resolution result of 'c2' is discarded.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #3363 from sarutak/SPARK-4487 and squashes the following commits:
fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer
6e60c20 [Kousuke Saruta] Fixed conflicts
cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487
282d529 [Kousuke Saruta] Fixed attributes reference resolution error
b6123e6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into concat-feature
317b7fb [Kousuke Saruta] WIP
(cherry picked from commit dd1c9cb36cde8202cede8014b5641ae8a0197812)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This file is for Hive 0.13.1 I think.
Author: Daniel Darabos <darabos.daniel@gmail.com>
Closes #3432 from darabos/patch-2 and squashes the following commits:
4fd22ed [Daniel Darabos] Fix comment. This file is for Hive 0.13.1.
(cherry picked from commit d5834f0732b586731034a7df5402c25454770fc5)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
shuffle is on
This PR is a workaround for SPARK-4479. Two changes are introduced: when merge sort is bypassed in `ExternalSorter`,
1. also bypass RDD elements buffering as buffering is the reason that `MutableRow` backed row objects must be copied, and
2. avoids defensive copies in `Exchange` operator
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3422)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #3422 from liancheng/avoids-defensive-copies and squashes the following commits:
591f2e9 [Cheng Lian] Passes all shuffle suites
0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed
ed5df3c [Cheng Lian] Fixes styling changes
f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based shuffle is on
(cherry picked from commit a6d7b61f92dc7c1f9632cecb232afa8040ab2b4d)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is just a quick fix for 1.2. SPARK-4523 describes a more complete solution.
Author: Michael Armbrust <michael@databricks.com>
Closes #3392 from marmbrus/parquetMetadata and squashes the following commits:
bcc6626 [Michael Armbrust] Parse schema with missing metadata.
(cherry picked from commit 90a6a46bd11030672597f015dd443d954107123a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Goals:
- Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns)
- Support for folder based partitioning with automatic discovery of available partitions
- Caching of file metadata
See scaladoc of `ParquetRelation2` for more details.
Author: Michael Armbrust <michael@databricks.com>
Closes #3269 from marmbrus/newParquet and squashes the following commits:
1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once.
645768b [Michael Armbrust] Review comments.
abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API.
938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions.
e9d2641 [Michael Armbrust] logging / formatting improvements.
(cherry picked from commit 02ec058efe24348cdd3691b55942e6f0ef138732)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
parameters
Query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1` will throw exception, some of the Hive Generic UDF/UDAF requires the input object inspector is `ConstantObjectInspector`, however, we won't get that before the expression optimization executed. (Constant Folding).
This PR is a work around to fix this. (As ideally, the `output` of LogicalPlan should be identical before and after Optimization).
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3109 from chenghao-intel/optimized and squashes the following commits:
487ff79 [Cheng Hao] rebase to the latest master & update the unittest
(cherry picked from commit 84d79ee9ec47465269f7b0a7971176da93c96f3f)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Sample code in the description of SchemaRDD.where is not correct
Author: Jacky Li <jacky.likun@gmail.com>
Closes #3344 from jackylk/patch-6 and squashes the following commits:
62cd126 [Jacky Li] [SQL] fix function description mistake
(cherry picked from commit ad5f1f3ca240473261162c06ffc5aa70d15a5991)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Hive supports the `explain` the CTAS, which was supported by Spark SQL previously, however, seems it was reverted after the code refactoring in HiveQL.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3357 from chenghao-intel/explain and squashes the following commits:
7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command
(cherry picked from commit 6aa0fc9f4d95f09383cbcb5f79166c60697e6683)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Executing sum distinct for empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following commits:
8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318
66fdb0a [Takuya UESHIN] Re-refine aggregate functions.
6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate.
d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate.
1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions.
917e533 [Takuya UESHIN] Use aggregate instead of groupBy().
1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation.
a5a57d2 [Takuya UESHIN] Fix empty Average.
22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct.
65b7dd2 [Takuya UESHIN] Fix empty sum distinct.
(cherry picked from commit 2c2e7a44db2ebe44121226f3eac924a0668b991a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The relational operator '<=>' is not working in Spark SQL. Same works in Spark HiveQL
Author: ravipesala <ravindra.pesala@huawei.com>
Closes #3387 from ravipesala/<=> and squashes the following commits:
7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL
(cherry picked from commit 98e9419784a9ad5096cfd563fa9a433786a90bd4)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here's a simple fix for SchemaRDD to JSON.
Author: Dan McClary <dan.mcclary@gmail.com>
Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits:
d714e1d [Dan McClary] fixed PEP 8 error
cac2879 [Dan McClary] move pyspark comment and doctest to correct location
f9471d3 [Dan McClary] added pyspark doc and doctest
6598cee [Dan McClary] adding complex type queries
1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite
4a651f0 [Dan McClary] cleaned PEP and Scala style failures. Moved tests to JsonSuite
47ceff6 [Dan McClary] cleaned up scala style issues
2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD
4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting
8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON
1b11980 [Dan McClary] Map and UserDefinedTypes partially done
11d2016 [Dan McClary] formatting and unicode deserialization default fixed
6af72d1 [Dan McClary] deleted extaneous comment
4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD
149dafd [Dan McClary] wrapped scala toJSON in sql.py
5e5eb1b [Dan McClary] switched to Jackson for JSON processing
6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD
aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD
1d171aa [Dan McClary] upated missing brace on if statement
319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228
424f130 [Dan McClary] tests pass, ready for pull and PR
626a5b1 [Dan McClary] added toJSON to SchemaRDD
f7d166a [Dan McClary] added toJSON method
5d34e37 [Dan McClary] merge resolved
d6d19e9 [Dan McClary] pr example
(cherry picked from commit b8e6886fb8ff8f667fb7e600cd727d8649cad1d1)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR enables the Web UI storage tab to show the in-memory table name instead of the mysterious query plan string as the name of the in-memory columnar RDD.
Note that after #2501, a single columnar RDD can be shared by multiple in-memory tables, as long as their query results are the same. In this case, only the first cached table name is shown. For example:
```sql
CACHE TABLE first AS SELECT * FROM src;
CACHE TABLE second AS SELECT * FROM src;
```
The Web UI only shows "In-memory table first".
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3383)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #3383 from liancheng/columnar-rdd-name and squashes the following commits:
071907f [Cheng Lian] Fixes tests
12ddfa6 [Cheng Lian] Names in-memory columnar RDD with corresponding table name
(cherry picked from commit abf29187f0342b607fcefe269391d4db58d2a957)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
with literals on the left hand side
For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.
(This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3334)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:
0130897 [Cheng Lian] Fixes Parquet comparison filter generation
(cherry picked from commit 423baea953996a66dde671ff6db2fb1f32fbe8cb)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch will bring support for broadcasting objects larger than 2G.
pickle, zlib, FrameSerializer and Array[Byte] all can not support objects larger than 2G, so this patch introduce LargeObjectSerializer to serialize broadcast objects, the object will be serialized and compressed into small chunks, it also change the type of Broadcast[Array[Byte]]] into Broadcast[Array[Array[Byte]]]].
Testing for support broadcast objects larger than 2G is slow and memory hungry, so this is tested manually, could be added into SparkPerf.
Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>
Closes #2659 from davies/huge and squashes the following commits:
7b57a14 [Davies Liu] add more tests for broadcast
28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
a2f6a02 [Davies Liu] bug fix
4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
5875c73 [Davies Liu] address comments
10a349b [Davies Liu] address comments
0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
6182c8f [Davies Liu] Merge branch 'master' into huge
d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
2514848 [Davies Liu] address comments
fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
1c2d928 [Davies Liu] fix scala style
091b107 [Davies Liu] broadcast objects larger than 2G
(cherry picked from commit 4a377aff2d36b64a65b54192a987aba44b8f78e0)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
directory and the file
Author: Michael Armbrust <michael@databricks.com>
Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following commits:
447f08c [Michael Armbrust] Support partitioned parquet tables that have the key in both the directory and the file
(cherry picked from commit 90d72ec8502f7ec11d2fe42f08c884ad2159266f)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.
While generating `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter to find out all pushed down predicates [[1](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L213-L228)]. In this way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filter, and complicate the code base a lot.
The basic idea of this PR is that, we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3317)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:
d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic
(cherry picked from commit 36b0956a3eadc7343ed0d25c79a6ce0496eaaccd)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3308 from chenghao-intel/unwrap_constant_oi and squashes the following commits:
156b500 [Cheng Hao] rebase the master
c5b20ab [Cheng Hao] unwrap for the ConstantObjectInspector
(cherry picked from commit ef7c464effa1510b24bd8e665e4df6c4839b0c87)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The `totalSize` of external table is always zero, which will influence join strategy(always use broadcast join for external table).
Author: w00228970 <wangfei1@huawei.com>
Closes #3304 from scwf/statistics and squashes the following commits:
568f321 [w00228970] fix statistics for external table
(cherry picked from commit 42389b1780311d90499b4ce2315ceabf5b6ab384)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
for complex types
This PR is exactly the same as #3178 except it reverts the `FileStatus.isDir` to `FileStatus.isDirectory` change, since it doesn't compile with Hadoop 1.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3298)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #3298 from liancheng/date-for-thriftserver and squashes the following commits:
866037e [Cheng Lian] Revers isDirectory to isDir (it breaks Hadoop 1 profile)
6f71d0b [Cheng Lian] Makes toHiveString static
26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
a92882a [Cheng Lian] Updates HiveShim for 0.13.1
73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
(cherry picked from commit 6b7f2f753d16ff038881772f1958e3f4fd5597a7)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3217 from chenghao-intel/mutablerow and squashes the following commits:
e8a10bd [Cheng Hao] revert the change of Row object
4681aea [Cheng Hao] Add toMutableRow method in object Row
a751838 [Cheng Hao] Construct the MutableRow from an existed row
(cherry picked from commit 69e858cc7748b6babadd0cbe20e65f3982161cbf)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
`Cast` from `NaN` or `Infinity` of `Double` or `Float` to `TimestampType` throws `NumberFormatException`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #3283 from ueshin/issues/SPARK-4425 and squashes the following commits:
14def0c [Takuya UESHIN] Fix Cast to be able to handle NaN or Infinity to TimestampType.
(cherry picked from commit 566c791931645bfaaaf57ee5a15b9ffad534f81e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
DecimalType.
This is follow-up of [SPARK-4390](https://issues.apache.org/jira/browse/SPARK-4390) (#3256).
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #3278 from ueshin/issues/SPARK-4420 and squashes the following commits:
7fea558 [Takuya UESHIN] Add some tests.
cb2301a [Takuya UESHIN] Fix tests.
133bad5 [Takuya UESHIN] Change nullability of Cast from DoubleType/FloatType to DecimalType.
(cherry picked from commit 3a81a1c9e0963173534d96850f3c0b7a16350838)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is inspired by the [Parquet record filter generation code](https://github.com/apache/spark/blob/64c6b9bad559c21f25cd9fbe37c8813cdab939f2/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala#L387-L400).
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3318)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #3318 from liancheng/aggresive-conj-pushdown and squashes the following commits:
78b69d2 [Cheng Lian] Makes conjunction pushdown more aggressive
(cherry picked from commit 5ce7dae859dc273b0fc532c9456b5960b1eca399)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
| |
|
| |
|
|
|
|
| |
This reverts commit bc09875799aa373f4320d38b02618173ffa4c96f.
|
|
|
|
| |
This reverts commit 6c6fd218c83a049c874b8a0ea737333c1899c94a.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds a new operator that uses Spark's `ExternalSort` class. It is off by default now, but we might consider making it the default if benchmarks show that it does not regress performance.
Author: Michael Armbrust <michael@databricks.com>
Closes #3268 from marmbrus/externalSort and squashes the following commits:
48b9726 [Michael Armbrust] comments
b98799d [Michael Armbrust] Add test
afd7562 [Michael Armbrust] Add support for external sort.
(cherry picked from commit 64c6b9bad559c21f25cd9fbe37c8813cdab939f2)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
and fixes for complex types"
Author: Michael Armbrust <michael@databricks.com>
Closes #3292 from marmbrus/revert4309 and squashes the following commits:
808e96e [Michael Armbrust] Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types"
(cherry picked from commit 45ce3273cb618d14ec4d20c4c95699634b951086)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
for complex types
SPARK-4407 was detected while working on SPARK-4309. Merged these two into a single PR since 1.2.0 RC is approaching.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3178)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes #3178 from liancheng/date-for-thriftserver and squashes the following commits:
6f71d0b [Cheng Lian] Makes toHiveString static
26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
a92882a [Cheng Lian] Updates HiveShim for 0.13.1
73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
(cherry picked from commit cb6bd83a91d9b4a227dc6467255231869c1820e2)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The symbol of BitwiseOr is defined as '&' but I think it's wrong. It should be '|'.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #3284 from sarutak/bitwise-or-symbol-fix and squashes the following commits:
aff4be5 [Kousuke Saruta] Fixed symbol of BitwiseOr
(cherry picked from commit 84468b2e2031d646dcf035cb18947170ba326ccd)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add contains(key) to org.apache.spark.sql.catalyst.util.Metadata to test the existence of a key. Otherwise, Class Metadata's get methods may throw NoSuchElement exception if the key does not exist.
Testcases are added to MetadataSuite as well.
Author: kai <kaizeng@eecs.berkeley.edu>
Closes #3273 from kai-zeng/metadata-fix and squashes the following commits:
74b3d03 [kai] Added contains(key) to Metadata
(cherry picked from commit cbddac23696d89b672dce380cc7360a873e27b3b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The Spark ParquetRelation.scala code makes the assumption that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded then the code in enableLogForwarding has no affect.
ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no affect. If you look at the code in parquet.Log there's a static initializer that needs to be called prior to enableLogForwarding or whatever enableLogForwarding does gets undone by this static initializer.
The "fix" would be to force the static initializer to get called in parquet.Log as part of enableForwardLogging.
Author: Jim Carroll <jim@dontcallme.com>
Closes #3271 from jimfcarroll/parquet-logging and squashes the following commits:
37bdff7 [Jim Carroll] Fix Spark's control of Parquet logging.
(cherry picked from commit 37482ce5a7b875f17d32a5e8c561cc8e9772c9b3)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
parquet library
Since parquet library has been updated , we no longer need to filter the records returned from parquet library for null records , as now the library skips those :
from parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java
public boolean nextKeyValue() throws IOException, InterruptedException {
boolean recordFound = false;
while (!recordFound) {
// no more records left
if (current >= total)
{ return false; }
try {
checkRead();
currentValue = recordReader.read();
current ++;
if (recordReader.shouldSkipCurrentRecord())
{
// this record is being filtered via the filter2 package
if (DEBUG) LOG.debug("skipping record");
continue;
}
if (currentValue == null)
{
// only happens with FilteredRecordReader at end of block current = totalCountLoadedSoFar;
if (DEBUG) LOG.debug("filtered record reader reached end of block");
continue;
}
recordFound = true;
if (DEBUG) LOG.debug("read value: " + currentValue);
} catch (RuntimeException e)
{ throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e); }
}
return true;
}
Author: Yash Datta <Yash.Datta@guavus.com>
Closes #3229 from saucam/remove_filter and squashes the following commits:
8909ae9 [Yash Datta] SPARK-4365: Remove unnecessary filter call on records returned from parquet library
(cherry picked from commit 63ca3af66f9680fd12adee82fb4d342caae5cea4)
Signed-off-by: Michael Armbrust <michael@databricks.com>
|