diff options
author | Cheng Lian <lian.cs.zju@gmail.com> | 2014-10-03 12:26:02 -0700 |
---|---|---|
committer | Patrick Wendell <pwendell@gmail.com> | 2014-10-03 12:26:02 -0700 |
commit | bec0d0eaa33811fde72b84f7d53a6f6031e7b5d3 (patch) | |
tree | 5a5d3be1918748b19142c2686e96315d57a6a4d8 /docs/mllib-optimization.md | |
parent | fbe8e9856b23262193105e7bf86075f516f0db25 (diff) | |
download | spark-bec0d0eaa33811fde72b84f7d53a6f6031e7b5d3.tar.gz spark-bec0d0eaa33811fde72b84f7d53a6f6031e7b5d3.tar.bz2 spark-bec0d0eaa33811fde72b84f7d53a6f6031e7b5d3.zip |
[SPARK-3007][SQL] Adds dynamic partitioning support
PR #2226 was reverted because it broke Jenkins builds for unknown reason. This debugging PR aims to fix the Jenkins build.
This PR also fixes two bugs:
1. Compression configurations in `InsertIntoHiveTable` are disabled by mistake
The `FileSinkDesc` object passed to the writer container doesn't have compression related configurations. These configurations are not taken care of until `saveAsHiveFile` is called. This PR moves compression code forward, right after instantiation of the `FileSinkDesc` object.
1. `PreInsertionCasts` doesn't take table partitions into account
In `castChildOutput`, `table.attributes` only contains non-partition columns, thus for partitioned table `childOutputDataTypes` never equals to `tableOutputDataTypes`. This results funny analyzed plan like this:
```
== Analyzed Logical Plan ==
InsertIntoTable Map(partcol1 -> None, partcol2 -> None), false
MetastoreRelation default, dynamic_part_table, None
Project [c_0#1164,c_1#1165,c_2#1166]
Project [c_0#1164,c_1#1165,c_2#1166]
Project [c_0#1164,c_1#1165,c_2#1166]
... (repeats 99 times) ...
Project [c_0#1164,c_1#1165,c_2#1166]
Project [c_0#1164,c_1#1165,c_2#1166]
Project [1 AS c_0#1164,1 AS c_1#1165,1 AS c_2#1166]
Filter (key#1170 = 150)
MetastoreRelation default, src, None
```
Awful though this logical plan looks, it's harmless because all projects will be eliminated by optimizer. Guess that's why this issue hasn't been caught before.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: baishuo(白硕) <vc_java@hotmail.com>
Author: baishuo <vc_java@hotmail.com>
Closes #2616 from liancheng/dp-fix and squashes the following commits:
21935b6 [Cheng Lian] Adds back deleted trailing space
f471c4b [Cheng Lian] PreInsertionCasts should take table partitions into account
a132c80 [Cheng Lian] Fixes output compression
9c6eb2d [Cheng Lian] Adds tests to verify dynamic partitioning folder layout
0eed349 [Cheng Lian] Addresses @yhuai's comments
26632c3 [Cheng Lian] Adds more tests
9227181 [Cheng Lian] Minor refactoring
c47470e [Cheng Lian] Refactors InsertIntoHiveTable to a Command
6fb16d7 [Cheng Lian] Fixes typo in test name, regenerated golden answer files
d53daa5 [Cheng Lian] Refactors dynamic partitioning support
b821611 [baishuo] pass check style
997c990 [baishuo] use HiveConf.DEFAULTPARTITIONNAME to replace hive.exec.default.partition.name
761ecf2 [baishuo] modify according micheal's advice
207c6ac [baishuo] modify for some bad indentation
caea6fb [baishuo] modify code to pass scala style checks
b660e74 [baishuo] delete a empty else branch
cd822f0 [baishuo] do a little modify
8e7268c [baishuo] update file after test
3f91665 [baishuo(白硕)] Update Cast.scala
8ad173c [baishuo(白硕)] Update InsertIntoHiveTable.scala
051ba91 [baishuo(白硕)] Update Cast.scala
d452eb3 [baishuo(白硕)] Update HiveQuerySuite.scala
37c603b [baishuo(白硕)] Update InsertIntoHiveTable.scala
98cfb1f [baishuo(白硕)] Update HiveCompatibilitySuite.scala
6af73f4 [baishuo(白硕)] Update InsertIntoHiveTable.scala
adf02f1 [baishuo(白硕)] Update InsertIntoHiveTable.scala
1867e23 [baishuo(白硕)] Update SparkHadoopWriter.scala
6bb5880 [baishuo(白硕)] Update HiveQl.scala
Diffstat (limited to 'docs/mllib-optimization.md')
0 files changed, 0 insertions, 0 deletions