author: Cheng Lian <lian.cs.zju@gmail.com>  2014-10-05 11:19:17 -0700
committer: Michael Armbrust <michael@databricks.com>  2014-10-05 11:19:17 -0700
commit: 1b97a941a09a2f63d442f435c1b444d857cd6956
tree: 74359b8449230a2d33b3740c6bd7ea3f19defd04 (dev)
parent: a7c73130f1b6b0b8b19a7b0a0de5c713b673cd7b
[SPARK-3007][SQL] Fixes dynamic partitioning support for lower Hadoop versions
This is a follow up of #2226 and #2616 to fix Jenkins master SBT build failures for lower Hadoop versions (1.0.x and 2.0.x).
The root cause is the semantics difference of `FileSystem.globStatus()` between different versions of Hadoop, as illustrated by the following test code:
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object GlobExperiments extends App {
  val conf = new Configuration()
  val fs = FileSystem.getLocal(conf)

  fs.globStatus(new Path("/tmp/wh/*/*/*")).foreach { status =>
    println(status.getPath)
  }
}
```
Target directory structure:
```
/tmp/wh
├── dir0
│ ├── dir1
│ │ └── level2
│ └── level1
└── level0
```
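The layout above can be reproduced locally for experimentation (the `/tmp/wh` path is the one used by the test snippet; `level0`/`level1`/`level2` are plain empty files):

```shell
# Recreate the target directory structure under /tmp/wh
mkdir -p /tmp/wh/dir0/dir1
touch /tmp/wh/dir0/dir1/level2
touch /tmp/wh/dir0/level1
touch /tmp/wh/level0
```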
Hadoop 2.4.1 result:
```
file:/tmp/wh/dir0/dir1/level2
```
Hadoop 1.0.4 result:
```
file:/tmp/wh/dir0/dir1/level2
file:/tmp/wh/dir0/level1
file:/tmp/wh/level0
```
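For intuition, ordinary shell globbing behaves like Hadoop 2.4.1 here: a three-component pattern only matches paths that are exactly three levels deep. A self-contained sketch (using a temporary directory rather than `/tmp/wh` to avoid clobbering anything):

```shell
# Recreate the same layout in a temp dir, then glob with a three-component pattern
wh=$(mktemp -d)
mkdir -p "$wh/dir0/dir1"
touch "$wh/dir0/dir1/level2" "$wh/dir0/level1" "$wh/level0"

# Only dir0/dir1/level2 has exactly three path components below $wh,
# so it is the only match -- analogous to Hadoop 2.4.1's globStatus()
ls -d "$wh"/*/*/*
```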
In #2226 and #2616, we call `FileOutputCommitter.commitJob()` at the end of the job, which writes the `_SUCCESS` marker file. On lower Hadoop versions, due to the `globStatus()` semantics issue, `Hive.loadDynamicPartitions()` picks up `_SUCCESS` as a separate partition data file, which then fails partition spec checking. The fix introduced in this PR is kind of a hack: when inserting data with dynamic partitioning, we intentionally avoid writing the `_SUCCESS` marker to work around this issue.
Hive doesn't suffer from this issue because `FileSinkOperator` doesn't call `FileOutputCommitter.commitJob()`; instead, it calls `Utilities.mvFileToFinalPath()` to clean up the output directory and then loads it into the Hive warehouse with `loadDynamicPartitions()`/`loadPartition()`/`loadTable()`. This approach is better because it handles failed jobs and speculative tasks properly. We should add this step to `InsertIntoHiveTable` in another PR.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #2663 from liancheng/dp-hadoop-1-fix and squashes the following commits:
0177dae [Cheng Lian] Fixes dynamic partitioning support for lower Hadoop versions
Diffstat (limited to 'dev')
0 files changed, 0 insertions, 0 deletions