aboutsummaryrefslogtreecommitdiff
path: root/dev
diff options
context:
space:
mode:
authorCheng Lian <lian.cs.zju@gmail.com>2014-10-05 11:19:17 -0700
committerMichael Armbrust <michael@databricks.com>2014-10-05 11:19:17 -0700
commit1b97a941a09a2f63d442f435c1b444d857cd6956 (patch)
tree74359b8449230a2d33b3740c6bd7ea3f19defd04 /dev
parenta7c73130f1b6b0b8b19a7b0a0de5c713b673cd7b (diff)
downloadspark-1b97a941a09a2f63d442f435c1b444d857cd6956.tar.gz
spark-1b97a941a09a2f63d442f435c1b444d857cd6956.tar.bz2
spark-1b97a941a09a2f63d442f435c1b444d857cd6956.zip
[SPARK-3007][SQL] Fixes dynamic partitioning support for lower Hadoop versions
This is a follow up of #2226 and #2616 to fix Jenkins master SBT build failures for lower Hadoop versions (1.0.x and 2.0.x). The root cause is the semantics difference of `FileSystem.globStatus()` between different versions of Hadoop, as illustrated by the following test code: ```scala object GlobExperiments extends App { val conf = new Configuration() val fs = FileSystem.getLocal(conf) fs.globStatus(new Path("/tmp/wh/*/*/*")).foreach { status => println(status.getPath) } } ``` Target directory structure: ``` /tmp/wh ├── dir0 │   ├── dir1 │   │   └── level2 │   └── level1 └── level0 ``` Hadoop 2.4.1 result: ``` file:/tmp/wh/dir0/dir1/level2 ``` Hadoop 1.0.4 resuet: ``` file:/tmp/wh/dir0/dir1/level2 file:/tmp/wh/dir0/level1 file:/tmp/wh/level0 ``` In #2226 and #2616, we call `FileOutputCommitter.commitJob()` at the end of the job, and the `_SUCCESS` mark file is written. When working with lower Hadoop versions, due to the `globStatus()` semantics issue, `_SUCCESS` is included as a separate partition data file by `Hive.loadDynamicPartitions()`, and fails partition spec checking. The fix introduced in this PR is kind of a hack: when inserting data with dynamic partitioning, we intentionally avoid writing the `_SUCCESS` marker to workaround this issue. Hive doesn't suffer this issue because `FileSinkOperator` doesn't call `FileOutputCommitter.commitJob()`, instead, it calls `Utilities.mvFileToFinalPath()` to cleanup the output directory and then loads it into Hive warehouse by with `loadDynamicPartitions()`/`loadPartition()`/`loadTable()`. This approach is better because it handles failed job and speculative tasks properly. We should add this step to `InsertIntoHiveTable` in another PR. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2663 from liancheng/dp-hadoop-1-fix and squashes the following commits: 0177dae [Cheng Lian] Fixes dynamic partitioning support for lower Hadoop versions
Diffstat (limited to 'dev')
0 files changed, 0 insertions, 0 deletions