author | CodingCat <zhunansjtu@gmail.com> | 2014-03-01 17:27:54 -0800 |
---|---|---|
committer | Patrick Wendell <pwendell@gmail.com> | 2014-03-01 17:27:54 -0800 |
commit | 3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1 (patch) | |
tree | eecd608e4a856adf94ba8c901a53fd67410ffdf4 /sbt | |
parent | fe195ae113941766b3921b1e4ec222ed830b5b8f (diff) | |
download | spark-3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1.tar.gz spark-3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1.tar.bz2 spark-3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1.zip |
[SPARK-1100] prevent Spark from overwriting directory silently
Thanks to Diana Carroll for reporting this issue (https://spark-project.atlassian.net/browse/SPARK-1100).
The current saveAsTextFile/SequenceFile silently overwrites the output directory if it already exists. This behaviour is undesirable because:
- silently overwriting data is not user-friendly
- if the number of partitions differs between two write operations, the output directory ends up containing results from both runs
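The second point can be reproduced without Spark. The following is only a rough illustration on a local temporary directory: `write_parts` and `stale_parts_demo` are hypothetical helpers that mimic writing one `part-NNNNN` file per partition, not Spark's actual writer.

```python
import os
import tempfile

def write_parts(out_dir, records, num_partitions):
    """Hypothetical stand-in for a save: one part-NNNNN file per partition, no cleanup."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(num_partitions):
        path = os.path.join(out_dir, "part-%05d" % i)
        with open(path, "w") as f:
            # Round-robin assignment of records to partitions.
            f.write("\n".join(records[i::num_partitions]))

def stale_parts_demo():
    """Write 4 partitions, then 2 partitions, into the same directory."""
    out_dir = os.path.join(tempfile.mkdtemp(), "out")
    write_parts(out_dir, ["a", "b", "c", "d"], 4)  # first run
    write_parts(out_dir, ["x", "y"], 2)            # second run, fewer partitions
    contents = {}
    for name in sorted(os.listdir(out_dir)):
        with open(os.path.join(out_dir, name)) as f:
            contents[name] = f.read()
    return contents
```

In this sketch the second run rewrites only `part-00000` and `part-00001`, so `part-00002` and `part-00003` from the first run remain in the directory alongside the new data.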
My fix includes:
- new APIs with a flag that lets the user choose whether to overwrite the directory:
  - if the flag is set to true, the output directory is deleted first and the new data is then written, so that the directory never contains results from multiple runs;
  - if the flag is set to false, Spark throws an exception if the output directory already exists
- the corresponding changes to the Java API
- the default behaviour is to overwrite
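The flag semantics described above can be sketched as follows. This is a local-filesystem analogue under stated assumptions, not the API added by this patch: `save_lines` and its `overwrite` parameter are hypothetical names, and a single `part-00000` file stands in for the real partitioned output.

```python
import os
import shutil

def save_lines(out_dir, lines, overwrite=True):
    """Hypothetical analogue of the new flag; defaults to overwriting, as in the list above."""
    if os.path.exists(out_dir):
        if not overwrite:
            # Flag is false: refuse to touch an existing directory.
            raise FileExistsError("Output directory %s already exists" % out_dir)
        # Flag is true: delete first so no files from earlier runs survive.
        shutil.rmtree(out_dir)
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, "part-00000"), "w") as f:
        f.write("\n".join(lines))
```

Deleting the whole directory before writing, rather than replacing files one by one, is what keeps leftover part files from an earlier run with more partitions out of the result.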
Two questions:
- should we deprecate the old APIs that lack this flag?
- Spark Streaming also calls these APIs. I assume we don't need to change the related part in streaming? @tdas
Author: CodingCat <zhunansjtu@gmail.com>
Closes #11 from CodingCat/SPARK-1100 and squashes the following commits:
6a4e3a3 [CodingCat] code clean
ef2d43f [CodingCat] add new test cases and code clean
ac63136 [CodingCat] checkOutputSpecs not applicable to FSOutputFormat
ec490e8 [CodingCat] prevent Spark from overwriting directory silently and leaving dirty directory
Diffstat (limited to 'sbt')
0 files changed, 0 insertions, 0 deletions