author     Josh Rosen <joshrosen@databricks.com>          2015-01-04 20:26:18 -0800
committer  Tathagata Das <tathagata.das1565@gmail.com>    2015-01-04 20:26:18 -0800
commit     939ba1f8f6e32fef9026cc43fce55b36e4b9bfd1 (patch)
tree       42259c4f15027fdda43ea817eea5feee19d48486 /docs
parent     e767d7ddac5c2330af553f2a74b8575dfc7afb67 (diff)
[SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs
This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery. Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists. SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.

In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times. In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions. When output spec. validation is enabled, the second calls to these actions will fail due to existing output.

This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler. This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:

36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.
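As background on the mechanism the commit message describes, here is a minimal, self-contained sketch of the `DynamicVariable` pattern. The names `disableOutputSpecValidation` and `isOutputSpecValidationEnabled` follow the squashed-commit messages above, but the standalone structure and signatures here are illustrative, not the actual Spark internals:

```scala
import scala.util.DynamicVariable

object OutputSpecDemo {
  // A DynamicVariable holds a value that can be overridden for the dynamic
  // extent of a withValue { ... } block and is restored automatically afterwards.
  val disableOutputSpecValidation: DynamicVariable[Boolean] =
    new DynamicVariable[Boolean](false)

  // Illustrative version of the deduplicated check: validation runs only if the
  // configuration enables it AND no caller has requested a bypass.
  def isOutputSpecValidationEnabled(confSetting: Boolean): Boolean =
    confSetting && !disableOutputSpecValidation.value

  def main(args: Array[String]): Unit = {
    val validateFromConf = true  // stand-in for spark.hadoop.validateOutputSpecs

    // Normal job submission path: validation stays on.
    println(isOutputSpecValidationEnabled(validateFromConf))  // true

    // Streaming-scheduler-style path: the bypass applies only inside this block,
    // without mutating SparkConf or introducing a global variable.
    disableOutputSpecValidation.withValue(true) {
      println(isOutputSpecValidationEnabled(validateFromConf))  // false
    }

    // Outside the block the previous value is restored.
    println(isOutputSpecValidationEnabled(validateFromConf))  // true
  }
}
```

The appeal of this approach, per the commit message, is that the bypass is scoped to the jobs the streaming scheduler submits and does not leak into user-submitted batch jobs running in the same application.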
Diffstat (limited to 'docs')
-rw-r--r--  docs/configuration.md | 4
1 file changed, 3 insertions(+), 1 deletion(-)
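For reference, the property documented in the change below is set like any other SparkConf entry. A minimal sketch, assuming a local standalone application (the app name, master, and output path are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical batch application that intentionally writes into a pre-existing
// output directory. Per the documentation text below, disabling the check is
// only recommended for compatibility with earlier Spark versions.
object ValidateOutputSpecsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ValidateOutputSpecsExample")
      .setMaster("local[2]")
      .set("spark.hadoop.validateOutputSpecs", "false")
    val sc = new SparkContext(conf)

    // saveAsTextFile would normally throw if "/tmp/example-output" already
    // exists; with validation disabled, the existing-directory check is skipped.
    sc.parallelize(Seq("a", "b", "c")).saveAsTextFile("/tmp/example-output")

    sc.stop()
  }
}
```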
diff --git a/docs/configuration.md b/docs/configuration.md
index fa9d311f85..9bb6499993 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -709,7 +709,9 @@ Apart from these, the following properties are also available, and may be useful
<td>If set to true, validates the output specification (e.g. checking if the output directory already exists)
used in saveAsHadoopFile and other variants. This can be disabled to silence exceptions due to pre-existing
output directories. We recommend that users do not disable this except if trying to achieve compatibility with
- previous versions of Spark. Simply use Hadoop's FileSystem API to delete output directories by hand.</td>
+ previous versions of Spark. Simply use Hadoop's FileSystem API to delete output directories by hand.
+ This setting is ignored for jobs generated through Spark Streaming's StreamingContext, since
+ data may need to be rewritten to pre-existing output directories during checkpoint recovery.</td>
</tr>
<tr>
<td><code>spark.hadoop.cloneConf</code></td>