[SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files

Clarify what may cause long-running Spark apps to preserve shuffle files

Author: Sean Owen <sowen@cloudera.com>

Closes #6901 from srowen/SPARK-5836 and squashes the following commits:

a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files
author: Sean Owen <sowen@cloudera.com> 2015-06-19 11:03:04 -0700
committer: Andrew Or <andrew@databricks.com> 2015-06-19 11:03:04 -0700
commit: 4be53d0395d3c7f61eef6b7d72db078e2e1199a7 (patch)
tree: d8f64db7a270db524ffa209a4e1a0eda61de4f88 /docs/programming-guide.md
parent: 68a2dca292776d4a3f988353ba55adc73a7c1aa2 (diff)
1 files changed, 5 insertions, 3 deletions
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index d5ff416fe8..ae712d6274 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1144,9 +1144,11 @@ generate these on the reduce side. When data does not fit in memory Spark will s
 to disk, incurring the additional overhead of disk I/O and increased garbage collection.
 
 Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
-are not cleaned up from Spark's temporary storage until Spark is stopped, which means that
-long-running Spark jobs may consume available disk space. This is done so the shuffle doesn't need
-to be re-computed if the lineage is re-computed. The temporary storage directory is specified by the
+are preserved until the corresponding RDDs are no longer used and are garbage collected. 
+This is done so the shuffle files don't need to be re-created if the lineage is re-computed. 
+Garbage collection may happen only after a long period time, if the application retains references 
+to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may 
+consume a large amount of disk space. The temporary storage directory is specified by the
 `spark.local.dir` configuration parameter when configuring the Spark context.
 
 Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the
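
The revised wording ties shuffle file cleanup to garbage collection of the owning RDDs. The following is a toy model of that idea in plain Python, not Spark's implementation: `ToyShuffledDataset`, `_remove_if_present`, and `demo` are hypothetical names, and a temporary directory stands in for `spark.local.dir`. It only illustrates why a file tied to an object's lifetime stays on disk for as long as the application holds a reference to that object.

```python
import gc
import os
import tempfile
import weakref


def _remove_if_present(path):
    # Cleanup callback: delete the fake shuffle file if it still exists.
    if os.path.exists(path):
        os.remove(path)


class ToyShuffledDataset:
    """Hypothetical stand-in for an RDD whose shuffle wrote a file to disk."""

    def __init__(self, scratch_dir):
        # Write one fake "shuffle file" into the scratch directory.
        fd, self.shuffle_path = tempfile.mkstemp(prefix="shuffle_", dir=scratch_dir)
        os.close(fd)
        # Register cleanup that runs only once this object is garbage collected.
        # The callback receives the path string, not the object, so it does not
        # keep the object alive.
        weakref.finalize(self, _remove_if_present, self.shuffle_path)


def demo():
    # The temporary directory plays the role of spark.local.dir (assumption).
    with tempfile.TemporaryDirectory() as scratch_dir:
        dataset = ToyShuffledDataset(scratch_dir)
        path = dataset.shuffle_path

        # A collection pass cannot reclaim an object that is still referenced,
        # so the file stays on disk.
        gc.collect()
        kept_while_referenced = os.path.exists(path)

        # Dropping the last reference lets the object be collected, which
        # triggers the registered cleanup.
        del dataset
        gc.collect()
        removed_after_release = not os.path.exists(path)

        return kept_while_referenced, removed_after_release
```

In this sketch, `demo()` reports that the file survives while `dataset` is referenced and is gone once the reference is dropped and a collection has run. The analogous point for a long-running application is that retained references, or infrequent collection, delay the cleanup.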