[SPARK-1592][streaming] Automatically remove streaming input blocks

The raw input data is stored as blocks in BlockManagers. Earlier, these blocks were cleared by the cleaner TTL. Now that streaming does not require the cleaner TTL to be set, the blocks are never cleared. This increases Spark's memory usage, which is not even accounted for or shown in the Spark storage UI. It may cause the data blocks to spill over to disk, which eventually slows down the receiving of data (persisting to memory becomes bottlenecked by writing to disk).

The solution in this PR is to automatically remove those blocks. The mechanism to keep track of which BlockRDDs (which present the raw data blocks as an RDD) can be safely cleared already exists. This PR uses it to explicitly remove blocks from BlockRDDs.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #512 from tdas/block-rdd-unpersist and squashes the following commits:

d25e610 [Tathagata Das] Merge remote-tracking branch 'apache/master' into block-rdd-unpersist
5f46d69 [Tathagata Das] Merge remote-tracking branch 'apache/master' into block-rdd-unpersist
2c320cd [Tathagata Das] Updated configuration with spark.streaming.unpersist setting.
2d4b2fd [Tathagata Das] Automatically removed input blocks
author: Tathagata Das <tathagata.das1565@gmail.com> 2014-04-24 18:18:22 -0700
committer: Tathagata Das <tathagata.das1565@gmail.com> 2014-04-24 18:18:22 -0700
commit: 526a518bf32ad55b926a26f16086f445fd0ae29f (patch)
tree: dc4bcf8aa155aae8fa5e5bdeb40b47c423745b9d /docs
parent: 35e3d199f04fba3230625002a458d43b9578b2e8 (diff)
download: spark-526a518bf32ad55b926a26f16086f445fd0ae29f.tar.gz
spark-526a518bf32ad55b926a26f16086f445fd0ae29f.tar.bz2
spark-526a518bf32ad55b926a26f16086f445fd0ae29f.zip
1 files changed, 5 insertions, 2 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index e7e1dd56cf..8d3442625b 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -469,10 +469,13 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 <tr>
   <td>spark.streaming.unpersist</td>
-  <td>false</td>
+  <td>true</td>
   <td>
     Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from
-    Spark's memory. Setting this to true is likely to reduce Spark's RDD memory usage.
+    Spark's memory. The raw input data received by Spark Streaming is also automatically cleared.
+    Setting this to false will allow the raw data and persisted RDDs to be accessible outside the
+    streaming application as they will not be cleared automatically. But it comes at the cost of
+    higher memory usage in Spark.
   </td>
 </tr>
 <tr>
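
As an illustration of the documentation change above, the following is a minimal configuration sketch showing how an application might opt out of the new default and keep raw input blocks and persisted RDDs around. The application name, the `local[2]` master, and the one-second batch interval are assumptions chosen for the example and are not part of this commit; only the `spark.streaming.unpersist` key comes from the diff.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnpersistConfigSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical app name and master, used only for this sketch.
    val conf = new SparkConf()
      .setAppName("unpersist-config-sketch")
      .setMaster("local[2]")
      // After this commit the documented default is "true". Setting it to
      // "false" keeps raw input data and persisted RDDs from being cleared
      // automatically, at the cost of higher memory usage.
      .set("spark.streaming.unpersist", "false")

    // Assumed batch interval of one second.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Read the setting back to confirm what the context was configured with.
    println(conf.get("spark.streaming.unpersist"))

    // No streams are defined here, so release resources immediately.
    ssc.stop()
  }
}
```

Leaving the property unset should give the automatic clearing described in the updated table row, which is likely the better choice unless the data must stay accessible outside the streaming application.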