From 07f261876930a07161a7bcdb713430113e9e9ec8 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Mon, 25 Feb 2013 15:09:46 -0800 Subject: Minor changes based on feedback --- docs/streaming-programming-guide.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'docs') diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md index d58d030ebd..4b351d07e5 100644 --- a/docs/streaming-programming-guide.md +++ b/docs/streaming-programming-guide.md @@ -286,9 +286,9 @@ For input streams that receive data from the network (that is, subclasses of Net Note that, unlike RDDs, the default persistence level of DStreams keeps the data serialized in memory. This is further discussed in the [Performance Tuning](#memory-tuning) section. More information on different persistence levels can be found in [Spark Programming Guide](scala-programming-guide.html#rdd-persistence). # RDD Checkpointing within DStreams -A _stateful operation_ is one which operates over multiple batches of data. This includes all window-based operations and the `updateStateByKey` operation. Because stateful operations have an infinitely growing lineage, their state must be periodically checkpointed to HDFS. +A _stateful operation_ is one which operates over multiple batches of data. This includes all window-based operations and the `updateStateByKey` operation. -Checkpointing prevents extra computation during both failure recovery and normal operations by transparently truncating this lineage. Note that checkpointing also incurs the cost of saving to HDFS which may cause the corresponding batch to take longer to process. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too slowly causes the lineage and task sizes to grow which may have detrimental effects. Typically, a checkpoint interval of 5 - 10 times of sliding interval of a DStream is good setting to try. +Because stateful operations have an infinitely growing lineage, their state must be periodically checkpointed to HDFS. Checkpointing truncates the lineage graph and stores the of the DStream in HDFS. Note that checkpointing also incurs the cost of saving to HDFS which may cause the corresponding batch to take longer to process. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too slowly causes the lineage and task sizes to grow which may have detrimental effects. Typically, a checkpoint interval of 5 - 10 times of sliding interval of a DStream is good setting to try. To enable checkpointing, the developer has to provide the HDFS path to which RDD will be saved. This is done by using -- cgit v1.2.3 From 918ee258673b2364730fe38fe781cb5167870e7a Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Mon, 25 Feb 2013 15:24:17 -0800 Subject: One more change done with TD --- docs/streaming-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'docs') diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md index 4b351d07e5..e1407a49b0 100644 --- a/docs/streaming-programming-guide.md +++ b/docs/streaming-programming-guide.md @@ -288,7 +288,7 @@ Note that, unlike RDDs, the default persistence level of DStreams keeps the data # RDD Checkpointing within DStreams A _stateful operation_ is one which operates over multiple batches of data. This includes all window-based operations and the `updateStateByKey` operation. -Because stateful operations have an infinitely growing lineage, their state must be periodically checkpointed to HDFS. Checkpointing truncates the lineage graph and stores the of the DStream in HDFS. Note that checkpointing also incurs the cost of saving to HDFS which may cause the corresponding batch to take longer to process. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too slowly causes the lineage and task sizes to grow which may have detrimental effects. Typically, a checkpoint interval of 5 - 10 times of sliding interval of a DStream is good setting to try. +Because stateful operations have a dependency on previous batches of data, they continuously accumulate metadata over time. To clear this meta-data, streaming supports periodic _checkpointing_ by saving intermediate data to HDFS. Note that checkpointing also incurs the cost of saving to HDFS which may cause the corresponding batch to take longer to process. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too slowly causes the lineage and task sizes to grow which may have detrimental effects. Typically, a checkpoint interval of 5 - 10 times of sliding interval of a DStream is good setting to try. To enable checkpointing, the developer has to provide the HDFS path to which RDD will be saved. This is done by using -- cgit v1.2.3 From 8316534eefe7e6b2d109ead53cf5f87c8cd50388 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Mon, 25 Feb 2013 15:27:04 -0800 Subject: meta-data --- docs/streaming-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'docs') diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md index e1407a49b0..de2fa569d0 100644 --- a/docs/streaming-programming-guide.md +++ b/docs/streaming-programming-guide.md @@ -288,7 +288,7 @@ Note that, unlike RDDs, the default persistence level of DStreams keeps the data # RDD Checkpointing within DStreams A _stateful operation_ is one which operates over multiple batches of data. This includes all window-based operations and the `updateStateByKey` operation. -Because stateful operations have a dependency on previous batches of data, they continuously accumulate metadata over time. To clear this meta-data, streaming supports periodic _checkpointing_ by saving intermediate data to HDFS. Note that checkpointing also incurs the cost of saving to HDFS which may cause the corresponding batch to take longer to process. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too slowly causes the lineage and task sizes to grow which may have detrimental effects. Typically, a checkpoint interval of 5 - 10 times of sliding interval of a DStream is good setting to try. +Because stateful operations have a dependency on previous batches of data, they continuously accumulate metadata over time. To clear this metadata, streaming supports periodic _checkpointing_ by saving intermediate data to HDFS. Note that checkpointing also incurs the cost of saving to HDFS which may cause the corresponding batch to take longer to process. Hence, the interval of checkpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too slowly causes the lineage and task sizes to grow which may have detrimental effects. Typically, a checkpoint interval of 5 - 10 times of sliding interval of a DStream is good setting to try. To enable checkpointing, the developer has to provide the HDFS path to which RDD will be saved. This is done by using -- cgit v1.2.3