From 0dbd411a562396e024c513936fde46b0d2f6d59d Mon Sep 17 00:00:00 2001 From: Tathagata Das Date: Sun, 13 Jan 2013 21:08:35 -0800 Subject: Added documentation for PairDStreamFunctions. --- docs/streaming-programming-guide.md | 45 ++++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 20 deletions(-) (limited to 'docs') diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md index 05a88ce7bd..b6da7af654 100644 --- a/docs/streaming-programming-guide.md +++ b/docs/streaming-programming-guide.md @@ -43,7 +43,7 @@ A complete list of input sources is available in the [StreamingContext API docum # DStream Operations -Once an input stream has been created, you can transform it using _stream operators_. Most of these operators return new DStreams which you can further transform. Eventually, you'll need to call an _output operator_, which forces evaluation of the stream by writing data out to an external source. +Once an input DStream has been created, you can transform it using _DStream operators_. Most of these operators return new DStreams which you can further transform. Eventually, you'll need to call an _output operator_, which forces evaluation of the DStream by writing data out to an external source. ## Transformations @@ -53,11 +53,11 @@ DStreams support many of the transformations available on normal Spark RDD's: TransformationMeaning map(func) - Return a new stream formed by passing each element of the source through a function func. + Returns a new DStream formed by passing each element of the source through a function func. filter(func) - Return a new stream formed by selecting those elements of the source on which func returns true. + Returns a new stream formed by selecting those elements of the source on which func returns true. flatMap(func) @@ -88,55 +88,60 @@ DStreams support many of the transformations available on normal Spark RDD's: cogroup(otherStream, [numTasks]) - When called on streams of type (K, V) and (K, W), returns a stream of (K, Seq[V], Seq[W]) tuples. This operation is also called groupWith. + When called on DStream of type (K, V) and (K, W), returns a DStream of (K, Seq[V], Seq[W]) tuples. reduce(func) - Create a new single-element stream by aggregating the elements of the stream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed correctly in parallel. + Returns a new DStream of single-element RDDs by aggregating the elements of the stream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed correctly in parallel. + + + transform(func) + Returns a new DStream by applying func (a RDD-to-RDD function) to every RDD of the stream. This can be used to do arbitrary RDD operations on the DStream. -Spark Streaming features windowed computations, which allow you to report statistics over a sliding window of data. All window functions take a windowTime, which represents the width of the window and a slideTime, which represents the frequency during which the window is calculated. +Spark Streaming features windowed computations, which allow you to report statistics over a sliding window of data. All window functions take a windowDuration, which represents the width of the window and a slideTime, which represents the frequency during which the window is calculated. - - + - - + - - + -
TransformationMeaning
window(windowTime, slideTime) Return a new stream which is computed based on windowed batches of the source stream. windowTime is the width of the window and slideTime is the frequency during which the window is calculated. Both times must be multiples of the batch interval. + window(windowDuration, slideTime) Return a new stream which is computed based on windowed batches of the source stream. windowDuration is the width of the window and slideTime is the frequency during which the window is calculated. Both times must be multiples of the batch interval.
countByWindow(windowTime, slideTime) Return a sliding count of elements in the stream. windowTime and slideTime are exactly as defined in window(). + countByWindow(windowDuration, slideTime) Return a sliding count of elements in the stream. windowDuration and slideDuration are exactly as defined in window().
reduceByWindow(func, windowTime, slideTime) Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel. windowTime and slideTime are exactly as defined in window(). + reduceByWindow(func, windowDuration, slideDuration) Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel. windowDuration and slideDuration are exactly as defined in window().
groupByKeyAndWindow(windowTime, slideTime, [numTasks]) + groupByKeyAndWindow(windowDuration, slideDuration, [numTasks]) When called on a stream of (K, V) pairs, returns a stream of (K, Seq[V]) pairs over a sliding window.
-Note: By default, this uses only 8 parallel tasks to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. windowTime and slideTime are exactly as defined in window(). +Note: By default, this uses only 8 parallel tasks to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. windowDuration and slideDuration are exactly as defined in window().
reduceByKeyAndWindow(func, [numTasks]) When called on a stream of (K, V) pairs, returns a stream of (K, V) pairs where the values for each key are aggregated using the given reduce function over batches within a sliding window. Like in groupByKeyAndWindow, the number of reduce tasks is configurable through an optional second argument. - windowTime and slideTime are exactly as defined in window(). + windowDuration and slideDuration are exactly as defined in window().
countByKeyAndWindow([numTasks]) When called on a stream of (K, V) pairs, returns a stream of (K, Int) pairs where the values for each key are the count within a sliding window. Like in countByKeyAndWindow, the number of reduce tasks is configurable through an optional second argument. - windowTime and slideTime are exactly as defined in window(). + windowDuration and slideDuration are exactly as defined in window().
+A complete list of DStream operations is available in the API documentation of [DStream](api/streaming/index.html#spark.streaming.DStream) and [PairDStreamFunctions](api/streaming/index.html#spark.streaming.PairDStreamFunctions). ## Output Operations When an output operator is called, it triggers the computation of a stream. Currently the following output operators are defined: @@ -144,7 +149,7 @@ When an output operator is called, it triggers the computation of a stream. Curr - + @@ -155,18 +160,18 @@ When an output operator is called, it triggers the computation of a stream. Curr - - + - +
OperatorMeaning
foreachRDD(func) foreach(func) The fundamental output operator. Applies a function, func, to each RDD generated from the stream. This function should have side effects, such as printing output, saving the RDD to external files, or writing it over the network to an external system.
saveAsObjectFiles(prefix, [suffix]) Save this DStream's contents as a SequenceFile of serialized objects. The file name at each batch interval is calculated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". + Save this DStream's contents as a SequenceFile of serialized objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsTextFiles(prefix, [suffix]) Save this DStream's contents as a text files. The file name at each batch interval is calculated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Save this DStream's contents as a text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]) Save this DStream's contents as a Hadoop file. The file name at each batch interval is calculated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Save this DStream's contents as a Hadoop file. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
-- cgit v1.2.3