Diffstat (limited to 'docs/streaming-programming-guide.md')
-rw-r--r--  docs/streaming-programming-guide.md  56
1 file changed, 28 insertions(+), 28 deletions(-)
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index f9904d4501..946d6c4879 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -40,7 +40,7 @@ Spark Streaming provides a high-level abstraction called *discretized stream* or
which represents a continuous stream of data. DStreams can be created either from input data
streams from sources such as Kafka and Flume, or by applying high-level
operations on other DStreams. Internally, a DStream is represented as a sequence of
-[RDDs](api/core/index.html#org.apache.spark.rdd.RDD).
+[RDDs](api/scala/index.html#org.apache.spark.rdd.RDD).
This guide shows you how to start writing Spark Streaming programs with DStreams. You can
write Spark Streaming programs in Scala or Java, both of which are presented in this guide. You
@@ -62,7 +62,7 @@ First, we import the names of the Spark Streaming classes, and some implicit
conversions from StreamingContext into our environment, to add useful methods to
other classes we need (like DStream).
-[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) is the
+[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) is the
main entry point for all streaming functionality.
{% highlight scala %}
@@ -71,7 +71,7 @@ import org.apache.spark.streaming.StreamingContext._
{% endhighlight %}
Then we create a
-[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) object.
+[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) object.
Besides Spark's configuration, we specify that any DStream will be processed
in 1 second batches.
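
As a rough sketch of the setup this hunk describes (the master URL and app name are placeholder values, not from the guide):

{% highlight scala %}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // implicit conversions for DStreams

// Placeholder configuration: a local master with two threads and an example app name.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
// Batch interval of 1 second, as described above.
val ssc = new StreamingContext(conf, Seconds(1))
{% endhighlight %}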
@@ -132,7 +132,7 @@ The complete code can be found in the Spark Streaming example
<div data-lang="java" markdown="1">
First, we create a
-[JavaStreamingContext](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) object,
+[JavaStreamingContext](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) object,
which is the main entry point for all streaming
functionality. Besides Spark's configuration, we specify that any DStream will be processed
in 1 second batches.
@@ -168,7 +168,7 @@ JavaDStream<String> words = lines.flatMap(
generating multiple new records from each record in the source DStream. In this case,
each line will be split into multiple words and the stream of words is represented as the
`words` DStream. Note that we defined the transformation using a
-[FlatMapFunction](api/core/index.html#org.apache.spark.api.java.function.FlatMapFunction) object.
+[FlatMapFunction](api/scala/index.html#org.apache.spark.api.java.function.FlatMapFunction) object.
As we will discover along the way, there are a number of such convenience classes in the Java API
that help define DStream transformations.
@@ -192,9 +192,9 @@ wordCounts.print(); // Print a few of the counts to the console
{% endhighlight %}
The `words` DStream is further mapped (one-to-one transformation) to a DStream of `(word,
-1)` pairs, using a [PairFunction](api/core/index.html#org.apache.spark.api.java.function.PairFunction)
+1)` pairs, using a [PairFunction](api/scala/index.html#org.apache.spark.api.java.function.PairFunction)
object. Then, it is reduced to get the frequency of words in each batch of data,
-using a [Function2](api/core/index.html#org.apache.spark.api.java.function.Function2) object.
+using a [Function2](api/scala/index.html#org.apache.spark.api.java.function.Function2) object.
Finally, `wordCounts.print()` will print a few of the counts generated every second.
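
For comparison, the same word-count pipeline in Scala is a short sketch (it assumes `lines` is a `DStream[String]`, e.g. created with `ssc.socketTextStream("localhost", 9999)`; the host and port are placeholders):

{% highlight scala %}
// Split each line into words (one-to-many), pair each word with 1 (one-to-one),
// then reduce per batch to get the frequency of each word.
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()   // print a few of the counts generated every batch
{% endhighlight %}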
Note that when these lines are executed, Spark Streaming only sets up the computation it
@@ -333,7 +333,7 @@ for the full list of supported sources and artifacts.
<div data-lang="scala" markdown="1">
To initialize a Spark Streaming program in Scala, a
-[`StreamingContext`](api/streaming/index.html#org.apache.spark.streaming.StreamingContext)
+[`StreamingContext`](api/scala/index.html#org.apache.spark.streaming.StreamingContext)
object has to be created, which is the main entry point of all Spark Streaming functionality.
A `StreamingContext` object can be created by using
@@ -344,7 +344,7 @@ new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])
<div data-lang="java" markdown="1">
To initialize a Spark Streaming program in Java, a
-[`JavaStreamingContext`](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
+[`JavaStreamingContext`](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
object has to be created, which is the main entry point of all Spark Streaming functionality.
A `JavaStreamingContext` object can be created by using
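
(The Scala form shown in the previous hunk, `new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])`, might be invoked as follows; the master URL, app name, and batch interval are placeholder values and the optional arguments are omitted.)

{% highlight scala %}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
{% endhighlight %}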
@@ -431,8 +431,8 @@ and process any files created in that directory. Note that
For more details on streams from files, Akka actors and sockets,
see the API documentations of the relevant functions in
-[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) for
-Scala and [JavaStreamingContext](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
+[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) for
+Scala and [JavaStreamingContext](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
for Java.
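
As a brief illustration of the file and socket sources discussed here (the directory path, host, and port are placeholders):

{% highlight scala %}
// Monitor a directory for new files and read them as text.
val fileLines = ssc.textFileStream("hdfs://namenode:8020/data/incoming")
// Receive text data over a TCP socket.
val socketLines = ssc.socketTextStream("localhost", 9999)
{% endhighlight %}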
Additional functionality for creating DStreams from sources such as Kafka, Flume, and Twitter
@@ -802,10 +802,10 @@ output operators are defined:
The complete list of DStream operations is available in the API documentation. For the Scala API,
-see [DStream](api/streaming/index.html#org.apache.spark.streaming.dstream.DStream)
-and [PairDStreamFunctions](api/streaming/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions).
-For the Java API, see [JavaDStream](api/streaming/index.html#org.apache.spark.streaming.api.java.dstream.DStream)
-and [JavaPairDStream](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaPairDStream).
+see [DStream](api/scala/index.html#org.apache.spark.streaming.dstream.DStream)
+and [PairDStreamFunctions](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions).
+For the Java API, see [JavaDStream](api/scala/index.html#org.apache.spark.streaming.api.java.dstream.DStream)
+and [JavaPairDStream](api/scala/index.html#org.apache.spark.streaming.api.java.JavaPairDStream).
Specifically for the Java API, see [Spark's Java programming guide](java-programming-guide.html)
for more information.
@@ -881,7 +881,7 @@ Cluster resources may be under-utilized if the number of parallel tasks used in a
computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
parallelism as an argument (see the
-[`PairDStreamFunctions`](api/streaming/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
+[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
documentation), or set the [config property](configuration.html#spark-properties)
`spark.default.parallelism` to change the default.
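
A hedged sketch of both options (the partition count, window, and slide durations are placeholder values, and `pairs` is assumed to be a `DStream[(String, Int)]` as in the earlier example):

{% highlight scala %}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds

// Pass the level of parallelism directly to the reduce operation.
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10), 16)

// Or change the default for all such operations via the config property.
val conf = new SparkConf().set("spark.default.parallelism", "16")
{% endhighlight %}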
@@ -925,7 +925,7 @@ A good approach to figure out the right batch size for your application is to te
conservative batch size (say, 5-10 seconds) and a low data rate. To verify whether the system
is able to keep up with the data rate, you can check the value of the end-to-end delay experienced
by each processed batch (either look for "Total delay" in Spark driver log4j logs, or use the
-[StreamingListener](api/streaming/index.html#org.apache.spark.streaming.scheduler.StreamingListener)
+[StreamingListener](api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener)
interface).
If the delay is maintained to be comparable to the batch size, then the system is stable. Otherwise,
if the delay is continuously increasing, it means that the system is unable to keep up and it
@@ -952,7 +952,7 @@ exception saying so.
## Monitoring
Besides Spark's in-built [monitoring capabilities](monitoring.html),
the progress of a Spark Streaming program can also be monitored using the [StreamingListener]
-(api/streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
+(api/scala/index.html#org.apache.spark.scheduler.StreamingListener) interface,
which allows you to get statistics of batch processing times, queueing delays,
and total end-to-end delays. Note that this is still an experimental API and it is likely to be
improved upon (i.e., more information reported) in the future.
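
A minimal listener sketch along these lines (the class name and the `println` reporting are illustrative, not part of the API):

{% highlight scala %}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs the total end-to-end delay of every completed batch.
class DelayLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    val info = batchCompleted.batchInfo
    println("Batch at " + info.batchTime + ": total delay = " +
      info.totalDelay.getOrElse(-1L) + " ms")
  }
}

ssc.addStreamingListener(new DelayLogger)
{% endhighlight %}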
@@ -965,9 +965,9 @@ in Spark Streaming applications and achieving more consistent batch processing t
* **Default persistence level of DStreams**: Unlike RDDs, the default persistence level of DStreams
serializes the data in memory (that is,
-[StorageLevel.MEMORY_ONLY_SER](api/core/index.html#org.apache.spark.storage.StorageLevel$) for
+[StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for
DStream compared to
-[StorageLevel.MEMORY_ONLY](api/core/index.html#org.apache.spark.storage.StorageLevel$) for RDDs).
+[StorageLevel.MEMORY_ONLY](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for RDDs).
Even though keeping the data serialized incurs higher serialization/deserialization overheads,
it significantly reduces GC pauses.
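
If the serialization overhead ever matters more than GC pauses, the level can be overridden per DStream; a sketch (assuming `wordCounts` from the earlier example):

{% highlight scala %}
import org.apache.spark.storage.StorageLevel

// Explicitly keep this DStream's data deserialized in memory instead of the
// default MEMORY_ONLY_SER.
wordCounts.persist(StorageLevel.MEMORY_ONLY)
{% endhighlight %}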
@@ -1244,15 +1244,15 @@ and output 30 after recovery.
# Where to Go from Here
* API documentation
- - Main docs of StreamingContext and DStreams in [Scala](api/streaming/index.html#org.apache.spark.streaming.package)
- and [Java](api/streaming/index.html#org.apache.spark.streaming.api.java.package)
+ - Main docs of StreamingContext and DStreams in [Scala](api/scala/index.html#org.apache.spark.streaming.package)
+ and [Java](api/scala/index.html#org.apache.spark.streaming.api.java.package)
- Additional docs for
- [Kafka](api/external/kafka/index.html#org.apache.spark.streaming.kafka.KafkaUtils$),
- [Flume](api/external/flume/index.html#org.apache.spark.streaming.flume.FlumeUtils$),
- [Twitter](api/external/twitter/index.html#org.apache.spark.streaming.twitter.TwitterUtils$),
- [ZeroMQ](api/external/zeromq/index.html#org.apache.spark.streaming.zeromq.ZeroMQUtils$), and
- [MQTT](api/external/mqtt/index.html#org.apache.spark.streaming.mqtt.MQTTUtils$)
+ [Kafka](api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$),
+ [Flume](api/scala/index.html#org.apache.spark.streaming.flume.FlumeUtils$),
+ [Twitter](api/scala/index.html#org.apache.spark.streaming.twitter.TwitterUtils$),
+ [ZeroMQ](api/scala/index.html#org.apache.spark.streaming.zeromq.ZeroMQUtils$), and
+ [MQTT](api/scala/index.html#org.apache.spark.streaming.mqtt.MQTTUtils$)
* More examples in [Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/streaming/examples)
and [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/streaming/examples)
-* [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) describing Spark Streaming
+* [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) describing Spark Streaming.