path: root/docs/streaming-programming-guide.md
author     Matei Zaharia <matei@databricks.com>    2014-04-21 21:57:40 -0700
committer  Patrick Wendell <pwendell@gmail.com>    2014-04-21 21:57:40 -0700
commit  fc7838470465474f777bd17791c1bb5f9c348521 (patch)
tree    6809dfd66ebafa6dced2018585a3a1f9ba270d53 /docs/streaming-programming-guide.md
parent  04c37b6f749dc2418cc28c89964cdc687dfcbd51 (diff)
[SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs
I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and to generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages from the Javadoc; there is an SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and the like, so we may want to look into that. We may decide not to post these right now if they're too limited compared to the Scala docs.

Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/

Author: Matei Zaharia <matei@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <pwendell@gmail.com>

Closes #457 from mateiz/better-docs and squashes the following commits:

a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package
5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load
f05abc0 [Matei Zaharia] Don't include java.lang package names
995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc
a14a93c [Matei Zaharia] typo
76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java
ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs
acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced
Diffstat (limited to 'docs/streaming-programming-guide.md')
-rw-r--r--  docs/streaming-programming-guide.md | 56
1 file changed, 28 insertions(+), 28 deletions(-)
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index f9904d4501..946d6c4879 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -40,7 +40,7 @@ Spark Streaming provides a high-level abstraction called *discretized stream* or
which represents a continuous stream of data. DStreams can be created either from input data
streams from sources such as Kafka and Flume, or by applying high-level
operations on other DStreams. Internally, a DStream is represented as a sequence of
-[RDDs](api/core/index.html#org.apache.spark.rdd.RDD).
+[RDDs](api/scala/index.html#org.apache.spark.rdd.RDD).
This guide shows you how to start writing Spark Streaming programs with DStreams. You can
write Spark Streaming programs in Scala or Java, both of which are presented in this guide. You
@@ -62,7 +62,7 @@ First, we import the names of the Spark Streaming classes, and some implicit
conversions from StreamingContext into our environment, to add useful methods to
other classes we need (like DStream).
-[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) is the
+[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) is the
main entry point for all streaming functionality.
{% highlight scala %}
@@ -71,7 +71,7 @@ import org.apache.spark.streaming.StreamingContext._
{% endhighlight %}
Then we create a
-[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) object.
+[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) object.
Besides Spark's configuration, we specify that any DStream will be processed
in 1 second batches.
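Concretely, this setup looks something like the following minimal sketch (the master URL and application name are placeholders; `local[2]` is used so that one thread can receive data while another processes it):

{% highlight scala %}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// Placeholder master URL and app name; batches are formed every second.
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
{% endhighlight %}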
@@ -132,7 +132,7 @@ The complete code can be found in the Spark Streaming example
<div data-lang="java" markdown="1">
First, we create a
-[JavaStreamingContext](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) object,
+[JavaStreamingContext](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) object,
which is the main entry point for all streaming
functionality. Besides Spark's configuration, we specify that any DStream will be processed
in 1 second batches.
@@ -168,7 +168,7 @@ JavaDStream<String> words = lines.flatMap(
generating multiple new records from each record in the source DStream. In this case,
each line will be split into multiple words and the stream of words is represented as the
`words` DStream. Note that we defined the transformation using a
-[FlatMapFunction](api/core/index.html#org.apache.spark.api.java.function.FlatMapFunction) object.
+[FlatMapFunction](api/scala/index.html#org.apache.spark.api.java.function.FlatMapFunction) object.
As we will discover along the way, there are a number of such convenience classes in the Java API
that help define DStream transformations.
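For comparison, the equivalent Scala transformation needs no helper class; a rough sketch, assuming a socket source on a placeholder host and port:

{% highlight scala %}
// Each line is split on spaces; flatMap emits zero or more words per line.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
{% endhighlight %}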
@@ -192,9 +192,9 @@ wordCounts.print(); // Print a few of the counts to the console
{% endhighlight %}
The `words` DStream is further mapped (one-to-one transformation) to a DStream of `(word,
-1)` pairs, using a [PairFunction](api/core/index.html#org.apache.spark.api.java.function.PairFunction)
+1)` pairs, using a [PairFunction](api/scala/index.html#org.apache.spark.api.java.function.PairFunction)
object. Then, it is reduced to get the frequency of words in each batch of data,
-using a [Function2](api/core/index.html#org.apache.spark.api.java.function.Function2) object.
+using a [Function2](api/scala/index.html#org.apache.spark.api.java.function.Function2) object.
Finally, `wordCounts.print()` will print a few of the counts generated every second.
Note that when these lines are executed, Spark Streaming only sets up the computation it
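The same map and reduce steps in Scala, as a sketch building on the `words` stream above (plain closures take the place of the PairFunction and Function2 helper objects):

{% highlight scala %}
// One-to-one map to (word, 1) pairs, then per-batch sums by key.
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()

ssc.start()             // nothing runs until the context is started
ssc.awaitTermination()  // block until the computation is stopped
{% endhighlight %}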
@@ -333,7 +333,7 @@ for the full list of supported sources and artifacts.
<div data-lang="scala" markdown="1">
To initialize a Spark Streaming program in Scala, a
-[`StreamingContext`](api/streaming/index.html#org.apache.spark.streaming.StreamingContext)
+[`StreamingContext`](api/scala/index.html#org.apache.spark.streaming.StreamingContext)
object has to be created, which is the main entry point of all Spark Streaming functionality.
A `StreamingContext` object can be created by using
@@ -344,7 +344,7 @@ new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])
<div data-lang="java" markdown="1">
To initialize a Spark Streaming program in Java, a
-[`JavaStreamingContext`](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
+[`JavaStreamingContext`](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
object has to be created, which is the main entry point of all Spark Streaming functionality.
A `JavaStreamingContext` object can be created by using
@@ -431,8 +431,8 @@ and process any files created in that directory. Note that
For more details on streams from files, Akka actors and sockets,
see the API documentations of the relevant functions in
-[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) for
-Scala and [JavaStreamingContext](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
+[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) for
+Scala and [JavaStreamingContext](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
for Java.
Additional functionality for creating DStreams from sources such as Kafka, Flume, and Twitter
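As a quick illustration of the basic file and socket sources discussed above (the directory and port below are hypothetical):

{% highlight scala %}
// Watch a directory for newly created files...
val fileLines = ssc.textFileStream("hdfs://namenode:8020/incoming/")
// ...or read newline-delimited text from a TCP socket.
val socketLines = ssc.socketTextStream("localhost", 9999)
{% endhighlight %}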
@@ -802,10 +802,10 @@ output operators are defined:
The complete list of DStream operations is available in the API documentation. For the Scala API,
-see [DStream](api/streaming/index.html#org.apache.spark.streaming.dstream.DStream)
-and [PairDStreamFunctions](api/streaming/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions).
-For the Java API, see [JavaDStream](api/streaming/index.html#org.apache.spark.streaming.api.java.dstream.DStream)
-and [JavaPairDStream](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaPairDStream).
+see [DStream](api/scala/index.html#org.apache.spark.streaming.dstream.DStream)
+and [PairDStreamFunctions](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions).
+For the Java API, see [JavaDStream](api/scala/index.html#org.apache.spark.streaming.api.java.JavaDStream)
+and [JavaPairDStream](api/scala/index.html#org.apache.spark.streaming.api.java.JavaPairDStream).
Specifically for the Java API, see [Spark's Java programming guide](java-programming-guide.html)
for more information.
@@ -881,7 +881,7 @@ Cluster resources may be under-utilized if the number of parallel tasks used in a
computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
parallelism as an argument (see the
-[`PairDStreamFunctions`](api/streaming/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
+[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
documentation), or set the [config property](configuration.html#spark-properties)
`spark.default.parallelism` to change the default.
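For example, the level of parallelism can be passed directly to the reduce; a sketch reusing the `pairs` stream from the quick example (16 is an arbitrary illustrative value, to be tuned to your cluster):

{% highlight scala %}
// Explicit task count for the shuffle instead of the default of 8.
val wordCounts = pairs.reduceByKey(_ + _, 16)
{% endhighlight %}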
@@ -925,7 +925,7 @@ A good approach to figure out the right batch size for your application is to te
conservative batch size (say, 5-10 seconds) and a low data rate. To verify whether the system
is able to keep up with the data rate, you can check the value of the end-to-end delay experienced
by each processed batch (either look for "Total delay" in Spark driver log4j logs, or use the
-[StreamingListener](api/streaming/index.html#org.apache.spark.streaming.scheduler.StreamingListener)
+[StreamingListener](api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener)
interface).
If the delay is maintained to be comparable to the batch size, then the system is stable. Otherwise,
if the delay is continuously increasing, it means that the system is unable to keep up and it
@@ -952,7 +952,7 @@ exception saying so.
## Monitoring
Besides Spark's in-built [monitoring capabilities](monitoring.html),
the progress of a Spark Streaming program can also be monitored using the [StreamingListener]
-(api/streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
+(api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener) interface,
which allows you to get statistics of batch processing times, queueing delays,
and total end-to-end delays. Note that this is still an experimental API and it is likely to be
improved upon (i.e., more information reported) in the future.
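A minimal sketch of such a listener, assuming the `StreamingListener` and `BatchInfo` API described above (the class name `DelayLogger` is made up for illustration):

{% highlight scala %}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs the end-to-end ("total") delay of every completed batch.
class DelayLogger extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted) {
    batch.batchInfo.totalDelay.foreach(d => println(s"Total delay: $d ms"))
  }
}

ssc.addStreamingListener(new DelayLogger)
{% endhighlight %}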
@@ -965,9 +965,9 @@ in Spark Streaming applications and achieving more consistent batch processing t
* **Default persistence level of DStreams**: Unlike RDDs, the default persistence level of DStreams
serializes the data in memory (that is,
-[StorageLevel.MEMORY_ONLY_SER](api/core/index.html#org.apache.spark.storage.StorageLevel$) for
+[StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for
DStream compared to
-[StorageLevel.MEMORY_ONLY](api/core/index.html#org.apache.spark.storage.StorageLevel$) for RDDs).
+[StorageLevel.MEMORY_ONLY](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for RDDs).
Even though keeping the data serialized incurs higher serialization/deserialization overheads,
it significantly reduces GC pauses.
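If deserialized caching is preferred for a particular stream, the default can be overridden per DStream; a sketch, reusing the `wordCounts` stream from earlier:

{% highlight scala %}
import org.apache.spark.storage.StorageLevel

// Trade higher memory use for lower serialization overhead on this stream.
wordCounts.persist(StorageLevel.MEMORY_ONLY)
{% endhighlight %}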
@@ -1244,15 +1244,15 @@ and output 30 after recovery.
# Where to Go from Here
* API documentation
- - Main docs of StreamingContext and DStreams in [Scala](api/streaming/index.html#org.apache.spark.streaming.package)
- and [Java](api/streaming/index.html#org.apache.spark.streaming.api.java.package)
+ - Main docs of StreamingContext and DStreams in [Scala](api/scala/index.html#org.apache.spark.streaming.package)
+ and [Java](api/scala/index.html#org.apache.spark.streaming.api.java.package)
- Additional docs for
- [Kafka](api/external/kafka/index.html#org.apache.spark.streaming.kafka.KafkaUtils$),
- [Flume](api/external/flume/index.html#org.apache.spark.streaming.flume.FlumeUtils$),
- [Twitter](api/external/twitter/index.html#org.apache.spark.streaming.twitter.TwitterUtils$),
- [ZeroMQ](api/external/zeromq/index.html#org.apache.spark.streaming.zeromq.ZeroMQUtils$), and
- [MQTT](api/external/mqtt/index.html#org.apache.spark.streaming.mqtt.MQTTUtils$)
+ [Kafka](api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$),
+ [Flume](api/scala/index.html#org.apache.spark.streaming.flume.FlumeUtils$),
+ [Twitter](api/scala/index.html#org.apache.spark.streaming.twitter.TwitterUtils$),
+ [ZeroMQ](api/scala/index.html#org.apache.spark.streaming.zeromq.ZeroMQUtils$), and
+ [MQTT](api/scala/index.html#org.apache.spark.streaming.mqtt.MQTTUtils$)
* More examples in [Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/streaming/examples)
and [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/streaming/examples)
-* [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) describing Spark Streaming
+* [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) describing Spark Streaming.