[SPARK-2419][Streaming][Docs] Updates to the streaming programming guide

Updated the main streaming programming guide, and also added source-specific guides for Kafka, Flume, Kinesis. Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Jacek Laskowski <jacek@japila.pl> Closes #2254 from tdas/streaming-doc-fix and squashes the following commits: e45c6d7 [Jacek Laskowski] More fixes from an old PR 5125316 [Tathagata Das] Fixed links dc02f26 [Tathagata Das] Refactored streaming kinesis guide and made many other changes. acbc3e3 [Tathagata Das] Fixed links between streaming guides. cb7007f [Tathagata Das] Added Streaming + Flume integration guide. 9bd9407 [Tathagata Das] Updated streaming programming guide with additional information from SPARK-2419.
author: Tathagata Das <tathagata.das1565@gmail.com> 2014-09-03 17:38:01 -0700
committer: Tathagata Das <tathagata.das1565@gmail.com> 2014-09-03 17:38:01 -0700
commit: a5224079286d1777864cf9fa77330aadae10cd7b (patch)
tree: b44c8672b86a6b38769b62484772c6f237c39480 /docs/streaming-kafka-integration.md
parent: 996b7434ee0d0c7c26987eb9cf050c139fdd2db2 (diff)
download: spark-a5224079286d1777864cf9fa77330aadae10cd7b.tar.gz
spark-a5224079286d1777864cf9fa77330aadae10cd7b.tar.bz2
spark-a5224079286d1777864cf9fa77330aadae10cd7b.zip
1 files changed, 42 insertions, 0 deletions
diff --git a/docs/streaming-kafka-integration.md b/docs/streaming-kafka-integration.md
new file mode 100644
index 0000000000..a3b705d4c3
--- /dev/null
+++ b/docs/streaming-kafka-integration.md
@@ -0,0 +1,42 @@
+---
+layout: global
+title: Spark Streaming + Kafka Integration Guide
+---
+[Apache Kafka](http://kafka.apache.org/) is publish-subscribe messaging rethought as a distributed, partitioned, replicated commit log service.  Here we explain how to configure Spark Streaming to receive data from Kafka.
+
+1. **Linking:** In your SBT/Maven projrect definition, link your streaming application against the following artifact (see [Linking section](streaming-programming-guide.html#linking) in the main programming guide for further information).
+
+		groupId = org.apache.spark
+		artifactId = spark-streaming-kafka_{{site.SCALA_BINARY_VERSION}}
+		version = {{site.SPARK_VERSION_SHORT}}
+
+2. **Programming:** In the streaming application code, import `KafkaUtils` and create input DStream as follows.
+
+	<div class="codetabs">
+	<div data-lang="scala" markdown="1">
+		import org.apache.spark.streaming.kafka._
+
+		val kafkaStream = KafkaUtils.createStream(
+        	streamingContext, [zookeeperQuorum], [group id of the consumer], [per-topic number of Kafka partitions to consume])
+
+	See the [API docs](api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$)
+	and the [example]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala).
+	</div>
+	<div data-lang="java" markdown="1">
+		import org.apache.spark.streaming.kafka.*;
+
+		JavaPairReceiverInputDStream<String, String> kafkaStream = KafkaUtils.createStream(
+        	streamingContext, [zookeeperQuorum], [group id of the consumer], [per-topic number of Kafka partitions to consume]);
+
+	See the [API docs](api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html)
+	and the [example]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java).
+	</div>
+	</div>
+
+	*Points to remember:*
+
+	- Topic partitions in Kafka does not correlate to partitions of RDDs generated in Spark Streaming. So increasing the number of topic-specific partitions in the `KafkaUtils.createStream()` only increases the number of threads using which topics that are consumed within a single receiver. It does not increase the parallelism of Spark in processing the data. Refer to the main document for more information on that.
+
+	- Multiple Kafka input DStreams can be created with different groups and topics for parallel receiving of data using multiple receivers.
+
+3. **Deploying:** Package `spark-streaming-kafka_{{site.SCALA_BINARY_VERSION}}` and its dependencies (except `spark-core_{{site.SCALA_BINARY_VERSION}}` and `spark-streaming_{{site.SCALA_BINARY_VERSION}}` which are provided by `spark-submit`) into the application JAR. Then use `spark-submit` to launch your application (see [Deploying section](streaming-programming-guide.html#deploying-applications) in the main programming guide).
author	Tathagata Das <tathagata.das1565@gmail.com>	2014-09-03 17:38:01 -0700
committer	Tathagata Das <tathagata.das1565@gmail.com>	2014-09-03 17:38:01 -0700
commit	a5224079286d1777864cf9fa77330aadae10cd7b (patch)
tree	b44c8672b86a6b38769b62484772c6f237c39480 /docs/streaming-kafka-integration.md
parent	996b7434ee0d0c7c26987eb9cf050c139fdd2db2 (diff)
download	spark-a5224079286d1777864cf9fa77330aadae10cd7b.tar.gz spark-a5224079286d1777864cf9fa77330aadae10cd7b.tar.bz2 spark-a5224079286d1777864cf9fa77330aadae10cd7b.zip