author     Tathagata Das <tathagata.das1565@gmail.com>  2013-02-24 16:24:52 -0800
committer  Tathagata Das <tathagata.das1565@gmail.com>  2013-02-24 16:24:52 -0800
commit     5ab37be9831e8a70b2502b14aed1c87cb002a189 (patch)
tree       52f37dddce0179a41a7855248d970e7fe6513719 /docs
parent     28f8b721f65fc8e699f208c5dc64d90822a85d91 (diff)
Fixed class paths and dependencies based on Matei's comments.
Diffstat (limited to 'docs')
-rw-r--r--  docs/custom-streaming-receiver.md (renamed from docs/plugin-custom-receiver.md)    0
-rw-r--r--  docs/streaming-custom-receivers.md                                               101
-rw-r--r--  docs/streaming-programming-guide.md                                                6
3 files changed, 104 insertions, 3 deletions
diff --git a/docs/plugin-custom-receiver.md b/docs/custom-streaming-receiver.md
index 0eb4246158..0eb4246158 100644
--- a/docs/plugin-custom-receiver.md
+++ b/docs/custom-streaming-receiver.md
diff --git a/docs/streaming-custom-receivers.md b/docs/streaming-custom-receivers.md
new file mode 100644
index 0000000000..0eb4246158
--- /dev/null
+++ b/docs/streaming-custom-receivers.md
@@ -0,0 +1,101 @@
+---
+layout: global
+title: Tutorial - Spark Streaming, Plugging in a Custom Receiver
+---
+
+A Spark Streaming receiver can consume data from many kinds of sources: a simple network stream, streams of messages from a message queue, files, and so on. A receiver can also take on roles beyond just receiving data, such as filtering and preprocessing, to name a few of the possibilities. Spark Streaming therefore provides an API for plugging in user-defined custom receivers, so that a receiver can be tailored to one's specific needs.
+
+This guide shows the programming model and features by walking through a simple sample receiver and corresponding Spark Streaming application.
+
+
+## A quick and naive walk-through
+
+### Write a simple receiver
+
+A custom receiver starts with implementing an Akka [Actor](#References).
+
+Following is a simple socket text-stream receiver, deliberately over-simplified by using Akka's socket IO API.
+
+{% highlight scala %}
+
+    class SocketTextStreamReceiver(host: String,
+        port: Int,
+        bytesToString: ByteString => String) extends Actor with Receiver {
+
+      // Connect to host:port over a socket when the actor starts.
+      override def preStart = IOManager(context.system).connect(host, port)
+
+      // Each chunk read from the socket is converted to a String and pushed as a block.
+      def receive = {
+        case IO.Read(socket, bytes) => pushBlock(bytesToString(bytes))
+      }
+
+    }
+
+
+{% endhighlight %}
+
+All we did here is mix in the `Receiver` trait and call the `pushBlock` API method to push our blocks of data. Please refer to the Scaladoc of `Receiver` for more details.
+
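+The same pattern works for sources other than sockets. As a purely illustrative sketch (the class name below is made up for this guide and is not part of Spark), a receiver can push whatever strings some upstream actor sends to it:
+
+{% highlight scala %}
+
+    // Illustrative only: every String message received from another actor is pushed as a block.
+    class MessageForwardingReceiver extends Actor with Receiver {
+
+      def receive = {
+        case data: String => pushBlock(data)
+      }
+
+    }
+
+{% endhighlight %}
+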
+### A sample Spark application
+
+* First, create a Spark Streaming context with the master URL and batch duration.
+
+{% highlight scala %}
+
+    val ssc = new StreamingContext(master, "WordCountCustomStreamSource",
+      Seconds(batchDuration))
+
+{% endhighlight %}
+
+* Plug the actor configuration into the Spark Streaming context and create a DStream.
+
+{% highlight scala %}
+
+    val lines = ssc.actorStream[String](Props(new SocketTextStreamReceiver(
+      "localhost", 8445, z => z.utf8String)), "SocketReceiver")
+
+{% endhighlight %}
+
+* Process it.
+
+{% highlight scala %}
+
+    val words = lines.flatMap(_.split(" "))
+    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
+
+    wordCounts.print()
+    ssc.start()
+
+
+{% endhighlight %}
+
+* Once the application is running, the stream can be tested by sending data with the netcat utility.
+
+    $ nc -l localhost 8445
+    hello world
+    hello hello
+
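+Putting the pieces together, a complete driver program might look roughly like the sketch below. The import paths, the `local[2]` master URL, and the 2-second batch interval are illustrative assumptions for this Spark and Akka (2.0.x) version, not something this guide prescribes.
+
+{% highlight scala %}
+
+    // Assumed imports; adjust to your setup. The receiver class defined earlier
+    // additionally needs akka.actor.{Actor, IO, IOManager}, akka.util.ByteString
+    // and the Spark Streaming Receiver trait.
+    import akka.actor.Props
+    import spark.streaming.{Seconds, StreamingContext}
+    import spark.streaming.StreamingContext._
+
+    object WordCountCustomStreamSource {
+      def main(args: Array[String]) {
+        // "local[2]" and a 2-second batch interval are example values only.
+        val ssc = new StreamingContext("local[2]", "WordCountCustomStreamSource", Seconds(2))
+
+        // SocketTextStreamReceiver is the actor receiver defined earlier in this guide.
+        val lines = ssc.actorStream[String](Props(new SocketTextStreamReceiver(
+          "localhost", 8445, z => z.utf8String)), "SocketReceiver")
+
+        val words = lines.flatMap(_.split(" "))
+        val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
+
+        wordCounts.print()
+        ssc.start()
+      }
+    }
+
+{% endhighlight %}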
+
+## Multiple homogeneous/heterogeneous receivers
+
+A DStream `union` operation is provided for taking the union of multiple input streams.
+
+{% highlight scala %}
+
+    val lines = ssc.actorStream[String](Props(new SocketTextStreamReceiver(
+      "localhost", 8445, z => z.utf8String)), "SocketReceiver")
+
+    // Another socket stream receiver; note that each receiver actor needs a unique name.
+    val lines2 = ssc.actorStream[String](Props(new SocketTextStreamReceiver(
+      "localhost", 8446, z => z.utf8String)), "SocketReceiver2")
+
+    val union = lines.union(lines2)
+
+{% endhighlight %}
+
+The above stream can be processed just as described earlier.
+
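+For example, the earlier word count can be applied to the unioned stream unchanged (a sketch reusing the same variable names):
+
+{% highlight scala %}
+
+    // `union` is just another DStream, so the same transformations apply.
+    val words = union.flatMap(_.split(" "))
+    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
+
+    wordCounts.print()
+    ssc.start()
+
+{% endhighlight %}
+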
+_A more comprehensive example is provided in the Spark Streaming examples._
+
+## References
+
+1. [Akka Actor documentation](http://doc.akka.io/docs/akka/2.0.5/scala/actors.html)
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index ded43e67cd..0e618a06c7 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -365,14 +365,14 @@ There are two failure behaviors based on which input sources are used.
Since all data is modeled as RDDs with their lineage of deterministic operations, any recomputation always leads to the same result. As a result, all DStream transformations are guaranteed to have _exactly-once_ semantics. That is, the final transformed result will be the same even if there was a worker node failure. However, output operations (like `foreach`) have _at-least once_ semantics, that is, the transformed data may get written to an external entity more than once in the event of a worker failure. While this is acceptable for saving to HDFS using the `saveAs*Files` operations (as the file will simply get over-written by the same data), additional transaction-like mechanisms may be necessary to achieve exactly-once semantics for output operations.
-## Failure of a Driver Node
-A system that is required to operate 24/7 needs to be able tolerate the failure of the drive node as well. Spark Streaming does this by saving the state of the DStream computation periodically to a HDFS file, that can be used to restart the streaming computation in the event of a failure of the driver node. To elaborate, the following state is periodically saved to a file.
+## Failure of the Driver Node
+A system that is required to operate 24/7 needs to be able to tolerate the failure of the driver node as well. Spark Streaming does this by periodically saving the state of the DStream computation to an HDFS file, which can be used to restart the streaming computation in the event of a failure of the driver node. This checkpointing is enabled by setting an HDFS directory for checkpointing using `ssc.checkpoint(<checkpoint directory>)` as described [earlier](#rdd-checkpointing-within-dstreams). To elaborate, the following state is periodically saved to a file.
1. The DStream operator graph (input streams, output streams, etc.)
1. The configuration of each DStream (checkpoint interval, etc.)
1. The RDD checkpoint files of each DStream
-All this is periodically saved in the file `<checkpoint directory>/graph` where `<checkpoint directory>` is the HDFS path set using `ssc.checkpoint(...)` as described earlier. To recover, a new Streaming Context can be created with this directory by using
+All this is periodically saved in the file `<checkpoint directory>/graph`. To recover, a new StreamingContext can be created with this directory by using
{% highlight scala %}
val ssc = new StreamingContext(checkpointDirectory)