aboutsummaryrefslogtreecommitdiff
path: root/docs/streaming-kinesis-integration.md
diff options
context:
space:
mode:
authorXin Ren <iamshrek@126.com>2016-07-11 18:09:14 -0700
committerTathagata Das <tathagata.das1565@gmail.com>2016-07-11 18:09:14 -0700
commit05d7151ccbccdd977ec2f2301d5b12566018c988 (patch)
tree1e0a4c46e64ebcbb0e32794dda9a487383c4b858 /docs/streaming-kinesis-integration.md
parent9e2c763dbb5ac6fc5d2eb0759402504d4b9073a4 (diff)
downloadspark-05d7151ccbccdd977ec2f2301d5b12566018c988.tar.gz
spark-05d7151ccbccdd977ec2f2301d5b12566018c988.tar.bz2
spark-05d7151ccbccdd977ec2f2301d5b12566018c988.zip
[MINOR][STREAMING][DOCS] Minor changes on kinesis integration
## What changes were proposed in this pull request? Some minor changes for documentation page "Spark Streaming + Kinesis Integration". Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets. ## How was this patch tested? Tested manually, on my local machine. Author: Xin Ren <iamshrek@126.com> Closes #14097 from keypointt/kinesisDoc.
Diffstat (limited to 'docs/streaming-kinesis-integration.md')
-rw-r--r--docs/streaming-kinesis-integration.md26
1 files changed, 13 insertions, 13 deletions
diff --git a/docs/streaming-kinesis-integration.md b/docs/streaming-kinesis-integration.md
index 5b9a7554d2..96198ddf53 100644
--- a/docs/streaming-kinesis-integration.md
+++ b/docs/streaming-kinesis-integration.md
@@ -111,7 +111,7 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
- `[checkpoint interval]`: The interval (e.g., Duration(2000) = 2 seconds) at which the Kinesis Client Library saves its position in the stream. For starters, set it to the same as the batch interval of the streaming application.
- - `[initial position]`: Can be either `InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see Kinesis Checkpointing section and Amazon Kinesis API documentation for more details).
+ - `[initial position]`: Can be either `InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see [`Kinesis Checkpointing`](#kinesis-checkpointing) section and [`Amazon Kinesis API documentation`](http://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html) for more details).
- `[message handler]`: A function that takes a Kinesis `Record` and outputs generic `T`.
@@ -128,14 +128,6 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
Alternatively, you can also download the JAR of the Maven artifact `spark-streaming-kinesis-asl-assembly` from the
[Maven repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kinesis-asl-assembly_{{site.SCALA_BINARY_VERSION}}%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22) and add it to `spark-submit` with `--jars`.
- *Points to remember at runtime:*
-
- - Kinesis data processing is ordered per partition and occurs at-least once per message.
-
- - Multiple applications can read from the same Kinesis stream. Kinesis will maintain the application-specific shard and checkpoint info in DynamoDB.
-
- - A single Kinesis stream shard is processed by one input DStream at a time.
-
<p style="text-align: center;">
<img src="img/streaming-kinesis-arch.png"
title="Spark Streaming Kinesis Architecture"
@@ -145,6 +137,14 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
+ *Points to remember at runtime:*
+
+ - Kinesis data processing is ordered per partition and occurs at-least once per message.
+
+ - Multiple applications can read from the same Kinesis stream. Kinesis will maintain the application-specific shard and checkpoint info in DynamoDB.
+
+ - A single Kinesis stream shard is processed by one input DStream at a time.
+
- A single Kinesis input DStream can read from multiple shards of a Kinesis stream by creating multiple KinesisRecordProcessor threads.
- Multiple input DStreams running in separate processes/instances can read from a Kinesis stream.
@@ -173,7 +173,7 @@ To run the example,
- Set up Kinesis stream (see earlier section) within AWS. Note the name of the Kinesis stream and the endpoint URL corresponding to the region where the stream was created.
-- Set up the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_KEY with your AWS credentials.
+- Set up the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_KEY` with your AWS credentials.
- In the Spark root directory, run the example as
@@ -216,6 +216,6 @@ de-aggregate records during consumption.
- Checkpointing too frequently will cause excess load on the AWS checkpoint storage layer and may lead to AWS throttling. The provided example handles this throttling with a random-backoff-retry strategy.
-- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (InitialPositionInStream.TRIM_HORIZON) or from the latest tip (InitialPositionInStream.LATEST). This is configurable.
-- InitialPositionInStream.LATEST could lead to missed records if data is added to the stream while no input DStreams are running (and no checkpoint info is being stored).
-- InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing of records where the impact is dependent on checkpoint frequency and processing idempotency.
+- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (`InitialPositionInStream.TRIM_HORIZON`) or from the latest tip (`InitialPositionInStream.LATEST`). This is configurable.
+ - `InitialPositionInStream.LATEST` could lead to missed records if data is added to the stream while no input DStreams are running (and no checkpoint info is being stored).
+ - `InitialPositionInStream.TRIM_HORIZON` may lead to duplicate processing of records where the impact is dependent on checkpoint frequency and processing idempotency.