author    uncleGen <hustyugm@gmail.com>  2017-01-24 11:32:11 +0000
committer Sean Owen <sowen@cloudera.com>  2017-01-24 11:32:11 +0000
commit    7c61c2a1c40629311b84dff8d91b257efb345d07 (patch)
tree      01c01629df495d870228e79496b74b55ede520b7 /docs/streaming-kafka-0-10-integration.md
parent    f27e024768e328b96704a9ef35b77381da480328 (diff)
[DOCS] Fix typo in docs
## What changes were proposed in this pull request?

Fix typo in docs

## How was this patch tested?

Author: uncleGen <hustyugm@gmail.com>

Closes #16658 from uncleGen/typo-issue.
Diffstat (limited to 'docs/streaming-kafka-0-10-integration.md')
-rw-r--r--  docs/streaming-kafka-0-10-integration.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/streaming-kafka-0-10-integration.md b/docs/streaming-kafka-0-10-integration.md
index b645d3c3a4..6ef54ac210 100644
--- a/docs/streaming-kafka-0-10-integration.md
+++ b/docs/streaming-kafka-0-10-integration.md
@@ -183,7 +183,7 @@ stream.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, String>>>() {
Note that the typecast to `HasOffsetRanges` will only succeed if it is done in the first method called on the result of `createDirectStream`, not later down a chain of methods. Be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window().
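For concreteness, here is a minimal Scala sketch of that constraint, assuming `stream` is the DStream returned by `KafkaUtils.createDirectStream`: the cast to `HasOffsetRanges` is done on the RDD passed directly to `foreachRDD`, before any transformation that could shuffle it.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // The cast succeeds here because rdd is still the original Kafka RDD.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { _ =>
    // One-to-one mapping: the partition id indexes into the offset ranges.
    val o = offsetRanges(TaskContext.get.partitionId)
    println(s"${o.topic} ${o.partition} offsets ${o.fromOffset} to ${o.untilOffset}")
  }
}
```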
### Storing Offsets
-Kafka delivery semantics in the case of failure depend on how and when offsets are stored. Spark output operations are [at-least-once](streaming-programming-guide.html#semantics-of-output-operations). So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output, or store offsets in an atomic transaction alongside output. With this integration, you have 3 options, in order of increasing reliablity (and code complexity), for how to store offsets.
+Kafka delivery semantics in the case of failure depend on how and when offsets are stored. Spark output operations are [at-least-once](streaming-programming-guide.html#semantics-of-output-operations). So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output, or store offsets in an atomic transaction alongside output. With this integration, you have 3 options, in order of increasing reliability (and code complexity), for how to store offsets.
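As one illustration of "store offsets after an idempotent output", the following sketch commits the processed ranges back to Kafka with `commitAsync` once the output has completed; `stream` is again assumed to come from `createDirectStream`, and because the commit is not atomic with the output, the output itself must remain idempotent.

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the results to an idempotent sink here ...
  // Commit only after the output succeeds, so a failure replays the batch.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```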
#### Checkpoints
If you enable Spark [checkpointing](streaming-programming-guide.html#checkpointing), offsets will be stored in the checkpoint. This is easy to enable, but there are drawbacks. Your output operation must be idempotent, since you will get repeated outputs; transactions are not an option. Furthermore, you cannot recover from a checkpoint if your application code has changed. For planned upgrades, you can mitigate this by running the new code at the same time as the old code (since outputs need to be idempotent anyway, they should not clash). But for unplanned failures that require code changes, you will lose data unless you have another way to identify known good starting offsets.
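A minimal sketch of the checkpoint option, assuming a hypothetical checkpoint directory and app name; `StreamingContext.getOrCreate` recovers the stored offsets on restart, subject to the code-change caveat above.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/spark-checkpoint" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("KafkaOffsetsViaCheckpoint")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir) // offsets are written here each batch
  // ... build the Kafka direct stream and an idempotent output here ...
  ssc
}

// Reuse the checkpointed context if one exists, otherwise build it fresh.
// Recovery fails if the application code has changed since the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
ssc.start()
ssc.awaitTermination()
```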