author     cody koeninger <cody@koeninger.org>  2016-10-21 15:55:04 -0700
committer  Shixiong Zhu <shixiong@databricks.com>  2016-10-21 15:55:04 -0700
commit     268ccb9a48dfefc4d7bc85155e7e20a2dfe89307 (patch)
tree       799a06a1de6b8767824654ecb2309da8e4f2543a /docs
parent     140570252fd3739d6bdcadd6d4d5a180e480d3e0 (diff)
[SPARK-17812][SQL][KAFKA] Assign and specific startingOffsets for structured stream
## What changes were proposed in this pull request?

startingOffsets takes specific per-topicpartition offsets as a json argument, usable with any consumer strategy

assign with specific topicpartitions as a consumer strategy

## How was this patch tested?

Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #15504 from koeninger/SPARK-17812.
Diffstat (limited to 'docs')
-rw-r--r--  docs/structured-streaming-kafka-integration.md | 38
1 file changed, 26 insertions, 12 deletions
diff --git a/docs/structured-streaming-kafka-integration.md b/docs/structured-streaming-kafka-integration.md
index 668489addf..e851f210c9 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -151,15 +151,24 @@ The following options must be set for the Kafka source.
<table class="table">
<tr><th>Option</th><th>Value</th><th>Meaning</th></tr>
<tr>
+ <td>assign</td>
+ <td>json string {"topicA":[0,1],"topicB":[2,4]}</td>
+ <td>Specific TopicPartitions to consume.
+ Only one of "assign", "subscribe" or "subscribePattern"
+ options can be specified for Kafka source.</td>
+</tr>
+<tr>
<td>subscribe</td>
<td>A comma-separated list of topics</td>
- <td>The topic list to subscribe. Only one of "subscribe" and "subscribePattern" options can be
- specified for Kafka source.</td>
+  <td>The topic list to subscribe to.
+ Only one of "assign", "subscribe" or "subscribePattern"
+ options can be specified for Kafka source.</td>
</tr>
<tr>
<td>subscribePattern</td>
<td>Java regex string</td>
- <td>The pattern used to subscribe the topic. Only one of "subscribe" and "subscribePattern"
+ <td>The pattern used to subscribe to topic(s).
+ Only one of "assign, "subscribe" or "subscribePattern"
options can be specified for Kafka source.</td>
</tr>
<tr>
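The three rows above describe mutually exclusive consumer strategies: setting more than one of "assign", "subscribe", or "subscribePattern" on the same source is an error. A minimal Scala sketch of how each strategy would be passed to the Kafka source; the broker address, application name, and topic names here are hypothetical, not taken from this patch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("KafkaSourceStrategies").getOrCreate()

// Strategy 1: "assign" -- consume an explicit set of TopicPartitions,
// given as the JSON string format documented above.
val byAssign = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("assign", """{"topicA":[0,1],"topicB":[2,4]}""")
  .load()

// Strategy 2: "subscribe" -- a comma-separated list of topics.
val bySubscribe = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topicA,topicB")
  .load()

// Strategy 3: "subscribePattern" -- a Java regex over topic names.
val byPattern = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribePattern", "topic.*")
  .load()
```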
@@ -174,16 +183,21 @@ The following configurations are optional:
<table class="table">
<tr><th>Option</th><th>Value</th><th>Default</th><th>Meaning</th></tr>
<tr>
- <td>startingOffset</td>
- <td>["earliest", "latest"]</td>
- <td>"latest"</td>
-  <td>The start point when a query is started, either "earliest" which is from the earliest offset,
-  or "latest" which is just from the latest offset. Note: This only applies when a new Streaming
-  query is started, and that resuming will always pick up from where the query left off.</td>
+ <td>startingOffsets</td>
+ <td>earliest, latest, or json string
+ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}
+ </td>
+ <td>latest</td>
+ <td>The start point when a query is started, either "earliest" which is from the earliest offsets,
+ "latest" which is just from the latest offsets, or a json string specifying a starting offset for
+ each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest.
+  Note: This only applies when a new Streaming query is started; resuming will always pick
+  up from where the query left off. Newly discovered partitions during a query will start at
+ earliest.</td>
</tr>
<tr>
<td>failOnDataLoss</td>
- <td>[true, false]</td>
+ <td>true or false</td>
<td>true</td>
<td>Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or
offsets are out of range). This may be a false alarm. You can disable it when it doesn't work
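A minimal sketch of the new `startingOffsets` JSON form, reusing the example payload from the table above (the broker address and topic names are hypothetical):

```scala
// Assumes the SparkSession `spark` from the previous sketch.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topicA,topicB")
  // topicA: partition 0 starts at offset 23, partition 1 at latest (-1);
  // topicB: partition 0 starts at earliest (-2).
  .option("startingOffsets", """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}""")
  .option("failOnDataLoss", "false") // optional; defaults to true
  .load()
```

As the table notes, the JSON only shapes a freshly started query; a resumed query always picks up from its checkpointed offsets, and partitions discovered mid-query start at earliest.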
@@ -215,10 +229,10 @@ Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.
Note that the following Kafka params cannot be set and the Kafka source will throw an exception:
- **group.id**: Kafka source will create a unique group id for each query automatically.
-- **auto.offset.reset**: Set the source option `startingOffset` to `earliest` or `latest` to specify
+- **auto.offset.reset**: Set the source option `startingOffsets` to specify
where to start instead. Structured Streaming manages which offsets are consumed internally, rather
than rely on the Kafka consumer to do it. This will ensure that no data is missed when new
- topics/partitions are dynamically subscribed. Note that `startingOffset` only applies when a new
+ topics/partitions are dynamically subscribed. Note that `startingOffsets` only applies when a new
Streaming query is started, and that resuming will always pick up from where the query left off.
- **key.deserializer**: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use
DataFrame operations to explicitly deserialize the keys.
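To round out the notes above, a hedged sketch of the `kafka.` prefix mechanism and of deserializing keys and values with DataFrame operations (the `fetch.min.bytes` value is illustrative, not from this patch):

```scala
// Configs prefixed with "kafka." are forwarded to the underlying consumer;
// group.id, auto.offset.reset, and the key/value deserializers must not be set.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topicA")
  .option("kafka.fetch.min.bytes", "1048576") // reaches the consumer as fetch.min.bytes
  .load()

// Keys and values arrive as binary; deserialize them explicitly.
val strings = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```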