---
layout: global
title: Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)
---

Structured Streaming integration for Kafka 0.10 to poll data from Kafka.

### Linking
For Scala/Java applications using SBT/Maven project definitions, link your application with the following artifact:

    groupId = org.apache.spark
    artifactId = spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}
    version = {{site.SPARK_VERSION_SHORT}}

For Python applications, you need to add the above library and its dependencies when deploying your application. See the [Deploying](#deploying) subsection below.

### Creating a Kafka Source Stream
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}

// Subscribe to 1 topic
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to multiple topics
val ds2 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()
ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern
val ds3 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()
ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
{% highlight java %}

// Subscribe to 1 topic
Dataset<Row> ds1 = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load();
ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

// Subscribe to multiple topics
Dataset<Row> ds2 = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load();
ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

// Subscribe to a pattern
Dataset<Row> ds3 = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load();
ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight python %}

# Subscribe to 1 topic
ds1 = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1") \
  .load()
ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Subscribe to multiple topics
ds2 = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1,topic2") \
  .load()
ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Subscribe to a pattern
ds3 = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribePattern", "topic.*") \
  .load()
ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

{% endhighlight %}
</div>
</div>
Each row in the source has the following schema:
<table class="table">
<tr><th>Column</th><th>Type</th></tr>
<tr>
  <td>key</td>
  <td>binary</td>
</tr>
<tr>
  <td>value</td>
  <td>binary</td>
</tr>
<tr>
  <td>topic</td>
  <td>string</td>
</tr>
<tr>
  <td>partition</td>
  <td>int</td>
</tr>
<tr>
  <td>offset</td>
  <td>long</td>
</tr>
<tr>
  <td>timestamp</td>
  <td>long</td>
</tr>
<tr>
  <td>timestampType</td>
  <td>int</td>
</tr>
</table>
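To make the schema concrete, here is a minimal Scala sketch that projects the Kafka metadata columns alongside the deserialized key and value; it assumes the `ds1` stream created above:

{% highlight scala %}
// Project the Kafka metadata columns next to the cast key and value.
// Column types follow the schema above: topic is a string, partition an int,
// offset and timestamp are longs, and timestampType is an int.
val withMetadata = ds1.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value",
  "topic", "partition", "offset", "timestamp", "timestampType")
{% endhighlight %}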
The following options must be set for the Kafka source.
<table class="table">
<tr><th>Option</th><th>Value</th><th>Meaning</th></tr>
<tr>
  <td>subscribe</td>
  <td>A comma-separated list of topics</td>
  <td>The topic list to subscribe to. Only one of "subscribe" and "subscribePattern" options can be specified for the Kafka source.</td>
</tr>
<tr>
  <td>subscribePattern</td>
  <td>Java regex string</td>
  <td>The pattern used to subscribe to topics. Only one of "subscribe" and "subscribePattern" options can be specified for the Kafka source.</td>
</tr>
<tr>
  <td>kafka.bootstrap.servers</td>
  <td>A comma-separated list of host:port</td>
  <td>The Kafka "bootstrap.servers" configuration.</td>
</tr>
</table>
The following configurations are optional:
<table class="table">
<tr><th>Option</th><th>Value</th><th>Default</th><th>Meaning</th></tr>
<tr>
  <td>startingOffset</td>
  <td>["earliest", "latest"]</td>
  <td>"latest"</td>
  <td>The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: this only applies when a new streaming query is started; resuming will always pick up from where the query left off.</td>
</tr>
<tr>
  <td>failOnDataLoss</td>
  <td>[true, false]</td>
  <td>true</td>
  <td>Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.</td>
</tr>
<tr>
  <td>kafkaConsumer.pollTimeoutMs</td>
  <td>long</td>
  <td>512</td>
  <td>The timeout in milliseconds to poll data from Kafka in executors.</td>
</tr>
<tr>
  <td>fetchOffset.numRetries</td>
  <td>int</td>
  <td>3</td>
  <td>Number of times to retry before giving up fetching Kafka latest offsets.</td>
</tr>
<tr>
  <td>fetchOffset.retryIntervalMs</td>
  <td>long</td>
  <td>10</td>
  <td>Milliseconds to wait before retrying to fetch Kafka offsets.</td>
</tr>
</table>
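As an illustration of these optional settings, a source might be configured along the lines of the following sketch; the broker list and topic name are placeholders, and the values chosen here are examples only:

{% highlight scala %}
val ds = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("startingOffset", "earliest")          // start from the earliest offsets on first run
  .option("failOnDataLoss", "false")             // don't fail the query on possible data loss
  .option("kafkaConsumer.pollTimeoutMs", "1024") // allow up to ~1 second per poll in executors
  .load()
{% endhighlight %}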
Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g., `stream.option("kafka.bootstrap.servers", "host:port")`. For possible kafkaParams, see the [Kafka consumer config docs](http://kafka.apache.org/documentation.html#newconsumerconfigs).

Note that the following Kafka params cannot be set and the Kafka source will throw an exception:

- **group.id**: Kafka source will create a unique group id for each query automatically.
- **auto.offset.reset**: Set the source option `startingOffset` to `earliest` or `latest` to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka Consumer to do it. This ensures that no data is missed when new topics/partitions are dynamically subscribed. Note that `startingOffset` only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.
- **key.deserializer**: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the keys (see the sketch at the end of this guide).
- **value.deserializer**: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
- **enable.auto.commit**: Kafka source doesn't commit any offset.
- **interceptor.classes**: Kafka source always reads keys and values as byte arrays. It's not safe to use ConsumerInterceptor as it may break the query.

### Deploying

As with any Spark applications, `spark-submit` is used to launch your application. `spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}` and its dependencies can be directly added to `spark-submit` using `--packages`, such as,

    ./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...

See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
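To tie the earlier notes together, here is a minimal Scala sketch of passing a Kafka consumer config through the `kafka.` prefix and deserializing the binary key/value columns with DataFrame operations. The broker addresses, topic name, and the extra consumer setting are placeholders chosen for this example, not recommendations:

{% highlight scala %}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("KafkaSourceSketch").getOrCreate()
import spark.implicits._

val records = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  // Any Kafka consumer config can be passed with the "kafka." prefix;
  // this one maps to the consumer's "max.partition.fetch.bytes" setting.
  .option("kafka.max.partition.fetch.bytes", "1048576")
  .option("subscribe", "topic1")
  .load()

// key and value arrive as binary; deserialize them explicitly with DataFrame operations.
val typed = records
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "partition", "offset")
  .as[(String, String, Int, Long)]
{% endhighlight %}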