[SPARK-18826][SS] Add 'latestFirst' option to FileStreamSource - spark


author	Shixiong Zhu <shixiong@databricks.com>	2016-12-15 13:17:51 -0800
committer	Tathagata Das <tathagata.das1565@gmail.com>	2016-12-15 13:17:51 -0800
commit	68a6dc974b25e6eddef109f6fd23ae4e9775ceca (patch)
tree	fbe30c950ce49b783c38998fed75cc0240379c50 /python
parent	4f7292c87512a7da3542998d0e5aa21c27a511e9 (diff)

[SPARK-18826][SS] Add 'latestFirst' option to FileStreamSource

## What changes were proposed in this pull request?

When starting a stream with a lot of backfill and maxFilesPerTrigger, the user could often want to start with most recent files first. This would let you keep low latency for recent data and slowly backfill historical data.

This PR adds a new option `latestFirst` to control this behavior. When it's true, `FileStreamSource` will sort the files by the modified time from latest to oldest, and take the first `maxFilesPerTrigger` files as a new batch.

## How was this patch tested?

The added test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16251 from zsxwing/newest-first.
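The selection rule described above can be illustrated with a small standalone sketch. This is not the code from the patch: `FileEntry`, `select_batch`, and the sample paths and timestamps are hypothetical names and values chosen for illustration, and the oldest-first ordering used when `latest_first` is false is an assumption about the default behavior.

```python
from typing import List, NamedTuple, Optional


class FileEntry(NamedTuple):
    """Hypothetical stand-in for a file seen by the source."""
    path: str
    modified_ms: int  # last-modified time in milliseconds


def select_batch(
    files: List[FileEntry],
    max_files_per_trigger: Optional[int] = None,
    latest_first: bool = False,
) -> List[FileEntry]:
    """Order pending files by modified time and cap the batch size.

    With latest_first=True the newest files come first, so a capped batch
    holds the most recent data and older files wait for later batches.
    """
    ordered = sorted(files, key=lambda f: f.modified_ms, reverse=latest_first)
    if max_files_per_trigger is None:
        return ordered  # no cap: every pending file goes into the batch
    return ordered[:max_files_per_trigger]


# Illustrative data only.
pending = [
    FileEntry("part-a", 100),
    FileEntry("part-b", 300),
    FileEntry("part-c", 200),
]
newest = select_batch(pending, max_files_per_trigger=2, latest_first=True)
oldest = select_batch(pending, max_files_per_trigger=2, latest_first=False)
```

In a streaming query the behavior would be requested through reader options, for example `.option("latestFirst", "true")` together with `.option("maxFilesPerTrigger", ...)` on a file source.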

Diffstat (limited to 'python')

0 files changed, 0 insertions, 0 deletions

