[SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster - spark

author	Burak Yavuz <brkyvz@gmail.com>	2016-09-21 17:12:52 -0700
committer	Shixiong Zhu <shixiong@databricks.com>	2016-09-21 17:12:52 -0700
commit	7cbe2164499e83b6c009fdbab0fbfffe89a2ecc0 (patch)
tree	91620fefef7ee75bbc6956edb2d49f7dcd38d0ac /R
parent	8c3ee2bc42e6320b9341cebdba51a00162c897ea (diff)
download	spark-7cbe2164499e83b6c009fdbab0fbfffe89a2ecc0.tar.gz spark-7cbe2164499e83b6c009fdbab0fbfffe89a2ecc0.tar.bz2 spark-7cbe2164499e83b6c009fdbab0fbfffe89a2ecc0.zip

[SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster

## What changes were proposed in this pull request?

While getting the batch for a `FileStreamSource` in StructuredStreaming, we know exactly which files we must take. We have already verified that they exist, and have committed them to a metadata log. However, when creating the FileSourceRelation for an incremental execution, the code checks the existence of every single file once again!

When you have hundreds of thousands of files in a folder, creating the first batch takes more than 2 hours when working with S3! This PR disables that check.

## How was this patch tested?

Added a unit test to `FileStreamSource`.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15122 from brkyvz/SPARK-17569.
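
To illustrate the idea, here is a minimal, self-contained sketch in Python. It is not Spark's code: `resolve_relation`, `check_path_exist`, `CountingFileSystem`, and the `s3://bucket/...` paths are hypothetical names chosen for this example. It only shows how skipping a per-file existence check removes one file system lookup per file when the file list is already trusted.

```python
from typing import Callable, Iterable, List


def resolve_relation(
    paths: Iterable[str],
    exists: Callable[[str], bool],
    check_path_exist: bool = True,
) -> List[str]:
    """Return the paths to read, optionally verifying each one first.

    `exists` stands in for a file system lookup, which is slow on object
    stores such as S3. Hypothetical helper, not a Spark API.
    """
    resolved = list(paths)
    if check_path_exist:
        missing = [p for p in resolved if not exists(p)]
        if missing:
            raise FileNotFoundError(f"Path does not exist: {missing[0]}")
    return resolved


class CountingFileSystem:
    """Toy file system that counts how many existence lookups it serves."""

    def __init__(self, files):
        self.files = set(files)
        self.lookups = 0

    def exists(self, path):
        self.lookups += 1
        return path in self.files


# Files assumed to be already verified and recorded in a metadata log.
logged_files = [f"s3://bucket/data/part-{i:05d}.json" for i in range(1000)]
fs = CountingFileSystem(logged_files)

# Default behaviour: one lookup per file.
resolve_relation(logged_files, fs.exists)
lookups_with_check = fs.lookups

# Trusted file list: the check is skipped, so no lookups happen.
fs.lookups = 0
resolve_relation(logged_files, fs.exists, check_path_exist=False)
lookups_without_check = fs.lookups

print(lookups_with_check, lookups_without_check)
```

In this sketch the check stays on by default, so callers that pass user-supplied paths still fail fast on a missing path, and only a caller that already trusts its file list opts out.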

Diffstat (limited to 'R')

0 files changed, 0 insertions, 0 deletions

