author | Burak Yavuz <brkyvz@gmail.com> | 2016-09-21 17:12:52 -0700 |
---|---|---|
committer | Shixiong Zhu <shixiong@databricks.com> | 2016-09-21 17:12:52 -0700 |
commit | 7cbe2164499e83b6c009fdbab0fbfffe89a2ecc0 (patch) | |
tree | 91620fefef7ee75bbc6956edb2d49f7dcd38d0ac /core/src/main/scala/org/apache | |
parent | 8c3ee2bc42e6320b9341cebdba51a00162c897ea (diff) | |
[SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster
## What changes were proposed in this pull request?
While getting the batch for a `FileStreamSource` in StructuredStreaming, we already know exactly which files we must read. We have already verified that they exist and have committed them to a metadata log. However, when creating the FileSourceRelation for an incremental execution, the code checks the existence of every single file once again!
When a folder contains hundreds of thousands of files, creating the first batch takes more than 2 hours when working with S3! This PR disables that check.
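The idea can be illustrated with a minimal, self-contained Scala sketch. It is not the code from this patch: `FileRelation`, `BatchResolution`, `resolveRelation`, and the `checkPathExist` flag are hypothetical names chosen for illustration, and the local-filesystem probe stands in for what would be one remote request per file on S3.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical stand-in for a resolved file relation; not Spark's actual class.
final case class FileRelation(paths: Seq[String])

object BatchResolution {
  // Counts existence probes so the cost of the check is visible.
  var existenceChecks: Int = 0

  // Hypothetical probe. On an object store such as S3, each call would be
  // a separate remote request, which is what makes the check expensive.
  private def exists(path: String): Boolean = {
    existenceChecks += 1
    Files.exists(Paths.get(path))
  }

  // `checkPathExist` is an assumed flag name for this sketch. When false,
  // the caller vouches that the paths were already verified (for example,
  // because they were committed to a metadata log), so no probe is made.
  def resolveRelation(
      paths: Seq[String],
      checkPathExist: Boolean = true): FileRelation = {
    if (checkPathExist) {
      paths.foreach { p =>
        if (!exists(p)) {
          throw new IllegalArgumentException(s"Path does not exist: $p")
        }
      }
    }
    FileRelation(paths)
  }
}
```

With this shape, a streaming source that has already logged its files would call `BatchResolution.resolveRelation(files, checkPathExist = false)`, while ordinary batch reads keep the default and still fail fast on a missing path.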
## How was this patch tested?
Added a unit test to `FileStreamSource`.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #15122 from brkyvz/SPARK-17569.
Diffstat (limited to 'core/src/main/scala/org/apache')
0 files changed, 0 insertions, 0 deletions