author Burak Yavuz <brkyvz@gmail.com> 2016-09-21 17:12:52 -0700
committer Shixiong Zhu <shixiong@databricks.com> 2016-09-21 17:12:52 -0700
commit 7cbe2164499e83b6c009fdbab0fbfffe89a2ecc0
tree 91620fefef7ee75bbc6956edb2d49f7dcd38d0ac /R
parent 8c3ee2bc42e6320b9341cebdba51a00162c897ea
[SPARK-17569] Make StructuredStreaming FileStreamSource batch generation faster
## What changes were proposed in this pull request?

While getting the batch for a `FileStreamSource` in StructuredStreaming, we know exactly which files we must take. We have already verified that they exist and have committed them to a metadata log. However, when creating the FileSourceRelation for an incremental execution, the code checks the existence of every single file once again! When you have hundreds of thousands of files in a folder, creating the first batch takes 2+ hours when working with S3! This PR disables that check.

## How was this patch tested?

Added a unit test to `FileStreamSource`.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15122 from brkyvz/SPARK-17569.
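The idea behind the change can be illustrated with a small sketch. This is not Spark's code: `resolve_files` and its `check_files_exist` flag are hypothetical names chosen for illustration, and the local filesystem stands in for S3. It only shows why skipping a per-file existence check is safe when the caller has already verified the files and recorded them in a log.

```python
import os
from typing import Iterable, List


def resolve_files(paths: Iterable[str], check_files_exist: bool = True) -> List[str]:
    """Return the list of files to read (hypothetical helper, not a Spark API).

    When check_files_exist is True, every path is probed individually, which
    costs one filesystem call per file. A caller that has already verified the
    files (for example, by committing them to a metadata log) can pass False
    to skip those probes.
    """
    resolved = list(paths)
    if check_files_exist:
        missing = [p for p in resolved if not os.path.exists(p)]
        if missing:
            raise FileNotFoundError(f"{len(missing)} path(s) do not exist: {missing[:3]}")
    return resolved
```

A streaming source that trusts its own log would call `resolve_files(logged_paths, check_files_exist=False)`, while an ad hoc batch read would keep the default and fail early on a bad path.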
Diffstat (limited to 'R')
0 files changed, 0 insertions, 0 deletions