diff options
author | Tathagata Das <tathagata.das1565@gmail.com> | 2014-11-24 13:50:20 -0800 |
---|---|---|
committer | Tathagata Das <tathagata.das1565@gmail.com> | 2014-11-24 13:50:47 -0800 |
commit | 6fa3e415d419ee9b2f3d14106a714b627e251e7d (patch) | |
tree | 4b10c243c818cae04570152f0014474fe3ab2067 /python | |
parent | 2d35cc0852e5ce426b143b51d03a71f16ad06c11 (diff) | |
download | spark-6fa3e415d419ee9b2f3d14106a714b627e251e7d.tar.gz spark-6fa3e415d419ee9b2f3d14106a714b627e251e7d.tar.bz2 spark-6fa3e415d419ee9b2f3d14106a714b627e251e7d.zip |
[SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times
Because of a corner case, a file already selected for batch t can get considered again for batch t+2. This refactoring fixes it by remembering all the files selected in the last 1 minute, so that this corner case does not arise. Also uses spark context's hadoop configuration to access the file system API for listing directories.
pwendell Please take look. I still have not run long-running integration tests, so I cannot say for sure whether this has indeed solved the issue. You could do a first pass on this in the meantime.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #3419 from tdas/filestream-fix2 and squashes the following commits:
c19dd8a [Tathagata Das] Addressed PR comments.
513b608 [Tathagata Das] Updated docs.
d364faf [Tathagata Das] Added the current time condition back
5526222 [Tathagata Das] Removed unnecessary imports.
38bb736 [Tathagata Das] Fix long line.
203bbc7 [Tathagata Das] Un-ignore tests.
eaef4e1 [Tathagata Das] Fixed SPARK-4519
9dbd40a [Tathagata Das] Refactored FileInputDStream to remember last few batches.
(cherry picked from commit cb0e9b0980f38befe88bf52aa037fe33262730f7)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
Diffstat (limited to 'python')
0 files changed, 0 insertions, 0 deletions