author | petermaxlee <petermaxlee@gmail.com> | 2016-08-26 11:30:23 -0700 |
---|---|---|
committer | Shixiong Zhu <shixiong@databricks.com> | 2016-08-26 11:30:23 -0700 |
commit | 9812f7d5381f7cd8112fd30c7e45ae4f0eab6e88 (patch) | |
tree | a4c5a9776e39309391d1a01b264fa718a7d3926e /mllib/src/test | |
parent | 261c55dd8808502fb7f3384eb537d26a4a8123d7 (diff) | |
download | spark-9812f7d5381f7cd8112fd30c7e45ae4f0eab6e88.tar.gz spark-9812f7d5381f7cd8112fd30c7e45ae4f0eab6e88.tar.bz2 spark-9812f7d5381f7cd8112fd30c7e45ae4f0eab6e88.zip |
[SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely
## What changes were proposed in this pull request?
Before this change, FileStreamSource used an in-memory hash set to track the files processed by the engine. The set could grow indefinitely, leading to OOM or overflow of the hash set.
This patch introduces a new user-defined option called "maxFileAge", which defaults to 24 hours. If a file is older than this age, FileStreamSource purges it from the in-memory map that tracks the files that have been processed.
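To illustrate the idea, here is a minimal Python sketch of an age-bounded "seen files" map. It is not the code from this patch: the class name `SeenFilesMap`, its methods, and the millisecond timestamps are assumptions made for illustration. It also assumes age is measured against the newest file timestamp seen so far rather than wall-clock time.

```python
class SeenFilesMap:
    """Illustrative sketch only: tracks processed files and forgets old ones."""

    def __init__(self, max_age_ms):
        if max_age_ms < 0:
            raise ValueError("max_age_ms must be non-negative")
        self.max_age_ms = max_age_ms
        self._seen = {}            # path -> file timestamp (ms)
        self._latest_ts = 0        # newest file timestamp observed so far
        self._last_purge_ts = 0    # files older than this are treated as already seen

    def add(self, path, timestamp_ms):
        """Record a processed file and advance the newest-timestamp watermark."""
        self._seen[path] = timestamp_ms
        if timestamp_ms > self._latest_ts:
            self._latest_ts = timestamp_ms

    def is_new_file(self, path, timestamp_ms):
        """A file is new if it is not too old and has not been recorded."""
        return timestamp_ms >= self._last_purge_ts and path not in self._seen

    def purge(self):
        """Drop entries older than (newest timestamp - max age); return the count removed."""
        self._last_purge_ts = self._latest_ts - self.max_age_ms
        stale = [p for p, ts in self._seen.items() if ts < self._last_purge_ts]
        for p in stale:
            del self._seen[p]
        return len(stale)

    def __len__(self):
        return len(self._seen)


if __name__ == "__main__":
    seen = SeenFilesMap(max_age_ms=10)
    seen.add("a", 5)
    seen.add("b", 20)
    removed = seen.purge()
    print(removed, len(seen))
```

In this sketch, a file whose timestamp falls below the purge threshold is rejected by `is_new_file` even after its entry has been dropped, so purging does not cause old files to be picked up again. The trade-off is that a file arriving with a timestamp older than the threshold is never processed.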
## How was this patch tested?
Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified the new test cases would fail when the timeout was set to a very large number.
Author: petermaxlee <petermaxlee@gmail.com>
Closes #14728 from petermaxlee/SPARK-17165.
Diffstat (limited to 'mllib/src/test')
0 files changed, 0 insertions, 0 deletions