author: petermaxlee <petermaxlee@gmail.com> 2016-08-26 11:30:23 -0700
committer: Shixiong Zhu <shixiong@databricks.com> 2016-08-26 11:30:23 -0700
commit: 9812f7d5381f7cd8112fd30c7e45ae4f0eab6e88
tree: a4c5a9776e39309391d1a01b264fa718a7d3926e /mllib
parent: 261c55dd8808502fb7f3384eb537d26a4a8123d7
[SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely
## What changes were proposed in this pull request?

Before this change, FileStreamSource used an in-memory hash set to track the files processed by the engine. The set could grow indefinitely, leading to OOM or overflow of the hash set.

This patch introduces a new user-defined option called "maxFileAge", which defaults to 24 hours. If a file is older than this age, FileStreamSource purges it from the in-memory map used to track the files that have been processed.

## How was this patch tested?

Added unit tests for the underlying utility, and added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified that the new test cases fail when the timeout is set to a very large number.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14728 from petermaxlee/SPARK-17165.
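The age-based purge described in the commit message can be illustrated with a minimal sketch. This is not the code from the patch: the class name `AgeBoundedFileLog`, its methods, and the choice to measure age against the newest timestamp seen so far are all assumptions made for illustration.

```scala
import scala.collection.mutable

// Hypothetical sketch of an age-bounded "seen files" map.
// Names and semantics are illustrative, not taken from the patch.
final class AgeBoundedFileLog(maxFileAgeMs: Long) {
  require(maxFileAgeMs >= 0, "maxFileAgeMs must be non-negative")

  // path -> file timestamp in milliseconds
  private val seen = mutable.HashMap.empty[String, Long]
  // Newest timestamp recorded so far (assumed reference point for "age").
  private var latestTimestamp = 0L
  // Files with a timestamp below this value are treated as too old.
  private var purgeThreshold = 0L

  // Record a processed file.
  def add(path: String, timestampMs: Long): Unit = {
    seen(path) = timestampMs
    if (timestampMs > latestTimestamp) latestTimestamp = timestampMs
  }

  // A file is new only if it is recent enough and has not been recorded.
  def isNewFile(path: String, timestampMs: Long): Boolean =
    timestampMs >= purgeThreshold && !seen.contains(path)

  // Drop entries older than the age limit; returns how many were removed.
  def purge(): Int = {
    purgeThreshold = latestTimestamp - maxFileAgeMs
    val stale = seen.filter { case (_, ts) => ts < purgeThreshold }.keys.toList
    seen --= stale
    stale.size
  }

  def size: Int = seen.size
}
```

In this sketch, the map stays bounded because every purge discards entries that fall outside the age window, and `isNewFile` rejects files older than the threshold so that purged files are not picked up again.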
Diffstat (limited to 'mllib')
0 files changed, 0 insertions, 0 deletions