author | Liwei Lin <lwlin7@gmail.com> | 2017-02-28 22:58:51 -0800 |
---|---|---|
committer | Shixiong Zhu <shixiong@databricks.com> | 2017-02-28 22:58:51 -0800 |
commit | 4913c92c2fbfcc22b41afb8ce79687165392d7da (patch) | |
tree | 3879e2eed39d386aaf67383b7f6abdb170e923f0 /mllib | |
parent | 89cd3845b6edb165236a6498dcade033975ee276 (diff) | |
[SPARK-19633][SS] FileSource read from FileSink
## What changes were proposed in this pull request?
Currently, the file source always uses `InMemoryFileIndex` to scan files under a given path.
However, when reading the output of another streaming query, the file source should use `MetadataFileIndex` to list files from the sink log instead. This patch adds that support.
## `MetadataFileIndex` or `InMemoryFileIndex`
```scala
spark
  .readStream
  .format(...)
  .load("/some/path") // for a non-glob path:
                      //   - use `MetadataFileIndex` when `/some/path/_spark_metadata` exists
                      //   - fall back to `InMemoryFileIndex` otherwise
```
```scala
spark
  .readStream
  .format(...)
  .load("/some/path/*/*") // for a glob path: always use `InMemoryFileIndex`
```
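The selection rule above can be sketched in plain Scala. This is only an illustration, not the code in this patch: `FileIndexChoice`, `IndexKind`, `isGlobPath`, and `chooseIndex` are hypothetical names, the glob check is a simple scan for common glob metacharacters, and the local filesystem (via `java.nio.file`) stands in for the Hadoop filesystem that Spark would actually consult. The metadata directory name is passed in as a parameter; the demo uses `_spark_metadata`, the name Spark's file sink gives its log directory.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical sketch of the decision described above; not the patch's code.
object FileIndexChoice {
  sealed trait IndexKind
  case object UseMetadataFileIndex extends IndexKind
  case object UseInMemoryFileIndex extends IndexKind

  // Assumption: a path counts as a glob if it contains any of these characters.
  private val globChars: Set[Char] = "{}[]*?".toSet

  def isGlobPath(path: String): Boolean = path.exists(globChars.contains)

  // Glob path: always the in-memory index.
  // Non-glob path: the metadata index only if the sink's log directory exists.
  def chooseIndex(path: String, metadataDirName: String): IndexKind =
    if (isGlobPath(path)) UseInMemoryFileIndex
    else if (Files.isDirectory(Paths.get(path).resolve(metadataDirName))) UseMetadataFileIndex
    else UseInMemoryFileIndex

  def main(args: Array[String]): Unit = {
    val plainDir = Files.createTempDirectory("plain-dir")   // no sink log inside
    val sinkDir  = Files.createTempDirectory("sink-output") // simulated sink output
    Files.createDirectory(sinkDir.resolve("_spark_metadata"))

    println(chooseIndex(plainDir.toString, "_spark_metadata"))        // UseInMemoryFileIndex
    println(chooseIndex(sinkDir.toString, "_spark_metadata"))         // UseMetadataFileIndex
    println(chooseIndex(sinkDir.toString + "/*/*", "_spark_metadata")) // UseInMemoryFileIndex
  }
}
```

Checking for a glob first means the existence check never has to interpret wildcard characters as a literal directory name.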
## How was this patch tested?
Two newly added tests.
Author: Liwei Lin <lwlin7@gmail.com>
Closes #16987 from lw-lin/source-read-from-sink.
Diffstat (limited to 'mllib')
0 files changed, 0 insertions, 0 deletions