From 40da4d181d648308de85fdcabc5c098ee861949a Mon Sep 17 00:00:00 2001
From: Liwei Lin
Date: Thu, 9 Mar 2017 11:02:44 -0800
Subject: [SPARK-19715][STRUCTURED STREAMING] Option to Strip Paths in FileSource

## What changes were proposed in this pull request?

Today, we compare the whole path when deciding if a file is new in the FileSource for structured streaming. However, this would cause false negatives in the case where the path has changed in a cosmetic way (e.g. changing `s3n` to `s3a`). This patch adds an option `fileNameOnly` that causes the new-file check to be based only on the filename (while still storing the whole path in the log).

## Usage

```scala
spark
  .readStream
  .option("fileNameOnly", true)
  .text("s3n://bucket/dir1/dir2")
  .writeStream
  ...
```

## How was this patch tested?

Added a test case.

Author: Liwei Lin

Closes #17120 from lw-lin/filename-only.
---
 docs/structured-streaming-programming-guide.md | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

(limited to 'docs/structured-streaming-programming-guide.md')

diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 6af47b6efb..995ac77a4f 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1052,10 +1052,18 @@ Here are the details of all the sinks in Spark.
         Append
         path: path to the output directory, must be specified.
+        <br/>
         maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max)
         <br/>
-        latestFirst: whether to process the latest new files first, useful when there is a large backlog of files(default: false)
-        <br/><br/>
+        latestFirst: whether to process the latest new files first, useful when there is a large backlog of files (default: false)
+        <br/>
+        fileNameOnly: whether to check new files based on only the filename instead of the full path (default: false). With this set to `true`, the following files would be considered the same file, because their filenames, "dataset.txt", are the same:
+        <br/>
+        · "file:///dataset.txt"<br/>
+        · "s3://a/dataset.txt"<br/>
+        · "s3n://a/b/dataset.txt"<br/>
+        · "s3a://a/b/c/dataset.txt"<br/>
+        <br/>
         For file-format-specific options, see the related methods in DataFrameWriter (Scala/Java/Python). E.g. for "parquet" format options see DataFrameWriter.parquet()
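The filename-only comparison the new option enables can be sketched in a few lines. This is a hypothetical illustration, not Spark's actual `FileStreamSource` code; the object `FileNameOnlyCheck` and the helper `stripPath` are made-up names for this sketch:

```scala
// Hypothetical sketch of what fileNameOnly changes: reduce each path to its
// final segment before checking it against the set of already-seen files.
object FileNameOnlyCheck {
  // Keep only the text after the last '/', i.e. the filename.
  def stripPath(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  def main(args: Array[String]): Unit = {
    val paths = Seq(
      "file:///dataset.txt",
      "s3://a/dataset.txt",
      "s3n://a/b/dataset.txt",
      "s3a://a/b/c/dataset.txt")

    // All four paths collapse to a single key, so after the first one is
    // processed the other three are treated as already seen.
    println(paths.map(stripPath).toSet)  // Set(dataset.txt)
  }
}
```

With the option left at its default (`false`), each of the four paths above would be a distinct key, so a cosmetic scheme change such as `s3n` to `s3a` would make the same data look like a new file.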