author	Liwei Lin <lwlin7@gmail.com>	2017-03-09 11:02:44 -0800
committer	Shixiong Zhu <shixiong@databricks.com>	2017-03-09 11:02:44 -0800
commit	40da4d181d648308de85fdcabc5c098ee861949a (patch)
tree	270f1d888c4b879cb4c525f9208352e98e949624 /docs/structured-streaming-programming-guide.md
parent	3232e54f2fcb8d2072cba4bc763ef29d5d8d325f (diff)
[SPARK-19715][STRUCTURED STREAMING] Option to Strip Paths in FileSource
## What changes were proposed in this pull request?

Today, we compare the whole path when deciding whether a file is new in the FileSource for structured streaming. However, this can cause false negatives when the path has changed in a cosmetic way (e.g. changing `s3n` to `s3a`). This patch adds an option `fileNameOnly` that causes the new-file check to be based only on the filename (while still storing the whole path in the log).

## Usage

```scala
spark
  .readStream
  .option("fileNameOnly", true)
  .text("s3n://bucket/dir1/dir2")
  .writeStream
  ...
```

## How was this patch tested?

Added a test case.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17120 from lw-lin/filename-only.
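To make the filename-only comparison concrete, here is a minimal sketch of the idea. This is not Spark's actual `FileStreamSource` code; the `SeenFilesSketch` class and its `markAsSeen` method are invented for illustration, assuming a seen-files set keyed either by the full path or, when `fileNameOnly` is enabled, by just the last path segment:

```scala
import java.net.URI

// Hypothetical sketch of the fileNameOnly comparison (not Spark's implementation).
class SeenFilesSketch(fileNameOnly: Boolean) {
  private val seen = scala.collection.mutable.HashSet.empty[String]

  // Reduce a path to its comparison key: the whole URI, or only the filename.
  private def key(path: String): String =
    if (fileNameOnly) new URI(path).getPath.split('/').last else path

  // Returns true the first time a file (under the chosen key) is observed.
  def markAsSeen(path: String): Boolean = seen.add(key(path))
}

object SeenFilesSketch {
  def main(args: Array[String]): Unit = {
    val tracker = new SeenFilesSketch(fileNameOnly = true)
    println(tracker.markAsSeen("s3n://a/b/dataset.txt"))   // true: first sighting
    println(tracker.markAsSeen("s3a://a/b/c/dataset.txt")) // false: same filename
  }
}
```

With `fileNameOnly = false`, the second call would return `true`, and the cosmetically renamed path would be reprocessed as a new file.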
Diffstat (limited to 'docs/structured-streaming-programming-guide.md')
-rw-r--r--  docs/structured-streaming-programming-guide.md | 12
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 6af47b6efb..995ac77a4f 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1052,10 +1052,18 @@ Here are the details of all the sinks in Spark.
<td>Append</td>
<td>
<code>path</code>: path to the output directory, must be specified.
+ <br/>
<code>maxFilesPerTrigger</code>: maximum number of new files to be considered in every trigger (default: no max)
<br/>
- <code>latestFirst</code>: whether to processs the latest new files first, useful when there is a large backlog of files(default: false)
- <br/><br/>
+ <code>latestFirst</code>: whether to process the latest new files first, useful when there is a large backlog of files (default: false)
+ <br/>
+ <code>fileNameOnly</code>: whether to check new files based only on the filename instead of on the full path (default: false). With this set to <code>true</code>, the following files would be considered the same file, because their filenames, "dataset.txt", are the same:
+ <br/>
+ · "file:///dataset.txt"<br/>
+ · "s3://a/dataset.txt"<br/>
+ · "s3n://a/b/dataset.txt"<br/>
+ · "s3a://a/b/c/dataset.txt"<br/>
+ <br/>
For file-format-specific options, see the related methods in DataFrameWriter
(<a href="api/scala/index.html#org.apache.spark.sql.DataFrameWriter">Scala</a>/<a href="api/java/org/apache/spark/sql/DataFrameWriter.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter">Python</a>).
E.g. for "parquet" format options see <code>DataFrameWriter.parquet()</code>