author	Liwei Lin <lwlin7@gmail.com>	2017-03-09 11:02:44 -0800
committer	Shixiong Zhu <shixiong@databricks.com>	2017-03-09 11:02:44 -0800
commit	40da4d181d648308de85fdcabc5c098ee861949a (patch)
tree	270f1d888c4b879cb4c525f9208352e98e949624 /docs/structured-streaming-programming-guide.md
parent	3232e54f2fcb8d2072cba4bc763ef29d5d8d325f (diff)
[SPARK-19715][STRUCTURED STREAMING] Option to Strip Paths in FileSource
## What changes were proposed in this pull request?

Today, we compare the whole path when deciding whether a file is new in the FileSource for structured streaming. However, this can cause false negatives when the path has changed in a cosmetic way (e.g. changing `s3n` to `s3a`). This patch adds an option `fileNameOnly` that causes the new-file check to be based only on the filename (while still storing the whole path in the log).

## Usage

```scala
spark
  .readStream
  .option("fileNameOnly", true)
  .text("s3n://bucket/dir1/dir2")
  .writeStream
  ...
```

## How was this patch tested?

Added a test case.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #17120 from lw-lin/filename-only.
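To make the filename-only comparison concrete, here is a minimal sketch of the idea. This is not Spark's actual `FileStreamSource` code; the `SeenFilesSketch` class and its `markAsSeen` method are invented for illustration, assuming a seen-files set keyed either by the full path or, when `fileNameOnly` is enabled, by just the last path segment:

```scala
import java.net.URI

// Hypothetical sketch of the fileNameOnly comparison (not Spark's implementation).
class SeenFilesSketch(fileNameOnly: Boolean) {
  private val seen = scala.collection.mutable.HashSet.empty[String]

  // Reduce a path to its comparison key: the whole URI, or only the filename.
  private def key(path: String): String =
    if (fileNameOnly) new URI(path).getPath.split('/').last else path

  // Returns true the first time a file (under the chosen key) is observed.
  def markAsSeen(path: String): Boolean = seen.add(key(path))
}

object SeenFilesSketch {
  def main(args: Array[String]): Unit = {
    val tracker = new SeenFilesSketch(fileNameOnly = true)
    println(tracker.markAsSeen("s3n://a/b/dataset.txt"))   // true: first sighting
    println(tracker.markAsSeen("s3a://a/b/c/dataset.txt")) // false: same filename
  }
}
```

With `fileNameOnly = false`, the second call would return `true`, and the cosmetically renamed path would be reprocessed as a new file.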
Diffstat (limited to 'docs/structured-streaming-programming-guide.md')
-rw-r--r--  docs/structured-streaming-programming-guide.md | 12
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 6af47b6efb..995ac77a4f 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -1052,10 +1052,18 @@ Here are the details of all the sinks in Spark.
<td>Append</td>
<td>
<code>path</code>: path to the output directory, must be specified.
+ <br/>
<code>maxFilesPerTrigger</code>: maximum number of new files to be considered in every trigger (default: no max)
<br/>
- <code>latestFirst</code>: whether to processs the latest new files first, useful when there is a large backlog of files(default: false)
- <br/><br/>
+ <code>latestFirst</code>: whether to process the latest new files first, useful when there is a large backlog of files (default: false)
+ <br/>
+ <code>fileNameOnly</code>: whether to check new files based only on the filename instead of on the full path (default: false). With this set to <code>true</code>, the following files would be considered the same file, because their filenames, "dataset.txt", are the same:
+ <br/>
+ · "file:///dataset.txt"<br/>
+ · "s3://a/dataset.txt"<br/>
+ · "s3n://a/b/dataset.txt"<br/>
+ · "s3a://a/b/c/dataset.txt"<br/>
+ <br/>
For file-format-specific options, see the related methods in DataFrameWriter
(<a href="api/scala/index.html#org.apache.spark.sql.DataFrameWriter">Scala</a>/<a href="api/java/org/apache/spark/sql/DataFrameWriter.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter">Python</a>).
E.g. for "parquet" format options see <code>DataFrameWriter.parquet()</code>