author     Ramkumar Venkataraman <rvenkataraman@paypal.com>    2017-02-25 02:18:22 +0000
committer  Sean Owen <srowen@percale.home>    2017-02-25 02:18:22 +0000
commit    1b9ba258e086e2ba89a4f35a54106e2f8a38b525 (patch)
tree      3d3e5bce18fd01eed36f8dab0d79a615970f493e /docs
parent    fa7c582e9442b985a0493fb1dd15b3fb9b6031b4 (diff)
download  spark-1b9ba258e086e2ba89a4f35a54106e2f8a38b525.tar.gz
spark-1b9ba258e086e2ba89a4f35a54106e2f8a38b525.tar.bz2
spark-1b9ba258e086e2ba89a4f35a54106e2f8a38b525.zip
[MINOR][DOCS] Fix few typos in structured streaming doc
## What changes were proposed in this pull request?

Minor typo in `even-time`, which is changed to `event-time`, plus a couple of grammatical fixes.

## How was this patch tested?

N/A - since this is a doc fix. I did a jekyll build locally though.

Author: Ramkumar Venkataraman <rvenkataraman@paypal.com>

Closes #17037 from ramkumarvenkat/doc-fix.
Diffstat (limited to 'docs')
-rw-r--r--   docs/structured-streaming-programming-guide.md | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index ad3b2fb26d..6af47b6efb 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -392,7 +392,7 @@ data, thus relieving the users from reasoning about it. As an example, let’s
see how this model handles event-time based processing and late arriving data.
## Handling Event-time and Late Data
-Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time. For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them. This event-time is very naturally expressed in this model -- each event from the devices is a row in the table, and event-time is a column value in the row. This allows window-based aggregations (e.g. number of events every minute) to be just a special type of grouping and aggregation on the even-time column -- each time window is a group and each row can belong to multiple windows/groups. Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.
+Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time. For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them. This event-time is very naturally expressed in this model -- each event from the devices is a row in the table, and event-time is a column value in the row. This allows window-based aggregations (e.g. number of events every minute) to be just a special type of grouping and aggregation on the event-time column -- each time window is a group and each row can belong to multiple windows/groups. Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.
Furthermore, this model naturally handles data that has arrived later than
expected based on its event-time. Since Spark is updating the Result Table,
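(A minimal Scala sketch, not from the patch itself, of the idea in the paragraph above: a window-based aggregation is just a grouping on the event-time column. The `rate` source and the `deviceEvents`/`eventTime` names are stand-ins for illustration, not taken from the guide.)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder.appName("EventTimeWindowSketch").getOrCreate()

// Stand-in streaming source; its generated timestamp is treated as the event-time column.
val deviceEvents = spark.readStream
  .format("rate")
  .load()
  .withColumnRenamed("timestamp", "eventTime")

// "Number of events every minute" is just a grouping on the event-time column,
// where each 1-minute window is a group.
val countsPerMinute = deviceEvents
  .groupBy(window(col("eventTime"), "1 minute"))
  .count()
```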
@@ -401,7 +401,7 @@ as well as cleaning up old aggregates to limit the size of intermediate
state data. Since Spark 2.1, we have support for watermarking which
allows the user to specify the threshold of late data, and allows the engine
to accordingly clean up old state. These are explained later in more
-details in the [Window Operations](#window-operations-on-event-time) section.
+detail in the [Window Operations](#window-operations-on-event-time) section.
## Fault Tolerance Semantics
Delivering end-to-end exactly-once semantics was one of key goals behind the design of Structured Streaming. To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers)
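(Also not from the patch: the user-visible piece of this design is the checkpoint location, where the engine durably records the source offsets it has processed so a restarted query can resume without losing or duplicating work. A hedged sketch, with a made-up path and a console sink standing in for a real fault-tolerant sink.)

```scala
// `counts` stands for any streaming aggregation DataFrame defined earlier.
// End-to-end exactly-once additionally requires a replayable source and an idempotent sink.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/structured-streaming-demo") // hypothetical path
  .start()

query.awaitTermination()
```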
@@ -647,7 +647,7 @@ df.groupBy("deviceType").count()
</div>
### Window Operations on Event Time
-Aggregations over a sliding event-time window are straightforward with Structured Streaming. The key idea to understand about window-based aggregations are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into. Let's understand this with an illustration.
+Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into. Let's understand this with an illustration.
Imagine our [quick example](#quick-example) is modified and the stream now contains lines along with the time when the line was generated. Instead of running word counts, we want to count words within 10 minute windows, updating every 5 minutes. That is, word counts in words received between 10 minute windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, etc. Note that 12:00 - 12:10 means data that arrived after 12:00 but before 12:10. Now, consider a word that was received at 12:07. This word should increment the counts corresponding to two windows 12:00 - 12:10 and 12:05 - 12:15. So the counts will be indexed by both, the grouping key (i.e. the word) and the window (can be calculated from the event-time).
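(For illustration only, not part of the patch: a Scala sketch of the 10-minute windows sliding every 5 minutes described above, assuming lines arrive on a socket source with an attached timestamp that we treat as the event-time, and that `spark` is an active SparkSession.)

```scala
import org.apache.spark.sql.functions.{col, explode, split, window}

// Socket source that pairs each line with a timestamp column.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// Split lines into words, keeping the timestamp of the originating line.
val words = lines.select(
  explode(split(col("value"), " ")).as("word"),
  col("timestamp")
)

// Counts are keyed by (window, word): a word stamped 12:07 contributes to both
// the 12:00 - 12:10 and 12:05 - 12:15 windows.
val windowedCounts = words
  .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
  .count()
```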
@@ -713,7 +713,7 @@ old windows correctly, as illustrated below.
![Handling Late Data](img/structured-streaming-late-data.png)
-However, to run this query for days, its necessary for the system to bound the amount of
+However, to run this query for days, it's necessary for the system to bound the amount of
intermediate in-memory state it accumulates. This means the system needs to know when an old
aggregate can be dropped from the in-memory state because the application is not going to receive
late data for that aggregate any more. To enable this, in Spark 2.1, we have introduced
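(A hedged sketch, not from the patch, of the Spark 2.1 watermarking mentioned above: `withWatermark` declares how late data may arrive, so windowed state older than the threshold can be dropped. Column names follow the windowed word count sketch earlier.)

```scala
// Tolerate data up to 10 minutes late relative to the latest event-time seen so far;
// windows older than that can be cleaned out of the in-memory state.
val watermarkedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
  .count()
```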