[FLINK-20153] Describe time behaviour in execution mode docs
kl0u authored and aljoscha committed Nov 25, 2020
1 parent 10ae0f4 commit 203bbac
Showing 1 changed file with 46 additions and 0 deletions.
46 changes: 46 additions & 0 deletions docs/dev/datastream_execution_mode.md
@@ -232,8 +232,54 @@ information on this.

### Event Time / Watermarks

When it comes to supporting [event time]({% link dev/event_time.md %}), Flink’s
streaming runtime builds on the pessimistic assumption that events may arrive
out-of-order, _i.e._ an event with timestamp `t` may arrive after an event with
timestamp `t+1`. Because of this, the system can never be sure that no more
elements with timestamp `t < T` for a given timestamp `T` will arrive in the
future. To bound the impact of this out-of-orderness on the final result while
keeping the system practical, in `STREAMING` mode, Flink uses a heuristic
called [Watermarks]({% link concepts/timely-stream-processing.md
%}#event-time-and-watermarks). A watermark with timestamp `T` signals that no
element with timestamp `t < T` will follow.
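The watermark heuristic can be illustrated with a small, framework-agnostic sketch (this is not Flink’s actual `WatermarkGenerator` API, just an illustration of the idea): the watermark trails the highest timestamp seen so far by a fixed out-of-orderness bound, so its promise holds as long as no element is later than that bound.

```python
def bounded_out_of_orderness_watermarks(timestamps, max_out_of_orderness):
    """Yield the watermark emitted after each element.

    The watermark promises: no element with timestamp < watermark will follow,
    provided lateness never exceeds max_out_of_orderness.
    """
    max_seen = float("-inf")
    for t in timestamps:
        max_seen = max(max_seen, t)
        # Trail the highest timestamp seen so far by the allowed lateness.
        yield max_seen - max_out_of_orderness

# Out-of-order input: 5 arrives after 7, yet the watermark never retreats.
list(bounded_out_of_orderness_watermarks([3, 7, 5, 9], 2))  # [1, 5, 5, 7]
```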

In `BATCH` mode, where the input dataset is known in advance, there is no need
for such a heuristic as, at the very least, elements can be sorted by timestamp
so that they are processed in temporal order. For readers familiar with
streaming, in `BATCH` we can assume “perfect watermarks”.

Given the above, in `BATCH` mode, we only need a `MAX_WATERMARK` at the end of
the input associated with each key, or at the end of input if the input stream
is not keyed. Based on this scheme, all registered timers will fire at the *end
of time* and user-defined `WatermarkAssigners` or `WatermarkStrategies` are
ignored.
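The `BATCH` scheme described above can be sketched conceptually (a hypothetical simulation, not Flink internals; `run_batch_keyed`, `on_element`, and `on_timer` are made-up names for illustration): each key’s elements are processed in timestamp order, and once a key’s input is exhausted, every registered timer fires at the *end of time*.

```python
import heapq

def run_batch_keyed(records, on_element, on_timer):
    """Process (key, timestamp, value) records as BATCH mode conceptually does:
    sort each key's elements by timestamp ("perfect watermarks"), then fire
    every registered timer at the end of time once the key's input ends."""
    by_key = {}
    for key, ts, value in records:
        by_key.setdefault(key, []).append((ts, value))
    for key, elems in by_key.items():
        timers = []
        register = lambda t: heapq.heappush(timers, t)
        for ts, value in sorted(elems):  # temporal order within the key
            on_element(key, ts, value, register)
        while timers:                    # MAX_WATERMARK: all timers fire now
            on_timer(key, heapq.heappop(timers))
```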

### Processing Time

Processing time is the wall-clock time on the machine on which a record is
processed, taken at the instant the record is processed. By this definition,
the results of a computation based on processing time are not reproducible:
the same record processed twice will get two different timestamps.
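A minimal sketch of this non-reproducibility (`stamp` is a hypothetical helper, not a Flink API): attaching wall-clock time to the same record twice yields two different timestamps.

```python
import time

def stamp(record):
    """Attach the wall-clock processing time to a record."""
    return (record, time.time())

first = stamp("same-record")
time.sleep(0.01)            # a reprocessing happens some time later
second = stamp("same-record")
# The record is identical, but first[1] != second[1]: the result depends on
# *when* the record was processed, not on the record itself.
```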

Despite the above, using processing time in `STREAMING` mode can be useful.
This is because streaming pipelines often ingest their unbounded input in
*real time*, so there is a correlation between event time and processing time.
Because of this correlation, in `STREAMING` mode `1h` in event time is often
almost `1h` in processing time, or wall-clock time. So processing time can be
used for early (incomplete) firings that give hints about the expected
results.

This correlation does not exist in the batch world, where the input dataset is
static and known in advance. Given this, in `BATCH` mode we allow users to
request the current processing time and register processing time timers but,
as in the case of event time, all the timers will fire at the end of the
input.

Conceptually, we can imagine that processing time does not advance during the
execution of a job and we fast-forward to the *end of time* when the whole
input is processed.
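This conceptual model can be sketched as a frozen clock that jumps to the end of time when the input is exhausted (a hypothetical simulation; `BatchProcessingTimeService` and its methods are made-up names, not Flink’s internal timer service):

```python
END_OF_TIME = float("inf")

class BatchProcessingTimeService:
    """Toy model of BATCH-mode processing time: the clock does not advance
    while input is consumed, then fast-forwards to the end of time."""

    def __init__(self, start):
        self._now = start
        self._timers = []

    def current_processing_time(self):
        return self._now                  # frozen while the job runs

    def register_timer(self, t, callback):
        self._timers.append((t, callback))

    def finish_input(self):
        self._now = END_OF_TIME           # fast-forward to the end of time
        for t, cb in sorted(self._timers, key=lambda x: x[0]):
            cb(t)                         # every registered timer fires now
```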

### Failure Recovery

In `STREAMING` execution mode, Flink uses checkpoints for failure recovery.