[FLINK-20153] Describe time behaviour in execution mode docs
kl0u authored and aljoscha committed Nov 25, 2020
1 parent 10ae0f4 commit 203bbac
Showing 1 changed file with 46 additions and 0 deletions.
46 changes: 46 additions & 0 deletions docs/dev/datastream_execution_mode.md
@@ -232,8 +232,54 @@ information on this.

### Event Time / Watermarks

When it comes to supporting [event time]({% link dev/event_time.md %}), Flink’s
streaming runtime builds on the pessimistic assumption that events may arrive
out-of-order, _i.e._ an event with timestamp `t` may arrive after an event with
timestamp `t+1`. Because of this, the system can never be sure that no more
elements with timestamp `t < T` for a given timestamp `T` will arrive in the
future. To bound the impact of this out-of-orderness on the final result while
keeping the system practical, in `STREAMING` mode, Flink uses a heuristic
called [Watermarks]({% link concepts/timely-stream-processing.md
%}#event-time-and-watermarks). A watermark with timestamp `T` signals that no
element with timestamp `t < T` will follow.
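The watermark heuristic can be illustrated with a small, framework-agnostic sketch (this is not Flink’s actual `WatermarkGenerator` API, just an illustration of the idea): the watermark trails the highest timestamp seen so far by a fixed out-of-orderness bound, so its promise holds as long as no element is later than that bound.

```python
def bounded_out_of_orderness_watermarks(timestamps, max_out_of_orderness):
    """Yield the watermark emitted after each element.

    The watermark promises: no element with timestamp < watermark will follow,
    provided lateness never exceeds max_out_of_orderness.
    """
    max_seen = float("-inf")
    for t in timestamps:
        max_seen = max(max_seen, t)
        # Trail the highest timestamp seen so far by the allowed lateness.
        yield max_seen - max_out_of_orderness

# Out-of-order input: 5 arrives after 7, yet the watermark never retreats.
list(bounded_out_of_orderness_watermarks([3, 7, 5, 9], 2))  # [1, 5, 5, 7]
```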

In `BATCH` mode, where the input dataset is known in advance, there is no need
for such a heuristic as, at the very least, elements can be sorted by timestamp
so that they are processed in temporal order. For readers familiar with
streaming, in `BATCH` we can assume “perfect watermarks”.

Given the above, in `BATCH` mode, we only need a `MAX_WATERMARK` at the end of
the input associated with each key, or at the end of input if the input stream
is not keyed. Based on this scheme, all registered timers will fire at the *end
of time* and user-defined `WatermarkAssigners` or `WatermarkStrategies` are
ignored.
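The `BATCH` scheme described above can be sketched conceptually (a hypothetical simulation, not Flink internals; `run_batch_keyed`, `on_element`, and `on_timer` are made-up names for illustration): each key’s elements are processed in timestamp order, and once a key’s input is exhausted, every registered timer fires at the *end of time*.

```python
import heapq

def run_batch_keyed(records, on_element, on_timer):
    """Process (key, timestamp, value) records as BATCH mode conceptually does:
    sort each key's elements by timestamp ("perfect watermarks"), then fire
    every registered timer at the end of time once the key's input ends."""
    by_key = {}
    for key, ts, value in records:
        by_key.setdefault(key, []).append((ts, value))
    for key, elems in by_key.items():
        timers = []
        register = lambda t: heapq.heappush(timers, t)
        for ts, value in sorted(elems):  # temporal order within the key
            on_element(key, ts, value, register)
        while timers:                    # MAX_WATERMARK: all timers fire now
            on_timer(key, heapq.heappop(timers))
```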

### Processing Time

Processing time is the wall-clock time on the machine on which a record is
processed, taken at the instant the record is processed. By this definition,
the results of a computation based on processing time are not reproducible:
the same record processed twice will get two different timestamps.
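A minimal sketch of this non-reproducibility (`stamp` is a hypothetical helper, not a Flink API): attaching wall-clock time to the same record twice yields two different timestamps.

```python
import time

def stamp(record):
    """Attach the wall-clock processing time to a record."""
    return (record, time.time())

first = stamp("same-record")
time.sleep(0.01)            # a reprocessing happens some time later
second = stamp("same-record")
# The record is identical, but first[1] != second[1]: the result depends on
# *when* the record was processed, not on the record itself.
```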

Despite the above, using processing time in `STREAMING` mode can be useful.
This is because streaming pipelines often ingest their unbounded input in
*real time*, so there is a correlation between event time and processing time.
Because of this correlation, in `STREAMING` mode `1h` in event time is often
almost `1h` in processing time, or wall-clock time. So processing time can be
used for early (incomplete) firings that give hints about the expected
results.

This correlation does not exist in the batch world, where the input dataset is
static and known in advance. Given this, in `BATCH` mode we allow users to
request the current processing time and register processing time timers but,
as in the case of event time, all the timers will fire at the end of the
input.

Conceptually, we can imagine that processing time does not advance during the
execution of a job and we fast-forward to the *end of time* when the whole
input is processed.
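This conceptual model can be sketched as a frozen clock that jumps to the end of time when the input is exhausted (a hypothetical simulation; `BatchProcessingTimeService` and its methods are made-up names, not Flink’s internal timer service):

```python
END_OF_TIME = float("inf")

class BatchProcessingTimeService:
    """Toy model of BATCH-mode processing time: the clock does not advance
    while input is consumed, then fast-forwards to the end of time."""

    def __init__(self, start):
        self._now = start
        self._timers = []

    def current_processing_time(self):
        return self._now                  # frozen while the job runs

    def register_timer(self, t, callback):
        self._timers.append((t, callback))

    def finish_input(self):
        self._now = END_OF_TIME           # fast-forward to the end of time
        for t, cb in sorted(self._timers, key=lambda x: x[0]):
            cb(t)                         # every registered timer fires now
```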

### Failure Recovery

In `STREAMING` execution mode, Flink uses checkpoints for failure recovery.