diff --git a/docs/learn-flink/etl.md b/docs/learn-flink/etl.md
index dc42fa96c346d..78074ad70dc95 100644
--- a/docs/learn-flink/etl.md
+++ b/docs/learn-flink/etl.md
@@ -149,7 +149,7 @@ this would mean doing some sort of GROUP BY with the `startCell`, while in Flink
 {% highlight java %}
 rides
     .flatMap(new NYCEnrichment())
-    .keyBy(value -> value.startCell)
+    .keyBy(enrichedRide -> enrichedRide.startCell)
 {% endhighlight %}
 
 Every `keyBy` causes a network shuffle that repartitions the stream. In general this is pretty
@@ -157,32 +157,6 @@ expensive, since it involves network communication along with serialization and
 
 keyBy and network shuffle
 
-In the example above, the key has been specified by a field name, "startCell". This style of key
-selection has the drawback that the compiler is unable to infer the type of the field being used for
-keying, and so Flink will pass around the key values as Tuples, which can be awkward. It is
-better to use a properly typed KeySelector, e.g.,
-
-{% highlight java %}
-rides
-    .flatMap(new NYCEnrichment())
-    .keyBy(
-        new KeySelector() {
-
-            @Override
-            public int getKey(EnrichedRide enrichedRide) throws Exception {
-                return enrichedRide.startCell;
-            }
-        })
-{% endhighlight %}
-
-which can be more succinctly expressed with a lambda:
-
-{% highlight java %}
-rides
-    .flatMap(new NYCEnrichment())
-    .keyBy(enrichedRide -> enrichedRide.startCell)
-{% endhighlight %}
-
 ### Keys are computed
 
 KeySelectors aren't limited to extracting a key from your events. They can, instead,
diff --git a/docs/learn-flink/etl.zh.md b/docs/learn-flink/etl.zh.md
index 190240b9e3b34..08ef552506a27 100644
--- a/docs/learn-flink/etl.zh.md
+++ b/docs/learn-flink/etl.zh.md
@@ -130,36 +130,13 @@ public static class NYCEnrichment implements FlatMapFunction
 {% highlight java %}
 rides
     .flatMap(new NYCEnrichment())
-    .keyBy(value -> value.startCell)
+    .keyBy(enrichedRide -> enrichedRide.startCell)
 {% endhighlight %}
 
 Every `keyBy` repartitions the stream through a shuffle. In general this is expensive, since it involves network communication along with serialization and deserialization.
 
 keyBy and network shuffle
 
-In the example above, the field "startCell" is specified as the key. This style of key selection has the drawback that the compiler cannot infer the type of the field used for keying, so Flink passes the key values around as Tuples, which can be awkward to handle. It is better to use a properly typed KeySelector,
-
-{% highlight java %}
-rides
-    .flatMap(new NYCEnrichment())
-    .keyBy(
-        new KeySelector() {
-
-            @Override
-            public int getKey(EnrichedRide enrichedRide) throws Exception {
-                return enrichedRide.startCell;
-            }
-        })
-{% endhighlight %}
-
-You can also use a more succinct lambda expression:
-
-{% highlight java %}
-rides
-    .flatMap(new NYCEnrichment())
-    .keyBy(enrichedRide -> enrichedRide.startCell)
-{% endhighlight %}
-
 ### Keys are computed
 
 KeySelectors aren't limited to extracting a key from your events. You can instead compute the key in whatever way you want, so long as the resulting key is deterministic and implements `hashCode()` and `equals()`. These restrictions rule out KeySelectors that generate random numbers or that return Arrays or Enums, but you can build composite keys from Tuples and POJOs, so long as their elements follow the same rules.
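
Reviewer note on the "Keys are computed" context paragraph: the requirement that keys implement `hashCode()` and `equals()` is easy to check outside Flink. The sketch below is plain Java with no Flink dependency. The class `KeyEqualityDemo` and the composite key `CellAndHour` are hypothetical names made up for illustration and are not part of the training code. It only shows why an array makes a poor key while a value-based class works; it does not check whether Flink would treat `CellAndHour` as a POJO type.

```java
import java.util.Objects;

public class KeyEqualityDemo {

    // Hypothetical composite key (assumption, not from the Flink docs):
    // a start cell combined with an hour of day.
    static final class CellAndHour {
        final int startCell;
        final int hour;

        CellAndHour(int startCell, int hour) {
            this.startCell = startCell;
            this.hour = hour;
        }

        // Value-based equality: two keys with the same fields are equal.
        @Override
        public boolean equals(Object other) {
            if (this == other) {
                return true;
            }
            if (!(other instanceof CellAndHour)) {
                return false;
            }
            CellAndHour that = (CellAndHour) other;
            return startCell == that.startCell && hour == that.hour;
        }

        // Hash derived only from the fields, so equal keys hash alike.
        @Override
        public int hashCode() {
            return Objects.hash(startCell, hour);
        }
    }

    // Arrays inherit Object.equals, which compares identity, so two
    // arrays with the same contents are not equal as keys.
    static boolean arrayKeysMatch() {
        int[] a = {42, 9};
        int[] b = {42, 9};
        return a.equals(b);
    }

    // Two separately constructed value-based keys compare equal and
    // produce the same hash code.
    static boolean pojoKeysMatch() {
        CellAndHour a = new CellAndHour(42, 9);
        CellAndHour b = new CellAndHour(42, 9);
        return a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println("array keys match: " + arrayKeysMatch());
        System.out.println("value-class keys match: " + pojoKeysMatch());
    }
}
```

Running `main` should report that the array keys do not match and the value-class keys do, which is the property a computed key needs for events with the same logical key to land in the same partition.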