Skip to content

Commit

Permalink
[FLINK-20296][training-docs] remove obsolete content about keyBy(string)
Browse files Browse the repository at this point in the history
  • Loading branch information
alpinegizmo committed Nov 23, 2020
1 parent fd663e4 commit 1b8385d
Show file tree
Hide file tree
Showing 2 changed files with 2 additions and 51 deletions.
28 changes: 1 addition & 27 deletions docs/learn-flink/etl.md
Original file line number Diff line number Diff line change
Expand Up @@ -149,40 +149,14 @@ this would mean doing some sort of GROUP BY with the `startCell`, while in Flink
{% highlight java %}
rides
.flatMap(new NYCEnrichment())
.keyBy(value -> value.startCell)
.keyBy(enrichedRide -> enrichedRide.startCell)
{% endhighlight %}

Every `keyBy` causes a network shuffle that repartitions the stream. In general this is pretty
expensive, since it involves network communication along with serialization and deserialization.

<img src="{% link /fig/keyBy.png %}" alt="keyBy and network shuffle" class="offset" width="45%" />

In the example above, the key has been specified by a field name, "startCell". This style of key
selection has the drawback that the compiler is unable to infer the type of the field being used for
keying, and so Flink will pass around the key values as Tuples, which can be awkward. It is
better to use a properly typed KeySelector, e.g.,

{% highlight java %}
rides
.flatMap(new NYCEnrichment())
.keyBy(
new KeySelector<EnrichedRide, int>() {

@Override
public int getKey(EnrichedRide enrichedRide) throws Exception {
return enrichedRide.startCell;
}
})
{% endhighlight %}

which can be more succinctly expressed with a lambda:

{% highlight java %}
rides
.flatMap(new NYCEnrichment())
.keyBy(enrichedRide -> enrichedRide.startCell)
{% endhighlight %}

### Keys are computed

KeySelectors aren't limited to extracting a key from your events. They can, instead,
Expand Down
25 changes: 1 addition & 24 deletions docs/learn-flink/etl.zh.md
Original file line number Diff line number Diff line change
Expand Up @@ -130,36 +130,13 @@ public static class NYCEnrichment implements FlatMapFunction<TaxiRide, EnrichedR
{% highlight java %}
rides
.flatMap(new NYCEnrichment())
.keyBy(value -> value.startCell)
.keyBy(enrichedRide -> enrichedRide.startCell)
{% endhighlight %}

每个 `keyBy` 会通过 shuffle 来为数据流进行重新分区。总体来说这个开销是很大的,它涉及网络通信、序列化和反序列化。

<img src="{% link /fig/keyBy.png %}" alt="keyBy and network shuffle" class="offset" width="45%" />

在上面的例子中,将 "startCell" 这个字段定义为键。这种选择键的方式有个缺点,就是编译器无法推断用作键的字段的类型,所以 Flink 会将键值作为元组传递,这有时候会比较难处理。所以最好还是使用一个合适的 KeySelector,

{% highlight java %}
rides
.flatMap(new NYCEnrichment())
.keyBy(
new KeySelector<EnrichedRide, int>() {

@Override
public int getKey(EnrichedRide enrichedRide) throws Exception {
return enrichedRide.startCell;
}
})
{% endhighlight %}

也可以使用更简洁的 lambda 表达式:

{% highlight java %}
rides
.flatMap(new NYCEnrichment())
.keyBy(enrichedRide -> enrichedRide.startCell)
{% endhighlight %}

### 通过计算得到键

KeySelector 不仅限于从事件中抽取键。你也可以按想要的方式计算得到键值,只要最终结果是确定的,并且实现了 `hashCode()``equals()`。这些限制条件不包括产生随机数或者返回 Arrays 或 Enums 的 KeySelector,但你可以用元组和 POJO 来组成键,只要他们的元素遵循上述条件。
Expand Down

0 comments on commit 1b8385d

Please sign in to comment.