Commit

[Issue 5475][docs] Update message deduplication (apache#5512)
* update message deduplication

* update
Jennifer88huang-zz authored and merlimat committed Oct 31, 2019
1 parent bcce54d commit f01ff8a
Showing 1 changed file with 31 additions and 32 deletions.
63 changes: 31 additions & 32 deletions site2/docs/cookbooks-deduplication.md
@@ -4,69 +4,69 @@ title: Message deduplication
sidebar_label: Message deduplication
---

**Message deduplication** is a feature of Pulsar that, when enabled, ensures that each message produced on Pulsar topics is persisted to disk *only once*, even if the message is produced more than once. Message deduplication essentially unburdens Pulsar applications of the responsibility of ensuring deduplication and instead handles it automatically on the server side.
When **message deduplication** is enabled, each message produced on a Pulsar topic is persisted to disk *only once*, even if the message is produced more than once. Message deduplication is handled automatically on the server side.

Using message deduplication in Pulsar involves making some [configuration changes](#configuration) to your Pulsar brokers as well as some minor changes to the behavior of Pulsar [clients](#clients).

> For a more thorough theoretical explanation of message deduplication, see the [Concepts and Architecture](concepts-messaging.md#message-deduplication) document.
To use message deduplication in Pulsar, you have to [configure](#configure-message-deduplication) your Pulsar brokers and [clients](#pulsar-clients).

> For more details on message deduplication, refer to [Concepts and Architecture](concepts-messaging.md#message-deduplication).
## How it works

Message deduplication can be enabled and disabled on a per-namespace basis. By default, it is *disabled* on all namespaces and can enabled in the following ways:
You can enable or disable message deduplication on a per-namespace basis. By default, it is *disabled* on all namespaces. You can enable it in the following ways:

* Using the [`pulsar-admin namespaces`](#enabling) interface
* As a broker-level [default](#default) for all namespaces
* Enable it for all namespaces at the broker level
* Enable it for specific namespaces with the `pulsar-admin namespaces` interface

## Configuration for message deduplication
## Configure message deduplication

You can configure message deduplication in Pulsar using the [`broker.conf`](reference-configuration.md#broker) configuration file. The following deduplication-related parameters are available:
You can configure message deduplication in Pulsar using the [`broker.conf`](reference-configuration.md#broker) configuration file. The following deduplication-related parameters are available.

Parameter | Description | Default
:---------|:------------|:-------
`brokerDeduplicationEnabled` | Sets the default behavior for message deduplication in the Pulsar [broker](reference-terminology.md#broker). If set to `true`, message deduplication will be enabled by default on all namespaces; if set to `false` (the default), deduplication will have to be [enabled](#enabling) and [disabled](#disabling) on a per-namespace basis. | `false`
`brokerDeduplicationMaxNumberOfProducers` | The maximum number of producers for which information will be stored for deduplication purposes. | `10000`
`brokerDeduplicationEntriesInterval` | The number of entries after which a deduplication informational snapshot is taken. A larger interval will lead to fewer snapshots being taken, though this would also lengthen the topic recovery time (the time required for entries published after the snapshot to be replayed). | `1000`
`brokerDeduplicationProducerInactivityTimeoutMinutes` | The time of inactivity (in minutes) after which the broker will discard deduplication information related to a disconnected producer. | `360` (6 hours)
`brokerDeduplicationEnabled` | Sets the default behavior for message deduplication in the Pulsar [broker](reference-terminology.md#broker). If it is set to `true`, message deduplication is enabled by default on all namespaces; if it is set to `false` (the default), you have to enable or disable deduplication on a per-namespace basis. | `false`
`brokerDeduplicationMaxNumberOfProducers` | The maximum number of producers for which information is stored for deduplication purposes. | `10000`
`brokerDeduplicationEntriesInterval` | The number of entries after which a deduplication informational snapshot is taken. A larger interval leads to fewer snapshots being taken, though this lengthens the topic recovery time (the time required for entries published after the snapshot to be replayed). | `1000`
`brokerDeduplicationProducerInactivityTimeoutMinutes` | The time of inactivity (in minutes) after which the broker discards deduplication information related to a disconnected producer. | `360` (6 hours)
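
The trade-off described for `brokerDeduplicationEntriesInterval` can be illustrated with some rough arithmetic. The sketch below is a simplified model, not Pulsar's implementation: it assumes a snapshot is taken exactly every `snapshot_interval` entries and that recovery replays only the entries published after the latest snapshot. The function names are hypothetical.

```python
def snapshots_taken(total_entries: int, snapshot_interval: int) -> int:
    """Number of snapshots taken so far in this simplified model."""
    if snapshot_interval <= 0:
        raise ValueError("snapshot_interval must be positive")
    return total_entries // snapshot_interval


def entries_to_replay(total_entries: int, snapshot_interval: int) -> int:
    """Entries published after the latest snapshot, replayed on recovery."""
    if snapshot_interval <= 0:
        raise ValueError("snapshot_interval must be positive")
    return total_entries % snapshot_interval


# Compare a few intervals for a topic holding 25,500 entries.
for interval in (100, 1000, 10000):
    print(
        f"interval={interval}: "
        f"snapshots={snapshots_taken(25_500, interval)}, "
        f"replay={entries_to_replay(25_500, interval)}"
    )
```

Under these assumptions, a larger interval means fewer snapshots but potentially more entries to replay when the topic is recovered.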

### Setting the broker-level default {#default}
### Set the default value at the broker level

By default, message deduplication is *disabled* on all Pulsar namespaces. To enable it by default on all namespaces, set the `brokerDeduplicationEnabled` parameter to `true` and restart the broker.

Regardless of the value of `brokerDeduplicationEnabled`, [enabling](#enabling) and [disabling](#disabling) via the CLI will override the broker-level default.
Regardless of the value of `brokerDeduplicationEnabled`, enabling or disabling deduplication for a namespace through the Pulsar admin CLI overrides the broker-level default for that namespace.
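
The precedence between the broker-level default and a namespace-level setting can be sketched as a small function. This is only an illustration of the override rule stated above; `effective_deduplication` and its parameters are hypothetical names, and `None` stands for a namespace where nothing has been set through the CLI.

```python
from typing import Optional


def effective_deduplication(
    broker_default: bool, namespace_setting: Optional[bool]
) -> bool:
    """Return whether deduplication applies to a namespace.

    A namespace-level setting, when present, takes precedence over the
    broker-level default (the value of brokerDeduplicationEnabled).
    """
    if namespace_setting is not None:
        return namespace_setting
    return broker_default


# Broker default is disabled, but the namespace enables it explicitly.
print(effective_deduplication(broker_default=False, namespace_setting=True))
# Broker default is enabled, but the namespace disables it explicitly.
print(effective_deduplication(broker_default=True, namespace_setting=False))
# No namespace-level setting: the broker default applies.
print(effective_deduplication(broker_default=True, namespace_setting=None))
```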

### Enabling message deduplication {#enabling}
### Enable message deduplication

You can enable message deduplication on specific namespaces, regardless of the the [default](#default) for the broker, using the [`pulsar-admin namespace set-deduplication`](reference-pulsar-admin.md#namespace-set-deduplication) command. You can use the `--enable`/`-e` flag and specify the namespace. Here's an example with <tenant>/<namespace>:
Although message deduplication is disabled by default at the broker level, you can enable it for specific namespaces using the [`pulsar-admin namespaces set-deduplication`](reference-pulsar-admin.md#namespace-set-deduplication) command. Use the `--enable`/`-e` flag and specify the namespace. The following is an example with `<tenant>/<namespace>`:

```bash
$ bin/pulsar-admin namespaces set-deduplication \
public/default \
--enable # or just -e
```

### Disabling message deduplication {#disabling}
### Disable message deduplication

You can disable message deduplication on a specific namespace using the same method shown [above](#enabling), except using the `--disable`/`-d` flag instead. Here's an example with <tenant>/<namespace>:
Even if you enable message deduplication at the broker level, you can disable it for a specific namespace using the [`pulsar-admin namespaces set-deduplication`](reference-pulsar-admin.md#namespace-set-deduplication) command. Use the `--disable`/`-d` flag and specify the namespace. The following is an example with `<tenant>/<namespace>`:

```bash
$ bin/pulsar-admin namespaces set-deduplication \
public/default \
--disable # or just -d
```
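
If you script these changes, one option is to build the command line programmatically. The helper below is a hypothetical sketch: it only assembles the argument list for the two commands shown above and does not run anything. The `<tenant>/<namespace>` check and the default `bin/pulsar-admin` path are assumptions taken from the examples in this document.

```python
from typing import List


def set_deduplication_command(
    namespace: str, enable: bool, admin_path: str = "bin/pulsar-admin"
) -> List[str]:
    """Build the argv list for `pulsar-admin namespaces set-deduplication`."""
    parts = namespace.split("/")
    # Assumes the <tenant>/<namespace> form used in the examples above.
    if len(parts) != 2 or not all(parts):
        raise ValueError("namespace must look like <tenant>/<namespace>")
    flag = "--enable" if enable else "--disable"
    return [admin_path, "namespaces", "set-deduplication", namespace, flag]


print(" ".join(set_deduplication_command("public/default", enable=True)))
print(" ".join(set_deduplication_command("public/default", enable=False)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where the Pulsar binaries are installed.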

## Message deduplication and Pulsar clients {#clients}
## Pulsar clients

If you enable message deduplication in your Pulsar brokers, you won't need to make any major changes to your Pulsar clients. There are, however, two settings that you need to provide for your client producers:
If you enable message deduplication in Pulsar brokers, you need to complete the following tasks for your client producers:

1. The producer must be given a name
1. The message send timeout needs to be set to infinity (i.e. no timeout)
1. Specify a name for the producer.
1. Set the message send timeout to `0` (that is, no timeout).
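
To see why a stable producer name matters, consider a simplified model of broker-side deduplication. The sketch below is not Pulsar's implementation: it assumes the broker remembers the highest sequence ID persisted per producer name and drops any message whose sequence ID is not higher. `DedupTracker` and its methods are hypothetical names.

```python
from typing import Dict, List, Tuple


class DedupTracker:
    """Toy model: track the highest sequence ID seen per producer name."""

    def __init__(self) -> None:
        self._highest_seq: Dict[str, int] = {}
        self.persisted: List[Tuple[str, int, str]] = []

    def publish(self, producer_name: str, sequence_id: int, payload: str) -> bool:
        """Persist the message unless it is a duplicate; return True if persisted."""
        last = self._highest_seq.get(producer_name, -1)
        if sequence_id <= last:
            # Already seen from this producer: treat as a duplicate and drop it.
            return False
        self._highest_seq[producer_name] = sequence_id
        self.persisted.append((producer_name, sequence_id, payload))
        return True


tracker = DedupTracker()
print(tracker.publish("producer-1", 0, "a"))  # first message
print(tracker.publish("producer-1", 1, "b"))  # next message
print(tracker.publish("producer-1", 1, "b"))  # retry of the same message
print(tracker.publish("producer-2", 1, "b"))  # same payload, different name
print(len(tracker.persisted))
```

In this model, a retry is recognized only because it arrives under the same producer name. A producer that reconnects under a different name starts with no history, so its resent messages are not recognized as duplicates.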

Instructions for [Java](#java), [Python](#python), and [C++](#cpp) clients can be found below.
The instructions for Java, Python, and C++ clients are different.

### Java clients {#java}
<!--DOCUSAURUS_CODE_TABS-->
<!--Java clients-->

To enable message deduplication on a [Java producer](client-libraries-java.md#producers), set the producer name using the `producerName` setter and set the timeout to 0 using the `sendTimeout` setter. Here's an example:
To enable message deduplication on a [Java producer](client-libraries-java.md#producers), set the producer name using the `producerName` setter, and set the timeout to `0` using the `sendTimeout` setter.

```java
import org.apache.pulsar.client.api.Producer;
@@ -83,9 +83,9 @@ Producer producer = pulsarClient.newProducer()
.create();
```

### Python clients {#python}
<!--Python clients-->

To enable message deduplication on a [Python producer](client-libraries-python.md#producers), set the producer name using `producer_name` and the timeout to 0 using `send_timeout_millis`. Here's an example:
To enable message deduplication on a [Python producer](client-libraries-python.md#producers), set the producer name using `producer_name`, and set the timeout to `0` using `send_timeout_millis`.

```python
import pulsar
@@ -96,10 +96,9 @@ producer = client.create_producer(
producer_name="producer-1",
send_timeout_millis=0)
```
<!--C++ clients-->

### C++ clients {#cpp}

To enable message deduplication on a [C++ producer](client-libraries-cpp.md#producer), set the producer name using `producer_name` and the timeout to 0 using `send_timeout_millis`. Here's an example:
To enable message deduplication on a [C++ producer](client-libraries-cpp.md#producer), set the producer name using `setProducerName` and set the timeout to `0` using `setSendTimeout` on the `ProducerConfiguration` object.

```cpp
#include <pulsar/Client.h>
@@ -118,4 +117,4 @@ Producer producer;

Result result = client.createProducer(topic, producerConfig, producer);
```
<!--END_DOCUSAURUS_CODE_TABS-->
