Add documentation for HDFS offload (apache#5762)
congbobo184 authored and codelipenghui committed Dec 2, 2019
1 parent f310ab0 commit 078ba44
Showing 3 changed files with 63 additions and 2 deletions.
Binary file modified site2/docs/assets/pulsar-tiered-storage.png
2 changes: 1 addition & 1 deletion site2/docs/concepts-tiered-storage.md
@@ -12,6 +12,6 @@ One way to alleviate this cost is to use Tiered Storage. With tiered storage, ol

> Data written to BookKeeper is replicated to 3 physical machines by default. However, once a segment is sealed in BookKeeper it becomes immutable and can be copied to long term storage. Long term storage can achieve cost savings by using mechanisms such as [Reed-Solomon error correction](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction) to require fewer physical copies of data.
Pulsar currently supports S3 and Google Cloud Storage (GCS) for [long term store](https://pulsar.apache.org/docs/en/cookbooks-tiered-storage/). Offloading to long term storage triggered via a Rest API or command line interface. The user passes in the amount of topic data they wish to retain on BookKeeper, and the broker will copy the backlog data to long term storage. The original data will then be deleted from BookKeeper after a configured delay (4 hours by default).
Pulsar currently supports S3, Google Cloud Storage (GCS), and filesystem for [long term store](https://pulsar.apache.org/docs/en/cookbooks-tiered-storage/). Offloading to long term storage is triggered via a REST API or command line interface. The user passes in the amount of topic data they wish to retain on BookKeeper, and the broker will copy the backlog data to long term storage. The original data will then be deleted from BookKeeper after a configured delay (4 hours by default).
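
For example, an offload that keeps only the most recent 10M of backlog on BookKeeper can be triggered from the command line; a minimal sketch, assuming a hypothetical topic named `persistent://my-tenant/my-namespace/topic1`:

```bash
# Offload everything except the newest 10M of backlog to long term storage
bin/pulsar-admin topics offload --size-threshold 10M persistent://my-tenant/my-namespace/topic1
```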

> For a guide for setting up tiered storage, see the [Tiered storage cookbook](cookbooks-tiered-storage.md).
63 changes: 62 additions & 1 deletion site2/docs/cookbooks-tiered-storage.md
@@ -6,11 +6,14 @@ sidebar_label: Tiered Storage

Pulsar's **Tiered Storage** feature allows older backlog data to be offloaded to long term storage, thereby freeing up space in BookKeeper and reducing storage costs. This cookbook walks you through using tiered storage in your Pulsar cluster.

Tiered storage currently uses [Apache Jclouds](https://jclouds.apache.org) to supports
* Tiered storage uses [Apache jclouds](https://jclouds.apache.org) to support
[Amazon S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage/) (GCS for short)
for long term storage. With jclouds, it is easy to add support for more
[cloud storage providers](https://jclouds.apache.org/reference/providers/#blobstore-providers) in the future.

* Tiered storage uses [Apache Hadoop](http://hadoop.apache.org/) to support filesystem storage for long term storage.
With Hadoop, it is easy to add support for more filesystems in the future.

## When should I use Tiered Storage?

Tiered storage should be used when you have a topic for which you want to keep a very long backlog for a long time. For example, if you have a topic containing user actions which you use to train your recommendation systems, you may want to keep that data for a long time, so that if you change your recommendation algorithm you can rerun it against your full user history.
@@ -41,6 +44,7 @@ Currently we support these driver types:

- `aws-s3`: [Simple Cloud Storage Service](https://aws.amazon.com/s3/)
- `google-cloud-storage`: [Google Cloud Storage](https://cloud.google.com/storage/)
- `filesystem`: [Filesystem Storage](http://hadoop.apache.org/)
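
The driver is selected with the `managedLedgerOffloadDriver` property in `broker.conf`; a minimal sketch:

```conf
# Select the offload driver (driver names are case-insensitive)
managedLedgerOffloadDriver=aws-s3
```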

> Driver names are case-insensitive. There is a third driver type, `s3`, which is identical to `aws-s3`,
> though it requires that you specify an endpoint URL using `s3ManagedLedgerOffloadServiceEndpoint`. This is useful if
@@ -186,6 +190,63 @@ Pulsar also provides some knobs to configure the size of requests sent to GCS.

In both cases, these should not be touched unless you know what you are doing.

### "filesystem" Driver configuration


#### Configure connection address

You can configure the connection address in the `broker.conf` file.

```conf
fileSystemURI="hdfs://127.0.0.1:9000"
```

#### Configure Hadoop profile path

The Hadoop profile path points to the configuration file the offloader uses. The file contains various settings, such as the base path, authentication, and so on.

```conf
fileSystemProfilePath="../conf/filesystem_offload_core_site.xml"
```

Topic data is stored using `org.apache.hadoop.io.MapFile`. You can use any of the Hadoop configuration options that apply to `org.apache.hadoop.io.MapFile`.

**Example**

```xml
<property>
  <name>fs.defaultFS</name>
  <value></value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>pulsar</value>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
</property>
<property>
  <name>io.seqfile.compress.blocksize</name>
  <value>1000000</value>
</property>
<property>
  <name>io.seqfile.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>io.map.index.interval</name>
  <value>128</value>
</property>
```

For more information about the configurations in `org.apache.hadoop.io.MapFile`, see [Filesystem Storage](http://hadoop.apache.org/).
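
Putting the pieces together, a minimal sketch of the filesystem offloader settings in `broker.conf`, reusing the example values from above:

```conf
# Select the filesystem offloader
managedLedgerOffloadDriver=filesystem
# Connection address and Hadoop profile path, as configured above
fileSystemURI="hdfs://127.0.0.1:9000"
fileSystemProfilePath="../conf/filesystem_offload_core_site.xml"
```
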
## Configuring offload to run automatically

Namespace policies can be configured to offload data automatically once a threshold is reached. The threshold is based on the size of data that the topic has stored on the Pulsar cluster. Once the topic reaches the threshold, an offload operation will be triggered. Setting a negative value for the threshold disables automatic offloading. Setting the threshold to 0 causes the broker to offload data as soon as it possibly can.
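
For example, a 10G threshold can be set from the command line; a minimal sketch, assuming a namespace named `my-tenant/my-namespace`:

```bash
# Automatically offload data once topics in this namespace exceed 10GB on the Pulsar cluster
bin/pulsar-admin namespaces set-offload-threshold --size 10G my-tenant/my-namespace
```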
