Add documentation for HDFS offload (apache#5762)
congbobo184 authored and codelipenghui committed Dec 2, 2019
1 parent f310ab0 commit 078ba44
Showing 3 changed files with 63 additions and 2 deletions.
Binary file modified site2/docs/assets/pulsar-tiered-storage.png
2 changes: 1 addition & 1 deletion site2/docs/concepts-tiered-storage.md
@@ -12,6 +12,6 @@ One way to alleviate this cost is to use Tiered Storage. With tiered storage, ol

> Data written to BookKeeper is replicated to 3 physical machines by default. However, once a segment is sealed in BookKeeper it becomes immutable and can be copied to long term storage. Long term storage can achieve cost savings by using mechanisms such as [Reed-Solomon error correction](https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction) to require fewer physical copies of data.
Pulsar currently supports S3 and Google Cloud Storage (GCS) for [long term store](https://pulsar.apache.org/docs/en/cookbooks-tiered-storage/). Offloading to long term storage triggered via a Rest API or command line interface. The user passes in the amount of topic data they wish to retain on BookKeeper, and the broker will copy the backlog data to long term storage. The original data will then be deleted from BookKeeper after a configured delay (4 hours by default).
Pulsar currently supports S3, Google Cloud Storage (GCS), and filesystem for [long term store](https://pulsar.apache.org/docs/en/cookbooks-tiered-storage/). Offloading to long term storage is triggered via a REST API or command line interface. The user passes in the amount of topic data they wish to retain on BookKeeper, and the broker will copy the backlog data to long term storage. The original data will then be deleted from BookKeeper after a configured delay (4 hours by default).
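
For example, an offload that keeps only the most recent 10M of backlog on BookKeeper can be triggered from the command line; a minimal sketch, assuming a hypothetical topic named `persistent://my-tenant/my-namespace/topic1`:

```bash
# Offload everything except the newest 10M of backlog to long term storage
bin/pulsar-admin topics offload --size-threshold 10M persistent://my-tenant/my-namespace/topic1
```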

> For a guide for setting up tiered storage, see the [Tiered storage cookbook](cookbooks-tiered-storage.md).
63 changes: 62 additions & 1 deletion site2/docs/cookbooks-tiered-storage.md
@@ -6,11 +6,14 @@ sidebar_label: Tiered Storage

Pulsar's **Tiered Storage** feature allows older backlog data to be offloaded to long term storage, thereby freeing up space in BookKeeper and reducing storage costs. This cookbook walks you through using tiered storage in your Pulsar cluster.

Tiered storage currently uses [Apache Jclouds](https://jclouds.apache.org) to supports
* Tiered storage uses [Apache jclouds](https://jclouds.apache.org) to support
[Amazon S3](https://aws.amazon.com/s3/) and [Google Cloud Storage](https://cloud.google.com/storage/) (GCS for short)
for long term storage. With jclouds, it is easy to add support for more
[cloud storage providers](https://jclouds.apache.org/reference/providers/#blobstore-providers) in the future.

* Tiered storage uses [Apache Hadoop](http://hadoop.apache.org/) to support filesystem storage for long term storage.
With Hadoop, it is easy to add support for more filesystems in the future.

## When should I use Tiered Storage?

Tiered storage should be used when you have a topic for which you want to keep a very long backlog for a long time. For example, if you have a topic containing user actions which you use to train your recommendation systems, you may want to keep that data for a long time, so that if you change your recommendation algorithm you can rerun it against your full user history.
@@ -41,6 +44,7 @@ Currently we support these driver types:

- `aws-s3`: [Simple Cloud Storage Service](https://aws.amazon.com/s3/)
- `google-cloud-storage`: [Google Cloud Storage](https://cloud.google.com/storage/)
- `filesystem`: [Filesystem Storage](http://hadoop.apache.org/)
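
The driver is selected with the `managedLedgerOffloadDriver` property in `broker.conf`; a minimal sketch:

```conf
# Select the offload driver (driver names are case-insensitive)
managedLedgerOffloadDriver=aws-s3
```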

> Driver names are case-insensitive. There is a third driver type, `s3`, which is identical to `aws-s3`,
> though it requires that you specify an endpoint URL using `s3ManagedLedgerOffloadServiceEndpoint`. This is useful if
@@ -186,6 +190,63 @@ Pulsar also provides some knobs to configure the size of requests sent to GCS.

In both cases, these should not be touched unless you know what you are doing.

### "filesystem" Driver configuration


#### Configure connection address

You can configure the connection address in the `broker.conf` file.

```conf
fileSystemURI="hdfs://127.0.0.1:9000"
```

#### Configure Hadoop profile path

The Hadoop profile path points to the configuration file the offloader uses. The file contains various settings, such as the base path, authentication, and so on.

```conf
fileSystemProfilePath="../conf/filesystem_offload_core_site.xml"
```

Topic data is stored using `org.apache.hadoop.io.MapFile`. You can use any of the Hadoop configuration options that apply to `org.apache.hadoop.io.MapFile`.

**Example**

```xml
<property>
  <name>fs.defaultFS</name>
  <value></value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>pulsar</value>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
</property>
<property>
  <name>io.seqfile.compress.blocksize</name>
  <value>1000000</value>
</property>
<property>
  <name>io.seqfile.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>io.map.index.interval</name>
  <value>128</value>
</property>
```

For more information about the configurations in `org.apache.hadoop.io.MapFile`, see [Filesystem Storage](http://hadoop.apache.org/).
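
Putting the pieces together, a minimal sketch of the filesystem offloader settings in `broker.conf`, reusing the example values from above:

```conf
# Select the filesystem offloader
managedLedgerOffloadDriver=filesystem
# Connection address and Hadoop profile path, as configured above
fileSystemURI="hdfs://127.0.0.1:9000"
fileSystemProfilePath="../conf/filesystem_offload_core_site.xml"
```
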
## Configuring offload to run automatically

Namespace policies can be configured to offload data automatically once a threshold is reached. The threshold is based on the size of data that the topic has stored on the Pulsar cluster. Once the topic reaches the threshold, an offload operation will be triggered. Setting a negative value for the threshold disables automatic offloading. Setting the threshold to 0 causes the broker to offload data as soon as it possibly can.
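
For example, a 10G threshold can be set from the command line; a minimal sketch, assuming a namespace named `my-tenant/my-namespace`:

```bash
# Automatically offload data once topics in this namespace exceed 10GB on the Pulsar cluster
bin/pulsar-admin namespaces set-offload-threshold --size 10G my-tenant/my-namespace
```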
