Merge pull request apache#436 from r0ann3l/NUTCH-2684

NUTCH-2684 README.md file for index writer plugins.
petapro · Feb 22, 2019 · 78af89f
2 parents e95c915 + e27eb65
commit 78af89f

Showing 7 changed files with 273 additions and 27 deletions.
diff --git a/src/plugin/indexer-cloudsearch/README.md b/src/plugin/indexer-cloudsearch/README.md
@@ -3,56 +3,56 @@ AWS CloudSearch plugin for Nutch
 
 See [http://aws.amazon.com/cloudsearch/] for information on AWS CloudSearch.
 
-Steps to use :
+The **indexer-cloudsearch plugin** sends documents from one or more segments to Amazon CloudSearch. Index writers are configured in the **conf/index-writers.xml** file, included in the official Nutch distribution, as follows:
 
-From runtime/local/bin
+```xml
+<writer id="<writer_id>" class="org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter">
+  <mapping>
+    ...
+  </mapping>
+  <parameters>
+    ...
+  </parameters>
+</writer>
+```
 
-* Configure the AWS credentials 
+Each `<writer>` element has two mandatory attributes:
 
-Edit `~/.aws/credentials`, see [http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html] for details. Note that this should not be necessary when running Nutch on EC2.
+* `<writer_id>`, the value of the `id` attribute, is a unique identifier for each configuration. It lets Nutch distinguish configurations, even when they are for the same index writer, and so allows multiple instances of the same index writer with different configurations.
 
-* Edit ../conf/nutch-site.xml and check that 'plugin.includes' contains 'indexer-cloudsearch'. 
+* `org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter`, the value of the `class` attribute, is the canonical name of the class that implements the IndexWriter extension point. Do not change this value for the **indexer-cloudsearch plugin**.
 
-* (Optional) Test the indexing 
+## Mapping
 
-`./nutch indexchecker -D doIndex=true -D cloudsearch.batch.dump=true "http://nutch.apache.org/"`
+The mapping section is explained [here](https://wiki.apache.org/nutch/IndexWriters#Mapping_section). The structure of this section is the same for all index writers.
 
-if the agent name hasn't been configured in nutch-site.xml, it can be added on the command line with `-D http.agent.name=whateverValueDescribesYouBest`
+## Parameters
 
-you should see the fields extracted for the indexing coming up on the console.
+Each parameter has the form `<param name="<name>" value="<value>"/>` and the parameters for this index writer are:
 
-Using the `cloudsearch.batch.dump` parameter allows to dump the batch to the local temp dir. The files has the prefix "CloudSearch_" e.g. `/tmp/CloudSearch_4822180575734804454.json`. This temp file can be used as a template when defining the fields in the domain creation (see below).
+Parameter Name | Description | Default value
+--|--|--
+endpoint | Endpoint where service requests should be submitted. | 
+region | Region name. | 
+batch.dump | **true** to store the JSON representation of the documents in the local temp dir. The files have the prefix "CloudSearch_", e.g. `/tmp/CloudSearch_4822180575734804454.json`. This temp file can be used as a template when defining the fields during domain creation. | false
+batch.maxSize | Maximum number of documents to send as a batch to CloudSearch. | -1
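+
+As a minimal sketch, a `<parameters>` block for this writer could look like the following. The `endpoint` value is a placeholder (no default is given in the table above) and the `region` value is only an illustrative assumption; the other values are the defaults from the table.
+
+```xml
+<parameters>
+  <!-- Placeholder: replace with the document endpoint of your own domain. -->
+  <param name="endpoint" value="YOUR_DOCUMENT_ENDPOINT"/>
+  <!-- Illustrative region; use the region of your own domain. -->
+  <param name="region" value="eu-west-1"/>
+  <!-- Defaults taken from the table above. -->
+  <param name="batch.dump" value="false"/>
+  <param name="batch.maxSize" value="-1"/>
+</parameters>
+```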
 
-* Create a CloudSearch domain
+## Create a CloudSearch domain
 
 This can be done using the web console [https://eu-west-1.console.aws.amazon.com/cloudsearch/home?region=eu-west-1#]. You can use the temp file generated above to bootstrap the field definition. 
 
 You can also create the domain using the AWS CLI [http://docs.aws.amazon.com/cloudsearch/latest/developerguide/creating-domains.html] and the `createCSDomain.sh` example script provided. This script is merely a starting point which you should further improve and fine tune. 
 
 Note that the creation of the domain can take some time. Once it is complete, note the document endpoint, or alternatively verify the region and domain name.
 
-* Edit ../conf/nutch-site.xml and add `cloudsearch.endpoint` and `cloudsearch.region`. 
+> The CloudSearchIndexWriter will log any errors while sending the batches to CloudSearch and will resume the process without breaking. This means that you might not get all the documents in the index. You should check the log files for errors. Using small batch sizes will limit the number of documents skipped in case of error.
 
-* Re-test the indexing
+> Any fields not defined in the CloudSearch domain will be ignored by the CloudSearchIndexWriter. Again, the logs will contain a trace of any field names skipped.
 
-`./nutch indexchecker -D doIndex=true "http://nutch.apache.org/"`
 
-and check in the CloudSearch console that the document has been succesfully indexed.
 
-Additional parameters
 
-* `cloudsearch.batch.maxSize` \: can be used to limit the size of the batches sent to CloudSearch to N documents. Note that the default limitations still apply.
 
-* `cloudsearch.batch.dump` \: see above. Stores the JSON representation of the document batch in the local temp dir, useful for bootstrapping the index definition.
 
-Note
-
-The CloudSearchIndexWriter will log any errors while sending the batches to CloudSearch and will resume the process without breaking. This means that you might not get all the documents in the index. You should check the log files for errors. Using small batch sizes will limit the number of documents skipped in case of error.
-
-Any fields not defined in the CloudSearch domain will be ignored by the CloudSearchIndexWriter. Again, the logs will contain a trace of any field names skipped.
-
-
-
-
 
 
diff --git a/src/plugin/indexer-csv/README.md b/src/plugin/indexer-csv/README.md
@@ -0,0 +1,42 @@
+indexer-csv plugin for Nutch 
+============================
+
+The **indexer-csv plugin** writes documents to a CSV file. It does not work in distributed mode: the output is written to the local filesystem, not to HDFS (see [NUTCH-1541](https://issues.apache.org/jira/browse/NUTCH-1541)). Index writers are configured in the **conf/index-writers.xml** file, included in the official Nutch distribution, as follows:
+
+```xml
+<writer id="<writer_id>" class="org.apache.nutch.indexwriter.csv.CSVIndexWriter">
+  <mapping>
+    ...
+  </mapping>
+  <parameters>
+    ...
+  </parameters>   
+</writer>
+```
+
+Each `<writer>` element has two mandatory attributes:
+
+* `<writer_id>`, the value of the `id` attribute, is a unique identifier for each configuration. It lets Nutch distinguish configurations, even when they are for the same index writer, and so allows multiple instances of the same index writer with different configurations.
+
+* `org.apache.nutch.indexwriter.csv.CSVIndexWriter`, the value of the `class` attribute, is the canonical name of the class that implements the IndexWriter extension point. Do not change this value for the **indexer-csv plugin**.
+
+## Mapping
+
+The mapping section is explained [here](https://wiki.apache.org/nutch/IndexWriters#Mapping_section). The structure of this section is the same for all index writers.
+
+## Parameters
+
+Each parameter has the form `<param name="<name>" value="<value>"/>` and the parameters for this index writer are:
+
+Parameter Name | Description | Default value
+--|--|--
+fields | Ordered list of fields (columns) in the CSV file | id,title,content
+charset | Encoding of CSV file | UTF-8
+separator | Separator between fields (columns) | ,
+valuesep | Separator between multiple values of one field | \|
+quotechar | Quote character used to quote fields containing separators or quotes | &quot;
+escapechar | Escape character used to escape a quote character | &quot;
+maxfieldlength | Max. length of a single field value in characters | 4096
+maxfieldvalues | Max. number of values of one field, useful for, e.g., the anchor texts field | 12
+header | Write CSV column headers | true
+outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter
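+
+As a minimal sketch, a `<parameters>` block that spells out the defaults from the table above could look like the following. The quote character is written as `&quot;` because it sits inside an XML attribute value.
+
+```xml
+<parameters>
+  <!-- All values below are the defaults listed in the table above. -->
+  <param name="fields" value="id,title,content"/>
+  <param name="charset" value="UTF-8"/>
+  <param name="separator" value=","/>
+  <param name="valuesep" value="|"/>
+  <!-- A double quote must be escaped inside an attribute value. -->
+  <param name="quotechar" value="&quot;"/>
+  <param name="escapechar" value="&quot;"/>
+  <param name="maxfieldlength" value="4096"/>
+  <param name="maxfieldvalues" value="12"/>
+  <param name="header" value="true"/>
+  <param name="outpath" value="csvindexwriter"/>
+</parameters>
+```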
diff --git a/src/plugin/indexer-dummy/README.md b/src/plugin/indexer-dummy/README.md
@@ -0,0 +1,34 @@
+indexer-dummy plugin for Nutch 
+==============================
+
+The **indexer-dummy plugin** writes lines of the form `"action"\t"url"\n` to a plain text file for debugging purposes. It does not work in distributed mode: the output is written to the local filesystem, not to HDFS. Index writers are configured in the **conf/index-writers.xml** file, included in the official Nutch distribution, as follows:
+
+```xml
+<writer id="<writer_id>" class="org.apache.nutch.indexwriter.dummy.DummyIndexWriter">
+  <mapping>
+    ...
+  </mapping>
+  <parameters>
+    ...
+  </parameters>   
+</writer>
+```
+
+Each `<writer>` element has two mandatory attributes:
+
+* `<writer_id>`, the value of the `id` attribute, is a unique identifier for each configuration. It lets Nutch distinguish configurations, even when they are for the same index writer, and so allows multiple instances of the same index writer with different configurations.
+
+* `org.apache.nutch.indexwriter.dummy.DummyIndexWriter`, the value of the `class` attribute, is the canonical name of the class that implements the IndexWriter extension point. Do not change this value for the **indexer-dummy plugin**.
+
+## Mapping
+
+The mapping section is explained [here](https://wiki.apache.org/nutch/IndexWriters#Mapping_section). The structure of this section is the same for all index writers.
+
+## Parameters
+
+Each parameter has the form `<param name="<name>" value="<value>"/>` and the parameters for this index writer are:
+
+Parameter Name | Description | Default value
+--|--|--
+path | Path where the file will be created. | ./dummy-index.txt
+delete | Whether delete operations should be written to the file. | false
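+
+As a minimal sketch, a `<parameters>` block using the defaults from the table above could look like this:
+
+```xml
+<parameters>
+  <!-- Both values are the defaults listed in the table above. -->
+  <param name="path" value="./dummy-index.txt"/>
+  <param name="delete" value="false"/>
+</parameters>
+```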
diff --git a/src/plugin/indexer-elastic-rest/README.md b/src/plugin/indexer-elastic-rest/README.md
@@ -0,0 +1,45 @@
+indexer-elastic-rest plugin for Nutch 
+=====================================
+
+The **indexer-elastic-rest plugin** sends documents from one or more segments to Elasticsearch, using Jest to connect to the REST API provided by Elasticsearch. Index writers are configured in the **conf/index-writers.xml** file, included in the official Nutch distribution, as follows:
+
+```xml
+<writer id="<writer_id>" class="org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter">
+  <mapping>
+    ...
+  </mapping>
+  <parameters>
+    ...
+  </parameters>   
+</writer>
+```
+
+Each `<writer>` element has two mandatory attributes:
+
+* `<writer_id>`, the value of the `id` attribute, is a unique identifier for each configuration. It lets Nutch distinguish configurations, even when they are for the same index writer, and so allows multiple instances of the same index writer with different configurations.
+
+* `org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter`, the value of the `class` attribute, is the canonical name of the class that implements the IndexWriter extension point. Do not change this value for the **indexer-elastic-rest plugin**.
+
+## Mapping
+
+The mapping section is explained [here](https://wiki.apache.org/nutch/IndexWriters#Mapping_section). The structure of this section is the same for all index writers.
+
+## Parameters
+
+Each parameter has the form `<param name="<name>" value="<value>"/>` and the parameters for this index writer are:
+
+Parameter Name | Description | Default value
+--|--|--
+host | The hostname or a list of comma separated hostnames to send documents to using Elasticsearch Jest. Both host and port must be defined. |  
+port | The port to connect to using Elasticsearch Jest. | 9200
+index | Default index to send documents to. | nutch
+max.bulk.docs | Maximum size of the bulk in number of documents. | 250
+max.bulk.size | Maximum size of the bulk in bytes. | 2500500
+user | Username for auth credentials (only used when https is enabled) | user
+password | Password for auth credentials (only used when https is enabled) | password
+type | Default type to send documents to. | doc
+https | **true** to enable https, **false** to disable https. If you've disabled http access (by forcing https), be sure to set this to true, otherwise you might get "connection reset by peer". | false
+trustallhostnames | **true** to trust the Elasticsearch server's certificate even if its listed domain name does not match the domain it is hosted on; **false** to check that the certificate's listed domain matches the domain the server is hosted on and to fail to index if it does not (only used when https is enabled) | false
+languages | A list of strings denoting the supported languages (e.g. `en, de, fr, it`). If this value is empty, all documents are sent to the index given by the `index` property. If not empty, the REST client distributes documents to different indices based on their `languages` property. Indices are named with the following schema: `index separator language` (e.g. `nutch_de`). Entries with an unsupported `languages` value are added to the index `index separator sink` (e.g. `nutch_others`). | 
+separator | Used only if the `languages` property is defined, to build the index name (i.e. `index separator lang`). | _
+sink | Used only if the `languages` property is defined, to build the name of the index that stores documents with unsupported languages (i.e. `index separator sink`). | others
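+
+As a minimal sketch, a `<parameters>` block for this writer could look like the following. The `host` value `localhost` is an assumption for a local test setup (the table above gives no default); the remaining values are the defaults from the table.
+
+```xml
+<parameters>
+  <!-- Assumed host for a local test setup; no default is documented. -->
+  <param name="host" value="localhost"/>
+  <!-- Defaults taken from the table above. -->
+  <param name="port" value="9200"/>
+  <param name="index" value="nutch"/>
+  <param name="max.bulk.docs" value="250"/>
+  <param name="max.bulk.size" value="2500500"/>
+  <param name="user" value="user"/>
+  <param name="password" value="password"/>
+  <param name="type" value="doc"/>
+  <param name="https" value="false"/>
+  <param name="trustallhostnames" value="false"/>
+  <!-- Empty: all documents go to the index named by "index". -->
+  <param name="languages" value=""/>
+  <param name="separator" value="_"/>
+  <param name="sink" value="others"/>
+</parameters>
+```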
diff --git a/src/plugin/indexer-elastic/README.md b/src/plugin/indexer-elastic/README.md
@@ -0,0 +1,41 @@
+indexer-elastic plugin for Nutch 
+================================
+
+The **indexer-elastic plugin** sends documents from one or more segments to an Elasticsearch server. Index writers are configured in the **conf/index-writers.xml** file, included in the official Nutch distribution, as follows:
+
+```xml
+<writer id="<writer_id>" class="org.apache.nutch.indexwriter.elastic.ElasticIndexWriter">
+  <mapping>
+    ...
+  </mapping>
+  <parameters>
+    ...
+  </parameters>   
+</writer>
+```
+
+Each `<writer>` element has two mandatory attributes:
+
+* `<writer_id>`, the value of the `id` attribute, is a unique identifier for each configuration. It lets Nutch distinguish configurations, even when they are for the same index writer, and so allows multiple instances of the same index writer with different configurations.
+
+* `org.apache.nutch.indexwriter.elastic.ElasticIndexWriter`, the value of the `class` attribute, is the canonical name of the class that implements the IndexWriter extension point. Do not change this value for the **indexer-elastic plugin**.
+
+## Mapping
+
+The mapping section is explained [here](https://wiki.apache.org/nutch/IndexWriters#Mapping_section). The structure of this section is the same for all index writers.
+
+## Parameters
+
+Each parameter has the form `<param name="<name>" value="<value>"/>` and the parameters for this index writer are:
+
+Parameter Name | Description | Default value
+--|--|--
+host | Comma-separated list of hostnames to send documents to using [TransportClient](https://static.javadoc.io/org.elasticsearch/elasticsearch/5.3.0/org/elasticsearch/client/transport/TransportClient.html). Either `host` and `port`, or `cluster`, must be defined. | 
+port | The port to connect to using [TransportClient](https://static.javadoc.io/org.elasticsearch/elasticsearch/5.3.0/org/elasticsearch/client/transport/TransportClient.html). | 9300
+cluster | The cluster name to discover. Either `host` and `port`, or `cluster`, must be defined. | 
+index | Default index to send documents to. | nutch
+max.bulk.docs | Maximum size of the bulk in number of documents. | 250
+max.bulk.size | Maximum size of the bulk in bytes. | 2500500
+exponential.backoff.millis | Initial delay for the [BulkProcessor](https://static.javadoc.io/org.elasticsearch/elasticsearch/5.3.0/org/elasticsearch/action/bulk/BulkProcessor.html) exponential backoff policy. | 100
+exponential.backoff.retries | Number of times the [BulkProcessor](https://static.javadoc.io/org.elasticsearch/elasticsearch/5.3.0/org/elasticsearch/action/bulk/BulkProcessor.html) exponential backoff policy should retry bulk operations. | 10
+bulk.close.timeout | Number of seconds allowed for the [BulkProcessor](https://static.javadoc.io/org.elasticsearch/elasticsearch/5.3.0/org/elasticsearch/action/bulk/BulkProcessor.html) to complete its last operation. | 600
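+
+As a minimal sketch, a `<parameters>` block for this writer could look like the following. It assumes a connection through `host` and `port` with `cluster` left empty; the `host` value `localhost` is an assumption for a local test setup, and the remaining values are the defaults from the table above.
+
+```xml
+<parameters>
+  <!-- Assumed host for a local test setup; no default is documented. -->
+  <param name="host" value="localhost"/>
+  <param name="port" value="9300"/>
+  <!-- Left empty here because host and port are defined. -->
+  <param name="cluster" value=""/>
+  <!-- Defaults taken from the table above. -->
+  <param name="index" value="nutch"/>
+  <param name="max.bulk.docs" value="250"/>
+  <param name="max.bulk.size" value="2500500"/>
+  <param name="exponential.backoff.millis" value="100"/>
+  <param name="exponential.backoff.retries" value="10"/>
+  <param name="bulk.close.timeout" value="600"/>
+</parameters>
+```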
diff --git a/src/plugin/indexer-rabbit/README.md b/src/plugin/indexer-rabbit/README.md
@@ -0,0 +1,44 @@
+indexer-rabbit plugin for Nutch
+===============================
+
+The **indexer-rabbit plugin** sends documents from one or more segments to a RabbitMQ server. Index writers are configured in the **conf/index-writers.xml** file, included in the official Nutch distribution, as follows:
+
+```xml
+<writer id="<writer_id>" class="org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter">
+  <mapping>
+    ...
+  </mapping>
+  <parameters>
+    ...
+  </parameters>
+</writer>
+```
+
+Each `<writer>` element has two mandatory attributes:
+
+* `<writer_id>`, the value of the `id` attribute, is a unique identifier for each configuration. It lets Nutch distinguish configurations, even when they are for the same index writer, and so allows multiple instances of the same index writer with different configurations.
+
+* `org.apache.nutch.indexwriter.rabbit.RabbitIndexWriter`, the value of the `class` attribute, is the canonical name of the class that implements the IndexWriter extension point. Do not change this value for the **indexer-rabbit plugin**.
+
+## Mapping
+
+The mapping section is explained [here](https://wiki.apache.org/nutch/IndexWriters#Mapping_section). The structure of this section is the same for all index writers.
+
+## Parameters
+
+Each parameter has the form `<param name="<name>" value="<value>"/>` and the parameters for this index writer are:
+
+Parameter Name | Description | Default value
+--|--|--
+server.uri | URI with connection parameters in the form `amqp://<username>:<password>@<hostname>:<port>/<virtualHost>`<br>Where:<ul><li>`<username>` is the username for RabbitMQ server.</li><li>`<password>` is the password for RabbitMQ server.</li><li>`<hostname>` is where the RabbitMQ server is running.</li><li>`<port>` is where the RabbitMQ server is listening.</li><li>`<virtualHost>` is where the exchange is and the user has access.</li></ul> | amqp://guest:guest@localhost:5672/
+binding | Whether the relationship between an exchange and a queue is created automatically.<br>**NOTE:** Binding between exchanges is not supported. | false
+binding.arguments | Arguments used in binding. It must have the form `key1=value1,key2=value2`. This value is only used when the exchange's type is headers and the value of the `binding` property is **true**. Otherwise it is ignored. | 
+exchange.name | Name for the exchange where the messages will be sent. | 
+exchange.options | Options used when the exchange is created. Only used when the value of `binding` property is **true**. It must have the form `type=<type>,durable=<durable>`<br>Where:<ul><li>`<type>` is **direct**, **topic**, **headers** or **fanout**</li><li>`<durable>` is **true** or **false**</li></ul> | type=direct,durable=true
+queue.name | Name of the queue used to create the binding. Only used when the value of `binding` property is **true**. | nutch.queue
+queue.options | Options used when the queue is created. Only used when the value of `binding` property is **true**. It must have the form `durable=<durable>,exclusive=<exclusive>,auto-delete=<auto-delete>,arguments=<arguments>`<br>Where:<ul><li>`<durable>` is **true** or **false**</li><li>`<exclusive>` is **true** or **false**</li><li>`<auto-delete>` is **true** or **false**</li><li>`<arguments>` must have the form `key1:value1;key2:value2`</li></ul> | durable=true,exclusive=false,auto-delete=false
+routingkey | The routing key used to route messages in the exchange. It only makes sense when the exchange type is **topic** or **direct**. | Value of `queue.name` property
+commit.mode | **single** if a message contains only one document. In this case, a header with the action (write, update or delete) will be added. **multiple** if a message contains all documents. | multiple
+commit.size | Number of documents to send in each message if the value of `commit.mode` property is **multiple**. In **single** mode this value represents the number of messages to be sent. | 250
+headers.static | Headers to add to each message. It must have the form `key1=value1,key2=value2`. | 
+headers.dynamic | Document's fields to add as headers to each message. It must have the form `field1,field2`. Only used when the value of `commit.mode` property is **single**. |
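+
+As a minimal sketch, a `<parameters>` block that spells out the defaults from the table above could look like the following. Parameters without a documented default are left empty, and `routingkey` is set to the `queue.name` value, as the table describes.
+
+```xml
+<parameters>
+  <!-- Default URI from the table above: guest user on a local server. -->
+  <param name="server.uri" value="amqp://guest:guest@localhost:5672/"/>
+  <param name="binding" value="false"/>
+  <!-- No documented default; left empty in this sketch. -->
+  <param name="binding.arguments" value=""/>
+  <param name="exchange.name" value=""/>
+  <!-- The next three entries are only used when binding is true. -->
+  <param name="exchange.options" value="type=direct,durable=true"/>
+  <param name="queue.name" value="nutch.queue"/>
+  <param name="queue.options" value="durable=true,exclusive=false,auto-delete=false"/>
+  <!-- Defaults to the value of queue.name. -->
+  <param name="routingkey" value="nutch.queue"/>
+  <param name="commit.mode" value="multiple"/>
+  <param name="commit.size" value="250"/>
+  <!-- No documented default; left empty in this sketch. -->
+  <param name="headers.static" value=""/>
+  <param name="headers.dynamic" value=""/>
+</parameters>
+```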