This guide explains how to manage the search index for the Couchbase documentation site. It covers where the search index lives and how to update it. The procedure described here is the one used by the CI job defined by the Jenkinsfile in this directory, so this document is useful both for understanding how the CI job works and for performing the update manually, if necessary.
The search index for the documentation is hosted by Algolia. The index, named prod_docs_couchbase, is stored in the Couchbase Algolia account. The index is populated by the docsearch scraper (aka crawler).
The sections below document the prerequisites for running the docsearch scraper and how to run the docsearch scraper to update the index.
- git (to clone the docsearch-scraper repository)
- pipenv (to manage a local Python installation and packages)
- Chrome/Chromium and chromedriver (or Docker)
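You can quickly confirm the local tools are available by checking their versions (a minimal sketch; chromedriver is covered separately below and is only needed if you won’t use the Docker image):
$ git --version
$ pipenv --version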
To begin, clone the https://github.com/algolia/docsearch-scraper repository using git.
$ git clone https://github.com/algolia/docsearch-scraper && cd "`basename $_`"
Next, create an .env file in the cloned repository to define the application ID (APPLICATION_ID) and the write API key (API_KEY).
To protect the API key, only the final four characters of the API key are shown here.
APPLICATION_ID=NI1G57N08Q
API_KEY=****************************67dd
Important: The API key used in this file is different from the one used for searching. In the Algolia dashboard, it’s labeled as the Write API Key.
The next step is to set up the Python environment and install the required packages.
$ pipenv install && pipenv shell
If you don’t plan to use the Docker image, you’ll need to install both Chrome (or Chromium) and chromedriver. Run the following command to confirm that chromedriver is installed:
$ chromedriver --version
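If chromedriver isn’t available yet, it can usually be installed with your platform’s package manager. The package names below are assumptions and vary by platform, for example:
$ brew install --cask chromedriver            # macOS (Homebrew)
$ sudo apt install chromium chromium-driver   # Debian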
Finally, you’ll need the docsearch configuration file. This configuration file is located in the playbook repository for the Couchbase documentation. Download the file from https://github.com/couchbase/docs-site/raw/master/docsearch/docsearch-config.json and save it to the cloned repository.
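For example, one way to download it is with curl:
$ curl -sSLo docsearch-config.json https://github.com/couchbase/docs-site/raw/master/docsearch/docsearch-config.json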
You’re now ready to run the scraper.
There are three ways to run the scraper:
- docsearch run (uses local packages and chromedriver)
- docsearch docker:run (uses local packages and the provided Docker image)
- docker run (uses the provided Docker image)
Warning: Rebuilding the index takes about 30 minutes because it has to visit every page in the site.
To update the index, pass the config file to the docsearch run command:
$ ./docsearch run docsearch-config.json
If that succeeds, skip to [Activate Index].
If that command fails, you may need to run it in the provided Docker container.
First, make sure you have Docker running on your machine and that you can list images.
$ docker images
Then, run the docsearch command again, but use the Docker container instead:
$ ./docsearch docker:run docsearch-config.json
The search index is now updated.
Using Docker, it’s possible to bypass the use of pipenv by invoking docker run directly.
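docker run pulls the algolia/docsearch-scraper image automatically if it isn’t present locally, but you can also fetch it ahead of time:
$ docker pull algolia/docsearch-scraper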
First, create a script named scrape with the following contents:
#!/usr/bin/bash
# Load APPLICATION_ID and API_KEY from the .env file
source .env
# Pass the credentials and the config file contents to the container, then run the scraper
docker run \
  -e APPLICATION_ID=$APPLICATION_ID \
  -e API_KEY=$API_KEY \
  -e CONFIG="`cat ${1:-config.json}`" \
  -t --rm algolia/docsearch-scraper \
  /root/run
Then, make it executable:
$ chmod 755 scrape
Finally, run it, passing the configuration file as the first argument:
$ ./scrape docsearch-config.json
The search index is now updated.
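As an optional sanity check, you can query the index through Algolia’s REST search API. This is a minimal sketch, not part of the official procedure; SEARCH_API_KEY is a placeholder for a key with search rights (not the write key used above):
$ curl -s https://NI1G57N08Q-dsn.algolia.net/1/indexes/prod_docs_couchbase/query \
    -H "X-Algolia-Application-Id: NI1G57N08Q" \
    -H "X-Algolia-API-Key: $SEARCH_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"params":"query=couchbase&hitsPerPage=1"}'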