
DataHub Quickstart Guide

Export Environment Variables

export UAT_REDSHIFT_DB_HOST_PORT=<UAT-HOST-PORT>
export UAT_REDSHIFT_DB_USER=<UAT-USER>
export UAT_REDSHIFT_DB_PASS=<UAT-PASS>

export PROD_REDSHIFT_DB_HOST_PORT=<PROD-HOST-PORT>
export PROD_REDSHIFT_DB_USER=<PROD-USER>
export PROD_REDSHIFT_DB_PASS=<PROD-PASS>
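Because a missing or empty variable will only surface later as a connection error, a quick sanity check can help. This is a bash sketch (variable names as above; `${!v}` indirect expansion requires bash rather than plain sh) that warns about anything not set:

```shell
# Warn about any expected variable that is missing from the current shell
for v in UAT_REDSHIFT_DB_HOST_PORT UAT_REDSHIFT_DB_USER UAT_REDSHIFT_DB_PASS \
         PROD_REDSHIFT_DB_HOST_PORT PROD_REDSHIFT_DB_USER PROD_REDSHIFT_DB_PASS; do
    [ -n "${!v}" ] || echo "warning: $v is not set"
done
```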

Install the Redshift plugins

pip install 'acryl-datahub[redshift]'
pip install 'acryl-datahub[redshift-usage]'

Deploying DataHub

To deploy a new instance of DataHub, perform the following steps.

  1. Install Docker, jq, and Docker Compose (Compose must be installed separately on Linux; Docker Desktop already bundles it). Make sure to allocate enough hardware resources to the Docker engine. Tested and confirmed configuration: 2 CPUs, 8 GB RAM, 2 GB swap, and 10 GB of disk space.

  2. Launch the Docker Engine from the command line or the desktop app.

  3. Install the DataHub CLI

    a. Ensure you have Python 3.6+ installed and configured (check with python3 --version).

    b. Run the following commands in your terminal

    python3 -m pip install --upgrade pip wheel setuptools
    python3 -m pip uninstall datahub acryl-datahub || true  # sanity check - ok if it fails
    python3 -m pip install --upgrade acryl-datahub
    datahub version
    

    If you see "command not found", try running CLI commands with the python3 -m prefix instead: python3 -m datahub version

  4. To deploy DataHub, run the following CLI command from your terminal

    datahub docker quickstart
    

    Upon completion of this step, you should be able to navigate to the DataHub UI at http://localhost:9002 in your browser. You can sign in using datahub as both the username and password.

  5. To ingest the sample metadata, run the following CLI command from your terminal

    datahub docker ingest-sample-data
    

That's it! To start pushing your company's metadata into DataHub, take a look at the Metadata Ingestion Framework.
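With the Redshift plugin installed and the environment variables above exported, ingesting your own metadata comes down to writing a recipe. Here is a minimal sketch for the UAT source; the filename, database name, and sink address are assumptions for a local quickstart deployment, and DataHub expands ${VAR} references from the environment:

```yaml
# uat_redshift_recipe.yml -- hypothetical filename
source:
  type: redshift
  config:
    host_port: ${UAT_REDSHIFT_DB_HOST_PORT}
    database: dev                      # assumption: replace with your database name
    username: ${UAT_REDSHIFT_DB_USER}
    password: ${UAT_REDSHIFT_DB_PASS}

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080     # default GMS address for the quickstart deployment
```

You can then run it with datahub ingest run -c uat_redshift_recipe.yml, adding -n first to verify the source connection with a dry run.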

Resetting DataHub

To cleanse DataHub of all of its state (e.g. before ingesting your own metadata), you can use the CLI nuke command.

datahub docker nuke

If you want to delete the containers but keep the data, add the --keep-data flag to the command. You can then run the quickstart command again and DataHub will come back up with the data you ingested earlier.

Troubleshooting

Command not found: datahub

If running the datahub CLI produces "command not found" errors in your terminal, your system may be defaulting to an older version of Python. Try prefixing your datahub commands with python3 -m:

python3 -m datahub docker quickstart

Another possibility is that your system PATH does not include pip's $HOME/.local/bin directory. On Linux, you can add this to your ~/.bashrc:

if [ -d "$HOME/.local/bin" ] ; then
    PATH="$HOME/.local/bin:$PATH"
fi

Miscellaneous Docker issues

Miscellaneous Docker issues, such as conflicting containers and dangling volumes, can often be resolved by pruning your Docker state with the following command. Note that this removes all stopped containers, unused networks, and dangling images; pass --volumes to remove unused volumes as well.

docker system prune

Usage

$ datahub ingest run --help
Usage: datahub ingest run [OPTIONS]

  Ingest metadata into DataHub.

Options:
  -c, --config FILE               Config file in .toml or .yaml format.  [required]
  -n, --dry-run                   Perform a dry run of the ingestion, essentially skipping writing to sink.
  --preview                       Perform limited ingestion from the source to the sink to get a quick preview.
  --strict-warnings / --no-strict-warnings
                                  If enabled, ingestion runs with warnings will yield a non-zero error code
  --help                          Show this message and exit.
