Duro creates materialized views in Redshift using specified intervals and tracking their interdependencies.
“Duro” is a spell Hermione uses to turn two Death Eaters to stone.
Other features:
- web UI;
- tests;
- Python post-processing.
We have a wide table with lots of data about cities in first.cities_raw
, and we would like to have a smaller and more usable city-country mapping in first.cities
:
select city, country
from first.cities_raw
first:
- cities — 24h.sql
Alternatively, you can put this file on a higher level in directory hierarchy:
views:
- first.cities — 24h.sql
Now we’ll be recreating this view every 24 hours.
Let’s add a simple test and save to cities_test.sql
:
select (city = 'Paris') as correct_capital_of_france
from first.cities
where country = 'France';
select (city = 'Ottawa') as correct_capital_of_canada
from first.cities
where country = 'Canada';
If any of these values is not true
, we don’t replace older version of first.cities
with the new one.
Now let’s also add a config for this table in the same folder:
first:
- cities — 24h.sql
- cities_test.sql
- cities.conf
grant_select=jane, john
diststyle=even
distkey=city
sortkey=city
This will specify distkey, sortkey, and diststyle for our table, and run grant select on first.cities to jane, john;
.
You can see these examples in tests/views
folder.
Obviously, for this to work, someone has to set up everything correctly (and, honestly, this setting up part could’ve been better).
Duro has three main parts: scheduler, creator, and web interface server.
Scheduler goes through your directory, finds tables’ definitions and configurations, detects dependencies, and saves all this information to a local SQLite database. You probably want to set scheduler to run every time there is an update in your Git repository with views’ definitions.
Web interface shows currently and recently run jobs, statistics and logs for each table, and allows you to force updates. You can probably run it using gunicorn and reverse proxy server or whatever setup you prefer—it’s a pretty simple Flask app.
Creator queries this database for tables that should be updated (either because they haven’t been recreated for longer than their interval requires or because their update has been forced via web UI), and updates both their dependencies (first and if necessary) and these tables. You probably want to have this process continuously running (e.g., via supervisor).
What do we do after deciding that this particular table should be updated? For views without Python processing:
- Read query and config for this table from the database.
- Create
schema.table_duro_temp
usingcreate table as
query. - Run tests on
schema.table_duro_temp
(if there are any). Drop temporary table if tests fail, send notification about this failure. - If any (non-materialized) views depend on
schema.table
, update their definitions to useschema.table_duro_temp
. - Rename
schema.table
toschema.table_duro_old
. - Rename
schema.table_duro_temp
toschema.table
. - Drop
schema.table_duro_old
. - If necessary, create a snapshot of
schema.table
.
For tables with Python processing:
- Read query and config for this table from the database.
- Create virtual environment at
venvs/schema.table
. - Install requirements from
schema/table_requirements.txt
. - Select data using query in
schema/table_select.sql
; export this data to CSV. - Run
schema/table.py
in that virtual environment; this script should read CSV from file in itsargv[1]
and write to CSV inargv[2]
. - Upload resultant CSV to S3.
- Create
schema.table_duro_temp
using DDL inschema/table — {interval}.sql
. copy
data from CSV in S3 toschema.table_duro_temp
.- Run steps 3–8 from sequence for views without Python processors.
We read credentials for Redshift, S3, and Slack from duro/config.conf
. You can find a template for this file in duro/config.conf_example
.
We recommend creating a dedicated Redshift user. You will probably find it simpler to just give this user superuser permissions.
Yes, materialized views were (re)introduced in November 2019, but at the moment (January 2020) they’re still in preview stage and have quite a lot of limitations, most important of which is their inability to depend on other materialized views.
It would.