The data-validation toolkit for enhanced dbt (data build tool) PR review

DataRecce/recce

Recce: DataRecce.io

The data validation toolkit
for teams that care about building better data

Introduction

Recce is a data validation toolkit for pull request (PR) review in dbt projects. Get enhanced visibility into how your team’s dbt modeling changes impact data by comparing your dev branch with stable production data. Run manual data checks during development, and automate checks in CI for PR review.

Quick Start

Get up and running quickly by prepping your dev and prod environments. The key is building prod into the target-base folder to use as the base for the data comparison.

# Build prod and generate dbt docs into ./target-base
dbt seed --target prod
dbt run --target prod
dbt docs generate --target prod --target-path ./target-base

# Switch to your dev branch
git switch my-awesome-branch

# Build your dev environment
dbt seed
dbt run
dbt docs generate

# Start a Recce Instance
recce server
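
Before starting the server, it can help to confirm that both sets of dbt artifacts exist. The sketch below is a hypothetical helper, not part of Recce. It assumes the default target folder for dev, the target-base folder used in the commands above, and that each folder should hold the manifest.json and catalog.json files that dbt docs generate writes.

```python
from pathlib import Path

# Assumption: each environment needs both files written by `dbt docs generate`.
REQUIRED_FILES = ("manifest.json", "catalog.json")


def missing_artifacts(project_dir, folders=("target-base", "target")):
    """Return artifact paths (relative to project_dir) that do not exist yet."""
    root = Path(project_dir)
    missing = []
    for folder in folders:
        for name in REQUIRED_FILES:
            if not (root / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing
```

Calling missing_artifacts(".") from the dbt project root returns an empty list once both the prod and dev steps above have completed.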

Follow our 5-minute Jaffle Shop tutorial to try it out for yourself.

What you get

recce server launches a web UI that shows you the area of your lineage that is impacted by the branch changes.

Using Recce for Impact Assessment in dbt PR Review

  • Select nodes in the lineage to perform Checks (diffs) as part of your impact assessment during development or PR review.
  • Add Checks to your Checklist to note observed impacts.
  • Share your Checklist with the PR reviewer.
  • (Recce Cloud) Automatically sync Check status between Recce Instances.
  • (Recce Cloud) Block PR merging until all Recce Checks have been approved.

Read more about using Recce for Impact Assessment on the Recce blog.

Try the Online Demo

We provide three online Recce demos (based on Jaffle Shop), each related to a specific pull request. Use these demos to inspect the data impact caused by the modeling changes in the PR.

For each demo, review the following:

  • The pull request comment
  • The code changes
  • How the lineage and data have changed in Recce

This lets you validate whether the PR achieves its intent without unintended impact.

Tip

Don't forget to click the Checks tab to view the Recce Checklist, and perform your own Checks for further investigation.

Demo 1: Calculation logic change

This pull request adjusts the logic for how customer lifetime value is calculated.

Demo 2: Refactoring

This pull request refactors the customers model by turning two CTEs into intermediate models, improving readability and maintainability.

Demo 3: Analysis

This pull request introduces a new Rounding Effect Analysis feature, aimed at analyzing and reporting the impacts of rounding in our data processing.

Why Recce

dbt has brought many software best practices to data projects, such as:

  • Version controlled code
  • Modular SQL
  • Reproducible pipelines

Even so, 'bad merges' still happen, and erroneous data and silent errors make their way into prod. As self-serve analytics opens dbt projects to many roles, and the size of dbt projects increases, the job of reviewing data modeling changes becomes even more critical.

The only way to understand the impact of code changes on data is to compare the data before-and-after the changes.

Features

Recce provides a data review environment for data teams to check their work during development, and then again as part of PR review. The suite of tools and diffs in Recce is specifically geared towards surfacing, understanding, and recording data impact from code changes.

Lineage Diff

Lineage Diff is the main interface to Recce and shows which nodes in the lineage have been added, removed, or modified.
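
As a rough illustration of the added, removed, and modified classification, the sketch below compares two mappings from node id to a checksum of that node's code. The function name and the input shape are assumptions made for this example. It is not Recce's implementation, and the node ids are only modeled on Jaffle Shop.

```python
def lineage_diff(base_nodes, current_nodes):
    """Classify node ids as added, removed, or modified.

    Both arguments map a node id to a checksum of that node's code.
    """
    base_ids, current_ids = set(base_nodes), set(current_nodes)
    return {
        # Present only in the current (dev) environment.
        "added": sorted(current_ids - base_ids),
        # Present only in the base (prod) environment.
        "removed": sorted(base_ids - current_ids),
        # Present in both, but the code checksum differs.
        "modified": sorted(
            node_id
            for node_id in base_ids & current_ids
            if base_nodes[node_id] != current_nodes[node_id]
        ),
    }


# Hypothetical checksums for illustration only.
base = {"model.jaffle_shop.customers": "a1", "model.jaffle_shop.orders": "b2"}
current = {"model.jaffle_shop.customers": "a9", "model.jaffle_shop.stg_payments": "c3"}
result = lineage_diff(base, current)
```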

Structural Diffs

  • Schema Diff: Shows the structure of the table, including added or removed columns.
  • Row Count Diff: Compares the row count for tables.
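
The two structural diffs can be pictured with a minimal sketch. The function names and return shapes below are hypothetical and chosen for this example. Recce computes these diffs against your warehouse, and this code only mirrors the idea on in-memory values.

```python
def schema_diff(base_columns, current_columns):
    """Compare two lists of column names and report added and removed columns."""
    added = [c for c in current_columns if c not in base_columns]
    removed = [c for c in base_columns if c not in current_columns]
    return {"added": added, "removed": removed}


def row_count_diff(base_count, current_count):
    """Report both row counts and the difference between them."""
    return {
        "base": base_count,
        "current": current_count,
        "delta": current_count - base_count,
    }


columns = schema_diff(
    ["customer_id", "first_name", "lifetime_value"],
    ["customer_id", "first_name", "net_lifetime_value"],
)
rows = row_count_diff(1000, 980)
```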

Statistical Diffs

Statistical Diffs provide high-level statistics about data changes:

  • Profile Diff: Compares stats such as count, distinct count, min, max, average.
  • Value Diff: Shows the matched count and percentage for each column in the table.
  • Top-K Diff: Compares the distribution of a categorical column.
  • Histogram Diff: Compares the distribution of a numeric column in an overlay histogram chart.
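
To make two of these diffs concrete, here is a small sketch of a value diff and a Top-K diff over in-memory rows. It assumes rows are matched on a primary key and that the matched percentage is taken over keys present in both environments. Recce's exact definitions may differ, and all names here are hypothetical.

```python
from collections import Counter


def value_diff(base_rows, current_rows, key):
    """For each non-key column, count rows (matched on `key`) whose values agree."""
    base_by_key = {row[key]: row for row in base_rows}
    current_by_key = {row[key]: row for row in current_rows}
    # Assumption: only keys present in both environments are compared.
    common = base_by_key.keys() & current_by_key.keys()
    columns = [c for c in base_rows[0] if c != key] if base_rows else []
    result = {}
    for column in columns:
        matched = sum(
            1
            for k in common
            if base_by_key[k].get(column) == current_by_key[k].get(column)
        )
        pct = 100.0 * matched / len(common) if common else None
        result[column] = {"matched": matched, "matched_pct": pct}
    return result


def top_k_diff(base_values, current_values, k=3):
    """Return (base_count, current_count) for the k most frequent categories."""
    base_counts, current_counts = Counter(base_values), Counter(current_values)
    # Rank categories by their combined frequency across both environments.
    combined = base_counts + current_counts
    top = [category for category, _ in combined.most_common(k)]
    return {c: (base_counts[c], current_counts[c]) for c in top}
```

A column with a matched percentage below 100 is a natural candidate for a closer look with a query on the affected keys.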

Query Diff

Query Diff compares the results of any ad-hoc query, and supports the use of dbt macros.
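
The idea behind a query diff can be sketched by running the same SQL against two environments and comparing the results. The example below uses two in-memory SQLite databases as stand-ins for base and current. It is an illustration under that assumption, does not cover dbt macros, and is not how Recce connects to a warehouse.

```python
import sqlite3


def query_diff(base_conn, current_conn, sql):
    """Run the same SQL in both environments and report rows unique to each side."""
    # Sets ignore duplicate rows, which is a simplification for this sketch.
    base_rows = set(base_conn.execute(sql).fetchall())
    current_rows = set(current_conn.execute(sql).fetchall())
    return {
        "only_in_base": sorted(base_rows - current_rows),
        "only_in_current": sorted(current_rows - base_rows),
    }


def make_env(rows):
    """Create a throwaway database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


base = make_env([(1, 10), (2, 20)])
current = make_env([(1, 10), (2, 25)])
diff = query_diff(base, current, "SELECT id, amount FROM orders")
```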

Checklist

The checklist provides a way to record the results of your data validation process.

  • Save the results of checks
  • Re-run checks
  • Annotate checks to add context
  • Share the results of checks
  • (Recce Cloud) Sync checks and check results across Recce instances
  • (Recce Cloud) Block PR merging until checks have been approved
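
One way to picture a checklist is as a list of checks, each carrying a saved result, an annotation, and an approval flag. The class and field names below are hypothetical and do not reflect Recce's state file format. The merge gate shown is only a sketch of the "block until approved" rule.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """A single recorded check (hypothetical shape for illustration)."""
    name: str
    result: dict = field(default_factory=dict)  # saved output of the diff
    note: str = ""                              # annotation that adds context
    approved: bool = False


def ready_to_merge(checks):
    """Return True only when every check has been approved.

    Note: an empty checklist passes, because there is nothing left to approve.
    """
    return all(check.approved for check in checks)


checklist = [
    Check("Row count: customers", result={"delta": 0}, approved=True),
    Check("Schema: orders", note="New column expected"),
]
```

Here ready_to_merge(checklist) stays False until the second check is approved.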

Who's using Recce?

Recce is useful for validating your own work or the work of others, and can also be used to share data impact with non-technical stakeholders to approve data checks.

  • Data engineers can use Recce to ensure the structural integrity of the data and understand the scope of impact before merging.
  • Analysts can use Recce to self-review and understand how data modeling changes have changed the data.
  • Stakeholders can use Recce to sign off on data after updates have been made.

Documentation / How to use Recce

The Recce Documentation covers everything you need to get started.

We’d advise first following the 5-minute tutorial that uses Jaffle Shop and then trying out Recce in your own project.

For advice on best practices in preparing dbt environments to enable effective PR review, check out Best Practices for Preparing Environments.

Recce Cloud

Recce Cloud provides a backbone of supporting services that make Recce usage more suitable for teams reviewing multiple pull requests.

With Recce Cloud:

  • Recce Instances can be launched directly from a PR
  • Checks are automatically synced across Recce Instances
  • PR merging can be blocked until all checks are approved

Recce Cloud is currently in early-access private beta.

To find out how you can get access, please book an appointment for a short meeting.

Book us with [Cal.com](http://cal.com/)

Data Security

Recce consists of a local server application that you run on your own device or compute services.

  • Diffs or queries that are performed by Recce happen either in your data warehouse, or in the browser itself.
  • Recce does not store your data.

For Recce Cloud users:

  • An encrypted version of your Recce state file is stored on Recce Cloud. This file is encrypted before transmission.

Community & Support

Here's where you can get in touch with the Recce team and find support:

If you believe you have found a bug, or there is some missing functionality in Recce, please open a GitHub Issue.

Recce on the web

You can follow along with news about Recce and blogs from our team in the following places: