The Airbyte S3 Pinecone RAG repo provides a quickstart template for building a full data stack with Airbyte Cloud, Terraform, and dbt. It moves data from S3 -> BigQuery -> Pinecone so that you can query the fetched data through an LLM, forming a complete Retrieval-Augmented Generation (RAG) pipeline.
This quickstart is designed to minimize setup hassles and propel you forward.
- Prerequisites
- Creating an Environment For Your Project
- Setting Up BigQuery
- Adding Configuration Values
- Setting Up Airbyte Connectors
- Sync S3 Data into BigQuery
- Setting Up the dbt Project
- Publishing Into Pinecone
- Asking Questions About Your Data
- Next Steps
The source data is fetched from S3 so that its content can be made searchable in Pinecone. Follow the S3 source docs for information on configuring an S3 source.
BigQuery will store the raw source data (CSV in this example) and also the transformed data from dbt. You'll need a BigQuery project and a dataset with a service account that can control the dataset. Airbyte's BigQuery destination docs list the requirements and link to instructions for configuring them.
Pinecone is the vector database we will use to index documents and their metadata, and also to find documents that provide context for a query. You'll need a Pinecone account, an API key, and an index created with 1536 dimensions, because the OpenAI embeddings used here are 1536-dimensional vectors. See the Pinecone docs for more information.
OpenAI is used both to process the query and to provide the LLM that generates a response. The query is vectorized so it can be used to identify relevant items in the Pinecone index, and these items are provided to the LLM as context to better respond to the query. You'll need an OpenAI account with credits and an API key. If you already have an account with OpenAI, you can generate a new API key by visiting this link: platform.openai.com/api-keys
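The retrieval step described above can be pictured with a toy example. The sketch below is pure Python and does not call Pinecone or OpenAI: the 3-dimensional vectors stand in for real 1536-dimensional embeddings, and the names `cosine_similarity`, `top_k`, and `build_prompt` are illustrative rather than taken from this repo.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


def build_prompt(question, context_docs):
    """Combine retrieved documents and the question into one LLM prompt."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy "index": document id -> embedding (hypothetical values).
toy_index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 0.0, 1.0],
}
toy_query = [1.0, 0.05, 0.0]

matches = top_k(toy_query, toy_index, k=2)
print(matches)
print(build_prompt("What did customers buy?", matches))
```

In the real pipeline the vector database performs the similarity search and the LLM receives the matched document text rather than ids; the sketch only shows the shape of the flow.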
-
Python 3.10 or later: If not installed, download and install it from Python's official website.
-
Docker and Docker Compose (Docker Desktop): Install Docker following the official documentation for your specific OS.
-
Airbyte OSS version: If you want to deploy the open-source version instead of using Airbyte Cloud, follow the installation instructions from the Deploy Airbyte documentation.
-
Terraform: Terraform will help you provision and manage the Airbyte resources. If you haven't installed it, follow the official Terraform installation guide.
Get the project up and running on your local machine by following these steps:
-
Clone the repository (Clone only this quickstart):
git clone --filter=blob:none --sparse https://github.com/airbytehq/quickstarts.git
cd quickstarts
git sparse-checkout add pyairbyte_s3_pinecone_rag
-
Navigate to the directory:
cd pyairbyte_s3_pinecone_rag
-
Set Up a Virtual Environment:
- For Mac:
python3 -m venv .venv
source .venv/bin/activate
- For Windows:
python -m venv .venv
.\.venv\Scripts\activate
-
Install Dependencies:
pip install -e ".[dev]"
- If you have a Google Cloud project, you can skip this step.
- Go to the Google Cloud Console.
- Click on the "Select a project" dropdown at the top of the page and select "New Project".
- Give your project a name and follow the steps to create it.
- In the Google Cloud Console, go to BigQuery.
- Make two new datasets: raw_data for Airbyte and transformed_data for dbt.
- If you pick different names, remember to change the names in the code too.
How to create a dataset:
- In the left sidebar, click on your project name.
- Click “Create Dataset”.
- Enter the dataset ID (either raw_data or transformed_data).
- Click "Create Dataset".
- Go to “IAM & Admin” > “Service accounts” in the Google Cloud Console.
- Click “Create Service Account”.
- Name your service account (like airbyte-service-account).
- Assign the “BigQuery Data Editor” and “BigQuery Job User” roles to the service account.
- Follow the same steps to make another service account for dbt (like dbt-service-account) and assign the roles.
How to create a service account and assign roles:
- While creating the service account, under the “Grant this service account access to project” section, click the “Role” dropdown.
- Choose the “BigQuery Data Editor” and “BigQuery Job User” roles.
- Finish the creation process.
- For both service accounts, make a JSON key to let the service accounts sign in.
How to generate JSON key:
- Find the service account in the “Service accounts” list.
- Click on the service account name.
- In the “Keys” section, click “Add Key” and pick JSON.
- The key will download automatically. Keep it safe and don’t share it.
- Do this for the other service account too.
The following steps will execute Terraform and dbt workflows to create the necessary resources for the integration. To do this, you'll need to provide some configuration values. Copy the provided .env.template file to .env and set its values. Then run the following command to source the environment variables into your shell so they are available when running Terraform and dbt:
set -o allexport && source .env && set +o allexport
Don't forget to re-run the above command after making any changes to the .env file.
Create the sources and destinations within your Airbyte environment. You can then follow the create connections docs to define connections between them and control how and when the data will sync.
You can find the Airbyte workspace ID in the URL, e.g. https://cloud.airbyte.com/workspaces/{workspace-id}/connections.
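If you prefer not to copy the workspace ID by hand, it can be pulled out of the URL programmatically. This is a minimal sketch that assumes the URL follows the `/workspaces/{workspace-id}/...` shape shown above; the function name `workspace_id_from_url` is made up for this example.

```python
from urllib.parse import urlparse


def workspace_id_from_url(url):
    """Return the path segment that follows 'workspaces', or None if absent."""
    segments = [segment for segment in urlparse(url).path.split("/") if segment]
    if "workspaces" in segments:
        position = segments.index("workspaces")
        if position + 1 < len(segments):
            return segments[position + 1]
    return None


# Placeholder ID, not a real workspace.
example_url = "https://cloud.airbyte.com/workspaces/abc-123/connections"
print(workspace_id_from_url(example_url))
```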
Airbyte allows you to create connectors for sources and destinations, facilitating data synchronization between various platforms. In this project, we use Terraform to automate the creation of these connectors and the connections between them. Here's how you can set it up:
-
Navigate to the Airbyte Configuration Directory:
Change to the relevant directory containing the Terraform configuration for Airbyte:
cd infra/airbyte
-
Modify Configuration Files:
Within the infra/airbyte directory, you'll find three crucial Terraform files:
- provider.tf: Defines the Airbyte provider.
- main.tf: Contains the main configuration for creating Airbyte resources.
- variables.tf: Defines variables whose values are populated from the .env file.
If you're using Airbyte Cloud instead of a local deployment, you will need to update the Airbyte provider configuration in infra/airbyte/provider.tf, setting bearer_auth to an API key generated at https://portal.airbyte.com/.
-
Initialize Terraform:
This step prepares Terraform to create the resources defined in your configuration files.
terraform init
-
Review the Plan:
Before applying any changes, review the plan to understand what Terraform will do.
terraform plan
-
Apply Configuration:
After reviewing and confirming the plan, apply the Terraform configurations to create the necessary Airbyte resources.
terraform apply
-
Verify in Airbyte UI:
Once Terraform completes its tasks, navigate to the Airbyte UI. Here, you should see your source and destination connectors, as well as the connection between them, set up and ready to go.
If you're using Airbyte Cloud, we configured Terraform to assemble the URL for you:
terraform output
will print something like:
airbyte_cloud_url = "https://cloud.airbyte.com/workspaces/{workspace-id}/connections"
Before building the dbt project, which transforms the raw S3 data, the source tables must exist in the BigQuery dataset. Open the Airbyte UI and navigate to the Connections page. Click the Sync Now button for S3 to BigQuery to start the sync.
Once the sync is complete you can inspect the tables in BigQuery.
dbt (data build tool) allows you to transform your data by writing, documenting, and executing SQL workflows. Setting up the dbt project requires specifying connection details for your data platform, in this case, BigQuery. Here’s a step-by-step guide to help you set this up:
-
Navigate to the dbt Project Directory:
Change to the directory containing the dbt configuration:
cd dbt_project
You'll find a profiles.yml file within the directory. This file contains configurations for dbt to connect with your data platform, and is preconfigured to pull the connection details from environment variables. You can use dbt_project/example/schema.yml as a starting point for defining your own dbt models, and create SQL files whose names correspond to the model names.
-
Utilize Environment Variables (Optional but Recommended):
To keep your credentials secure, you can leverage environment variables. An example is provided as the dev config in profiles.yml.
-
Test the Connection:
Once you’ve updated the connection details, you can test the connection to your BigQuery instance using:
dbt debug
If everything is set up correctly, this command should report a successful connection to BigQuery.
-
Run the Model
With the connection in place, you can now build the model to create the S3 view in BigQuery, which the configured Airbyte connection will use to sync S3 data into Pinecone. You can inspect the provided dbt_project/models/marts/purchase_data.sql to see how null values are removed after querying for purchase patterns.
dbt build
You should now see the s3_data view in BigQuery.
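To get a feel for the null-removal step mentioned above, here is a small pure-Python illustration of the same idea applied to CSV rows. It is not the SQL in purchase_data.sql, which is not reproduced here; the column names (`customer_id`, `product`, `amount`) and the sample rows are hypothetical.

```python
import csv
import io


def drop_rows_with_nulls(rows, required_columns):
    """Keep only rows where every required column has a non-empty value."""
    return [
        row for row in rows
        if all((row.get(column) or "").strip() for column in required_columns)
    ]


# Hypothetical purchase data; rows 2 and 3 each have a missing field.
SAMPLE_CSV = """customer_id,product,amount
1,widget,9.99
2,,4.50
3,gadget,
4,gizmo,12.00
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
clean_rows = drop_rows_with_nulls(rows, ["customer_id", "product", "amount"])
print([row["customer_id"] for row in clean_rows])
```

In a dbt model the same filtering would normally be expressed in SQL and run inside BigQuery; the Python version is only meant to make the transformation easy to reason about.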
With the source data transformed, it is now ready to publish into the Pinecone index. Head back to the Connections page and start a sync for Publish BigQuery Data to Pinecone.
After Airbyte has published the S3 data into your Pinecone index, you can interact with it through an LLM. After you set a few more environment variables, the provided query.py starts an interactive session for asking questions about your data.
export OPENAI_API_KEY=openai_api_key
export PINECONE_API_KEY=pinecone_api_key
export PINECONE_ENVIRONMENT=pinecone_environment
export PINECONE_INDEX=pinecone_index
python query.py
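A missing variable is an easy mistake to make before launching the session. The sketch below checks the four variables exported above. It is a standalone helper written for this guide, not part of query.py, and the name `missing_variables` is an assumption.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX",
)


def missing_variables(environ=None):
    """Return the required variable names that are unset or blank."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name, "").strip()]


missing = missing_variables()
print("Missing:", ", ".join(missing) if missing else "none")
```

Passing a plain dictionary instead of relying on `os.environ` makes the helper easy to try out without touching your shell.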
Once you've set up and launched this initial integration, you can adapt and extend it. Here are some ways to customize this project for your specific data needs:
-
Use Additional Data Sources:
Pinecone allows you to store vectors from multiple sources in a single index. This allows multiple document types to be queried together, and even referenced between themselves by the LLM. Add additional sources to the Terraform configuration or manually through the Airbyte UI.
-
Try Different LLM Chains:
The provided query.py uses many default values when creating the LLM chain, which is quick to create but limits flexibility. One constraint is that the stuff chain type requires the query and discovered documents to all fit within a single tokenized input to the LLM. This is fine for short queries and documents, but if you use longer sources you will need to use another chain type such as map_reduce.
-
Extend Access to the LLM:
The provided query.py is a simple example of how to interact with the LLM locally. You could build a web-based UI that triggers a query and displays the results, or even integrate the LLM into a Slack bot to provide answers to questions in real-time.
-
Make Use of Metadata
When configuring the Pinecone destination you can choose to include metadata fields from the data. This allows you to filter results based on the metadata fields. For example, you could filter the results of a query to only include documents from a specific author, or provide a time range that a matching item must have been created or edited within. Metadata values are also provided back to the query executor so you can use them to provide additional context to the LLM.
See Filtering with metadata for more details.
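As an illustration of the metadata filtering described under Make Use of Metadata, the sketch below builds a filter dictionary using Pinecone's documented comparison operators (`$eq`, `$gte`, `$lte`). The field names `author` and `created_at` are hypothetical and depend on which metadata fields you configured in the Pinecone destination. Range operators compare numbers, so timestamps are assumed to be stored as epoch seconds.

```python
def build_metadata_filter(author=None, created_after=None, created_before=None):
    """Build a Pinecone-style metadata filter from optional constraints.

    Multiple top-level keys are combined with an implicit AND.
    """
    conditions = {}
    if author is not None:
        conditions["author"] = {"$eq": author}

    date_range = {}
    if created_after is not None:
        date_range["$gte"] = created_after
    if created_before is not None:
        date_range["$lte"] = created_before
    if date_range:
        conditions["created_at"] = date_range

    return conditions


# Example: documents by one author created on or after a given timestamp.
example_filter = build_metadata_filter(author="jane", created_after=1700000000)
print(example_filter)
```

A dictionary like this would typically be passed as the `filter` argument when querying the index. How query.py exposes that option is not covered here, so treat the wiring as something to adapt to your own code.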