add supabase and postgres + pgvector datastore providers (openai#53)
* add supabase + pgvector datastore provider

* fix conflicts: move supabase docs from readme

* add supabase to unused deps doc

* add supabase local setup docs and tests

* small improvements in setup.md

* add pure postgres implementation and more tests

* add postgres datastore docs to readme

* rebase to latest main

* fix typo in readme - postgres envs

* add some indexes and pgvector idx strategy in docs

* enable RLS by default
egor-romanov authored May 15, 2023
1 parent 649dedf commit 3623ab2
Showing 16 changed files with 2,190 additions and 232 deletions.
36 changes: 31 additions & 5 deletions README.md
@@ -44,6 +44,8 @@ This README provides detailed information on how to set up, develop, and deploy
- [Llama Index](#llamaindex)
- [Chroma](#chroma)
- [Azure Cognitive Search](#azure-cognitive-search)
- [Supabase](#supabase)
- [Postgres](#postgres)
- [Running the API Locally](#running-the-api-locally)
- [Testing a Localhost Plugin in ChatGPT](#testing-a-localhost-plugin-in-chatgpt)
- [Personalization](#personalization)
@@ -142,6 +144,17 @@ Follow these steps to quickly set up and run the ChatGPT Retrieval Plugin:
export AZURESEARCH_SERVICE=<your_search_service_name>
export AZURESEARCH_INDEX=<your_search_index_name>
export AZURESEARCH_API_KEY=<your_api_key> (optional, uses key-free managed identity if not set)
# Supabase
export SUPABASE_URL=<supabase_project_url>
export SUPABASE_ANON_KEY=<supabase_project_api_anon_key>
# Postgres
export PG_HOST=<postgres_host>
export PG_PORT=<postgres_port>
export PG_USER=<postgres_user>
export PG_PASSWORD=<postgres_password>
export PG_DATABASE=<postgres_database>
```

10. Run the API locally: `poetry run start`
@@ -253,11 +266,11 @@ poetry install

The API requires the following environment variables to work:

| Name | Required | Description |
| ---------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `DATASTORE` | Yes | This specifies the vector database provider you want to use to store and query embeddings. You can choose from `chroma`, `pinecone`, `weaviate`, `zilliz`, `milvus`, `qdrant`, `redis`, `azuresearch`. |
| `BEARER_TOKEN` | Yes | This is a secret token that you need to authenticate your requests to the API. You can generate one using any tool or method you prefer, such as [jwt.io](https://jwt.io/). |
| `OPENAI_API_KEY` | Yes | This is your OpenAI API key that you need to generate embeddings using the `text-embedding-ada-002` model. You can get an API key by creating an account on [OpenAI](https://openai.com/). |
| Name | Required | Description |
| ---------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `DATASTORE` | Yes | This specifies the vector database provider you want to use to store and query embeddings. You can choose from `chroma`, `pinecone`, `weaviate`, `zilliz`, `milvus`, `qdrant`, `redis`, `azuresearch`, `supabase`, `postgres`. |
| `BEARER_TOKEN` | Yes | This is a secret token that you need to authenticate your requests to the API. You can generate one using any tool or method you prefer, such as [jwt.io](https://jwt.io/). |
| `OPENAI_API_KEY` | Yes | This is your OpenAI API key that you need to generate embeddings using the `text-embedding-ada-002` model. You can get an API key by creating an account on [OpenAI](https://openai.com/). |

### Using the plugin with Azure OpenAI

@@ -316,6 +329,14 @@ For detailed setup instructions, refer to [`/docs/providers/llama/setup.md`](/do

[Azure Cognitive Search](https://azure.microsoft.com/products/search/) is a complete retrieval cloud service that supports vector search, text search, and hybrid (vectors + text combined to yield the best of the two approaches). It also offers an [optional L2 re-ranking step](https://learn.microsoft.com/azure/search/semantic-search-overview) to further improve results quality. For detailed setup instructions, refer to [`/docs/providers/azuresearch/setup.md`](/docs/providers/azuresearch/setup.md)

#### Supabase

[Supabase](https://supabase.com/blog/openai-embeddings-postgres-vector) offers an easy and efficient way to store vectors via the [pgvector](https://github.com/pgvector/pgvector) extension for Postgres. You can use the [Supabase CLI](https://github.com/supabase/cli) to set up a whole Supabase stack locally or in the cloud, or you can use docker-compose, k8s, and other available options. For a hosted/managed solution, try [Supabase.com](https://supabase.com/) and unlock the full power of Postgres with built-in authentication, storage, auto APIs, and Realtime features. For detailed setup instructions, refer to [`/docs/providers/supabase/setup.md`](/docs/providers/supabase/setup.md).
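
As a rough sketch of how the Supabase configuration comes together, the snippet below calls the `match_page_sections` stored function (the same function the pgvector datastore queries) through the `supabase-py` client. This is illustrative only: it assumes `SUPABASE_URL` and `SUPABASE_ANON_KEY` are set and that the setup migration has created the function, it uses a placeholder zero vector instead of a real `text-embedding-ada-002` embedding, and the provider itself may use a different client.

```python
import os

from supabase import create_client  # illustrative; not necessarily the provider's client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_ANON_KEY"])

# Placeholder embedding; a real query would use a 1536-dim ada-002 vector.
query_embedding = [0.0] * 1536

# Call the same stored function the pgvector datastore uses for similarity search.
response = supabase.rpc(
    "match_page_sections",
    {"in_embedding": query_embedding, "in_match_count": 3},
).execute()
print(response.data)
```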

#### Postgres

[Postgres](https://www.postgresql.org) offers an easy and efficient way to store vectors via the [pgvector](https://github.com/pgvector/pgvector) extension. To use pgvector, you will need to set up a PostgreSQL database with the pgvector extension enabled. For example, you can [use Docker](https://www.docker.com/blog/how-to-use-the-postgres-docker-official-image/) to run it locally. For a hosted/managed solution, you may try [Supabase](https://supabase.com/) or any other cloud provider with support for pgvector. For detailed setup instructions, refer to [`/docs/providers/postgres/setup.md`](/docs/providers/postgres/setup.md).
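
As a minimal sketch of what the Postgres provider needs on the database side, the snippet below connects using the `PG_*` variables listed above and makes sure the pgvector extension is installed. It uses `asyncpg` purely for illustration; the provider may use a different driver, and the fallback defaults shown are assumptions.

```python
import asyncio
import os

import asyncpg  # illustration only; the provider may use a different driver


async def ensure_pgvector() -> None:
    conn = await asyncpg.connect(
        host=os.environ.get("PG_HOST", "localhost"),
        port=int(os.environ.get("PG_PORT", "5432")),
        user=os.environ.get("PG_USER", "postgres"),
        password=os.environ.get("PG_PASSWORD", "postgres"),
        database=os.environ.get("PG_DATABASE", "postgres"),
    )
    try:
        # The datastore requires the pgvector extension to store embeddings.
        await conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        version = await conn.fetchval(
            "SELECT extversion FROM pg_extension WHERE extname = 'vector';"
        )
        print(f"pgvector version: {version}")
    finally:
        await conn.close()


asyncio.run(ensure_pgvector())
```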

### Running the API locally

To run the API locally, you first need to set the requisite environment variables with the `export` command:
@@ -506,3 +527,8 @@ We would like to extend our gratitude to the following contributors for their co
- [LlamaIndex](https://github.com/jerryjliu/llama_index)
- [jerryjliu](https://github.com/jerryjliu)
- [Disiok](https://github.com/Disiok)
- [Supabase](https://supabase.com/)
- [egor-romanov](https://github.com/egor-romanov)
- [Postgres](https://www.postgresql.org/)
- [egor-romanov](https://github.com/egor-romanov)
- [mmmaia](https://github.com/mmmaia)
9 changes: 9 additions & 0 deletions datastore/factory.py
@@ -13,6 +13,7 @@ async def get_datastore() -> DataStore:
            return ChromaDataStore()
        case "llama":
            from datastore.providers.llama_datastore import LlamaDataStore

            return LlamaDataStore()

        case "pinecone":
@@ -43,6 +44,14 @@ async def get_datastore() -> DataStore:
            from datastore.providers.azuresearch_datastore import AzureSearchDataStore

            return AzureSearchDataStore()
        case "supabase":
            from datastore.providers.supabase_datastore import SupabaseDataStore

            return SupabaseDataStore()
        case "postgres":
            from datastore.providers.postgres_datastore import PostgresDataStore

            return PostgresDataStore()
        case _:
            raise ValueError(
                f"Unsupported vector database: {datastore}. "
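
A minimal usage sketch for the factory above, assuming the `DATASTORE` environment variable selects the provider (as the README describes) and the corresponding connection variables (for example the `PG_*` ones) point at a reachable database:

```python
import asyncio
import os

from datastore.factory import get_datastore


async def main() -> None:
    os.environ["DATASTORE"] = "postgres"  # or "supabase"
    store = await get_datastore()
    print(type(store).__name__)  # expected: PostgresDataStore


asyncio.run(main())
```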
180 changes: 180 additions & 0 deletions datastore/providers/pgvector_datastore.py
@@ -0,0 +1,180 @@
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional
from datetime import datetime

from services.date import to_unix_timestamp
from datastore.datastore import DataStore
from models.models import (
    DocumentChunk,
    DocumentChunkMetadata,
    DocumentMetadataFilter,
    QueryResult,
    QueryWithEmbedding,
    DocumentChunkWithScore,
)


# Interface for a Postgres client, implemented by Postgres-based DataStore providers
class PGClient(ABC):
    @abstractmethod
    async def upsert(self, table: str, json: dict[str, Any]) -> None:
        """
        Takes in a document chunk row as a dict and upserts it into the table.
        """
        raise NotImplementedError

    @abstractmethod
    async def rpc(self, function_name: str, params: dict[str, Any]) -> Any:
        """
        Calls a stored procedure in the database with the given parameters.
        """
        raise NotImplementedError

    @abstractmethod
    async def delete_like(self, table: str, column: str, pattern: str) -> None:
        """
        Deletes rows in the table that match the pattern.
        """
        raise NotImplementedError

    @abstractmethod
    async def delete_in(self, table: str, column: str, ids: List[str]) -> None:
        """
        Deletes rows in the table that match the ids.
        """
        raise NotImplementedError

    @abstractmethod
    async def delete_by_filters(
        self, table: str, filter: DocumentMetadataFilter
    ) -> None:
        """
        Deletes rows in the table that match the filter.
        """
        raise NotImplementedError


# Abstract class for Postgres-based DataStore providers that implements the DataStore interface
class PgVectorDataStore(DataStore):
    def __init__(self):
        self.client = self.create_db_client()

    @abstractmethod
    def create_db_client(self) -> PGClient:
        """
        Create the database client, which may access Postgres via different APIs
        (for example, a Supabase client or a psycopg2-based client).
        Returns a client for the Postgres database.
        """

        raise NotImplementedError

    async def _upsert(self, chunks: Dict[str, List[DocumentChunk]]) -> List[str]:
        """
        Takes in a dict of document_ids to lists of document chunks and inserts them into the database.
        Returns a list of document ids.
        """
        for document_id, document_chunks in chunks.items():
            for chunk in document_chunks:
                json = {
                    "id": chunk.id,
                    "content": chunk.text,
                    "embedding": chunk.embedding,
                    "document_id": document_id,
                    "source": chunk.metadata.source,
                    "source_id": chunk.metadata.source_id,
                    "url": chunk.metadata.url,
                    "author": chunk.metadata.author,
                }
                if chunk.metadata.created_at:
                    json["created_at"] = datetime.fromtimestamp(
                        to_unix_timestamp(chunk.metadata.created_at)
                    )
                await self.client.upsert("documents", json)

        return list(chunks.keys())

    async def _query(self, queries: List[QueryWithEmbedding]) -> List[QueryResult]:
        """
        Takes in a list of queries with embeddings and filters and returns a list of query results with matching document chunks and scores.
        """
        query_results: List[QueryResult] = []
        for query in queries:
            # get the top_k documents with the highest cosine similarity using the rpc function in the database called "match_page_sections"
            params = {
                "in_embedding": query.embedding,
            }
            if query.top_k:
                params["in_match_count"] = query.top_k
            if query.filter:
                if query.filter.document_id:
                    params["in_document_id"] = query.filter.document_id
                if query.filter.source:
                    params["in_source"] = query.filter.source.value
                if query.filter.source_id:
                    params["in_source_id"] = query.filter.source_id
                if query.filter.author:
                    params["in_author"] = query.filter.author
                if query.filter.start_date:
                    params["in_start_date"] = datetime.fromtimestamp(
                        to_unix_timestamp(query.filter.start_date)
                    )
                if query.filter.end_date:
                    params["in_end_date"] = datetime.fromtimestamp(
                        to_unix_timestamp(query.filter.end_date)
                    )
            try:
                data = await self.client.rpc("match_page_sections", params=params)
                results: List[DocumentChunkWithScore] = []
                for row in data:
                    document_chunk = DocumentChunkWithScore(
                        id=row["id"],
                        text=row["content"],
                        # TODO: add embedding to the response ?
                        # embedding=row["embedding"],
                        score=float(row["similarity"]),
                        metadata=DocumentChunkMetadata(
                            source=row["source"],
                            source_id=row["source_id"],
                            document_id=row["document_id"],
                            url=row["url"],
                            created_at=row["created_at"],
                            author=row["author"],
                        ),
                    )
                    results.append(document_chunk)
                query_results.append(QueryResult(query=query.query, results=results))
            except Exception as e:
                print("error:", e)
                query_results.append(QueryResult(query=query.query, results=[]))
        return query_results

    async def delete(
        self,
        ids: Optional[List[str]] = None,
        filter: Optional[DocumentMetadataFilter] = None,
        delete_all: Optional[bool] = None,
    ) -> bool:
        """
        Removes vectors by ids, filter, or everything in the datastore.
        Multiple parameters can be used at once.
        Returns whether the operation was successful.
        """
        if delete_all:
            try:
                await self.client.delete_like("documents", "document_id", "%")
            except Exception:
                return False
        elif ids:
            try:
                await self.client.delete_in("documents", "document_id", ids)
            except Exception:
                return False
        elif filter:
            try:
                await self.client.delete_by_filters("documents", filter)
            except Exception:
                return False
        return True
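
To illustrate how a concrete provider is expected to plug into `PgVectorDataStore`, here is a hypothetical stub that implements `PGClient` and returns it from `create_db_client()`. It only logs calls instead of talking to Postgres; the real Supabase and Postgres clients live in their own provider modules.

```python
from typing import Any, List

from datastore.providers.pgvector_datastore import PGClient, PgVectorDataStore
from models.models import DocumentMetadataFilter


class StubPGClient(PGClient):
    """Hypothetical client that records calls instead of querying Postgres."""

    async def upsert(self, table: str, json: dict[str, Any]) -> None:
        print(f"upsert into {table}: {json.get('id')}")

    async def rpc(self, function_name: str, params: dict[str, Any]) -> Any:
        print(f"call {function_name} with {sorted(params)}")
        return []  # a real client would return matching rows

    async def delete_like(self, table: str, column: str, pattern: str) -> None:
        print(f"delete from {table} where {column} like {pattern!r}")

    async def delete_in(self, table: str, column: str, ids: List[str]) -> None:
        print(f"delete from {table} where {column} in {ids}")

    async def delete_by_filters(
        self, table: str, filter: DocumentMetadataFilter
    ) -> None:
        print(f"delete from {table} matching {filter}")


# A concrete datastore only has to supply the client.
class StubPgVectorDataStore(PgVectorDataStore):
    def create_db_client(self) -> PGClient:
        return StubPGClient()
```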
