add supabase and postgres + pgvector datastore providers (openai#53)
* add supabase + pgvector datastore provider

* fix conflicts: move supabase docs from readme

* add supabase to unused deps doc

* add supabase local setup docs and tests

* small improvements in setup.md

* add pure postgres implementation and more tests

* add postgres datastore docs to readme

* rebase to latest main

* fix typo in readme - postgres envs

* add some indexes and pgvector idx strategy in docs

* enable RLS by default
egor-romanov authored May 15, 2023
1 parent 649dedf commit 3623ab2
Showing 16 changed files with 2,190 additions and 232 deletions.
36 changes: 31 additions & 5 deletions README.md
@@ -44,6 +44,8 @@ This README provides detailed information on how to set up, develop, and deploy
- [Llama Index](#llamaindex)
- [Chroma](#chroma)
- [Azure Cognitive Search](#azure-cognitive-search)
- [Supabase](#supabase)
- [Postgres](#postgres)
- [Running the API Locally](#running-the-api-locally)
- [Testing a Localhost Plugin in ChatGPT](#testing-a-localhost-plugin-in-chatgpt)
- [Personalization](#personalization)
@@ -142,6 +144,17 @@ Follow these steps to quickly set up and run the ChatGPT Retrieval Plugin:
export AZURESEARCH_SERVICE=<your_search_service_name>
export AZURESEARCH_INDEX=<your_search_index_name>
export AZURESEARCH_API_KEY=<your_api_key> (optional, uses key-free managed identity if not set)
# Supabase
export SUPABASE_URL=<supabase_project_url>
export SUPABASE_ANON_KEY=<supabase_project_api_anon_key>
# Postgres
export PG_HOST=<postgres_host>
export PG_PORT=<postgres_port>
export PG_USER=<postgres_user>
export PG_PASSWORD=<postgres_password>
export PG_DATABASE=<postgres_database>
```

10. Run the API locally: `poetry run start`
@@ -253,11 +266,11 @@ poetry install

The API requires the following environment variables to work:

| Name | Required | Description |
| ---------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `DATASTORE` | Yes | This specifies the vector database provider you want to use to store and query embeddings. You can choose from `chroma`, `pinecone`, `weaviate`, `zilliz`, `milvus`, `qdrant`, `redis`, `azuresearch`. |
| `BEARER_TOKEN` | Yes | This is a secret token that you need to authenticate your requests to the API. You can generate one using any tool or method you prefer, such as [jwt.io](https://jwt.io/). |
| `OPENAI_API_KEY` | Yes | This is your OpenAI API key that you need to generate embeddings using the `text-embedding-ada-002` model. You can get an API key by creating an account on [OpenAI](https://openai.com/). |
| Name | Required | Description |
| ---------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `DATASTORE` | Yes | This specifies the vector database provider you want to use to store and query embeddings. You can choose from `chroma`, `pinecone`, `weaviate`, `zilliz`, `milvus`, `qdrant`, `redis`, `azuresearch`, `supabase`, `postgres`. |
| `BEARER_TOKEN` | Yes | This is a secret token that you need to authenticate your requests to the API. You can generate one using any tool or method you prefer, such as [jwt.io](https://jwt.io/). |
| `OPENAI_API_KEY` | Yes | This is your OpenAI API key that you need to generate embeddings using the `text-embedding-ada-002` model. You can get an API key by creating an account on [OpenAI](https://openai.com/). |

### Using the plugin with Azure OpenAI

@@ -316,6 +329,14 @@ For detailed setup instructions, refer to [`/docs/providers/llama/setup.md`](/do

[Azure Cognitive Search](https://azure.microsoft.com/products/search/) is a complete retrieval cloud service that supports vector search, text search, and hybrid (vectors + text combined to yield the best of the two approaches). It also offers an [optional L2 re-ranking step](https://learn.microsoft.com/azure/search/semantic-search-overview) to further improve results quality. For detailed setup instructions, refer to [`/docs/providers/azuresearch/setup.md`](/docs/providers/azuresearch/setup.md)

#### Supabase

[Supabase](https://supabase.com/blog/openai-embeddings-postgres-vector) offers an easy and efficient way to store vectors via the [pgvector](https://github.com/pgvector/pgvector) extension for Postgres. You can use the [Supabase CLI](https://github.com/supabase/cli) to set up a whole Supabase stack locally or in the cloud, or you can use docker-compose, k8s, and other available options. For a hosted/managed solution, try [Supabase.com](https://supabase.com/) and unlock the full power of Postgres with built-in authentication, storage, auto APIs, and Realtime features. For detailed setup instructions, refer to [`/docs/providers/supabase/setup.md`](/docs/providers/supabase/setup.md).
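
As a rough sketch of how the Supabase configuration comes together, the snippet below calls the `match_page_sections` stored function (the same function the pgvector datastore queries) through the `supabase-py` client. This is illustrative only: it assumes `SUPABASE_URL` and `SUPABASE_ANON_KEY` are set and that the setup migration has created the function, it uses a placeholder zero vector instead of a real `text-embedding-ada-002` embedding, and the provider itself may use a different client.

```python
import os

from supabase import create_client  # illustrative; not necessarily the provider's client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_ANON_KEY"])

# Placeholder embedding; a real query would use a 1536-dim ada-002 vector.
query_embedding = [0.0] * 1536

# Call the same stored function the pgvector datastore uses for similarity search.
response = supabase.rpc(
    "match_page_sections",
    {"in_embedding": query_embedding, "in_match_count": 3},
).execute()
print(response.data)
```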

#### Postgres

[Postgres](https://www.postgresql.org) offers an easy and efficient way to store vectors via the [pgvector](https://github.com/pgvector/pgvector) extension. To use pgvector, you will need to set up a PostgreSQL database with the pgvector extension enabled. For example, you can [use Docker](https://www.docker.com/blog/how-to-use-the-postgres-docker-official-image/) to run it locally. For a hosted/managed solution, you may try [Supabase](https://supabase.com/) or any other cloud provider with support for pgvector. For detailed setup instructions, refer to [`/docs/providers/postgres/setup.md`](/docs/providers/postgres/setup.md).
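
As a minimal sketch of what the Postgres provider needs on the database side, the snippet below connects using the `PG_*` variables listed above and makes sure the pgvector extension is installed. It uses `asyncpg` purely for illustration; the provider may use a different driver, and the fallback defaults shown are assumptions.

```python
import asyncio
import os

import asyncpg  # illustration only; the provider may use a different driver


async def ensure_pgvector() -> None:
    conn = await asyncpg.connect(
        host=os.environ.get("PG_HOST", "localhost"),
        port=int(os.environ.get("PG_PORT", "5432")),
        user=os.environ.get("PG_USER", "postgres"),
        password=os.environ.get("PG_PASSWORD", "postgres"),
        database=os.environ.get("PG_DATABASE", "postgres"),
    )
    try:
        # The datastore requires the pgvector extension to store embeddings.
        await conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        version = await conn.fetchval(
            "SELECT extversion FROM pg_extension WHERE extname = 'vector';"
        )
        print(f"pgvector version: {version}")
    finally:
        await conn.close()


asyncio.run(ensure_pgvector())
```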

### Running the API locally

To run the API locally, you first need to set the requisite environment variables with the `export` command:
@@ -506,3 +527,8 @@ We would like to extend our gratitude to the following contributors for their co
- [LlamaIndex](https://github.com/jerryjliu/llama_index)
- [jerryjliu](https://github.com/jerryjliu)
- [Disiok](https://github.com/Disiok)
- [Supabase](https://supabase.com/)
- [egor-romanov](https://github.com/egor-romanov)
- [Postgres](https://www.postgresql.org/)
- [egor-romanov](https://github.com/egor-romanov)
- [mmmaia](https://github.com/mmmaia)
9 changes: 9 additions & 0 deletions datastore/factory.py
@@ -13,6 +13,7 @@ async def get_datastore() -> DataStore:
            return ChromaDataStore()
        case "llama":
            from datastore.providers.llama_datastore import LlamaDataStore

            return LlamaDataStore()

        case "pinecone":
@@ -43,6 +44,14 @@ async def get_datastore() -> DataStore:
            from datastore.providers.azuresearch_datastore import AzureSearchDataStore

            return AzureSearchDataStore()
        case "supabase":
            from datastore.providers.supabase_datastore import SupabaseDataStore

            return SupabaseDataStore()
        case "postgres":
            from datastore.providers.postgres_datastore import PostgresDataStore

            return PostgresDataStore()
        case _:
            raise ValueError(
                f"Unsupported vector database: {datastore}. "
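
A minimal usage sketch for the factory above, assuming the `DATASTORE` environment variable selects the provider (as the README describes) and the corresponding connection variables (for example the `PG_*` ones) point at a reachable database:

```python
import asyncio
import os

from datastore.factory import get_datastore


async def main() -> None:
    os.environ["DATASTORE"] = "postgres"  # or "supabase"
    store = await get_datastore()
    print(type(store).__name__)  # expected: PostgresDataStore


asyncio.run(main())
```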
180 changes: 180 additions & 0 deletions datastore/providers/pgvector_datastore.py
@@ -0,0 +1,180 @@
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional
from datetime import datetime

from services.date import to_unix_timestamp
from datastore.datastore import DataStore
from models.models import (
    DocumentChunk,
    DocumentChunkMetadata,
    DocumentMetadataFilter,
    QueryResult,
    QueryWithEmbedding,
    DocumentChunkWithScore,
)


# Interface for a Postgres client, implemented by Postgres-based DataStore providers
class PGClient(ABC):
    @abstractmethod
    async def upsert(self, table: str, json: dict[str, Any]) -> None:
        """
        Takes in a document chunk row as a dict and upserts it into the table.
        """
        raise NotImplementedError

    @abstractmethod
    async def rpc(self, function_name: str, params: dict[str, Any]) -> Any:
        """
        Calls a stored procedure in the database with the given parameters.
        """
        raise NotImplementedError

    @abstractmethod
    async def delete_like(self, table: str, column: str, pattern: str) -> None:
        """
        Deletes rows in the table that match the pattern.
        """
        raise NotImplementedError

    @abstractmethod
    async def delete_in(self, table: str, column: str, ids: List[str]) -> None:
        """
        Deletes rows in the table that match the ids.
        """
        raise NotImplementedError

    @abstractmethod
    async def delete_by_filters(
        self, table: str, filter: DocumentMetadataFilter
    ) -> None:
        """
        Deletes rows in the table that match the filter.
        """
        raise NotImplementedError


# Abstract class for Postgres-based DataStore providers that implements the DataStore interface
class PgVectorDataStore(DataStore):
    def __init__(self):
        self.client = self.create_db_client()

    @abstractmethod
    def create_db_client(self) -> PGClient:
        """
        Create the database client, which may access Postgres via different APIs
        (for example, a Supabase client or a psycopg2-based client).
        Returns a client for the Postgres database.
        """

        raise NotImplementedError

    async def _upsert(self, chunks: Dict[str, List[DocumentChunk]]) -> List[str]:
        """
        Takes in a dict of document_ids to lists of document chunks and inserts them into the database.
        Returns a list of document ids.
        """
        for document_id, document_chunks in chunks.items():
            for chunk in document_chunks:
                json = {
                    "id": chunk.id,
                    "content": chunk.text,
                    "embedding": chunk.embedding,
                    "document_id": document_id,
                    "source": chunk.metadata.source,
                    "source_id": chunk.metadata.source_id,
                    "url": chunk.metadata.url,
                    "author": chunk.metadata.author,
                }
                if chunk.metadata.created_at:
                    json["created_at"] = datetime.fromtimestamp(
                        to_unix_timestamp(chunk.metadata.created_at)
                    )
                await self.client.upsert("documents", json)

        return list(chunks.keys())

    async def _query(self, queries: List[QueryWithEmbedding]) -> List[QueryResult]:
        """
        Takes in a list of queries with embeddings and filters and returns a list of query results with matching document chunks and scores.
        """
        query_results: List[QueryResult] = []
        for query in queries:
            # get the top_k documents with the highest cosine similarity using the rpc function in the database called "match_page_sections"
            params = {
                "in_embedding": query.embedding,
            }
            if query.top_k:
                params["in_match_count"] = query.top_k
            if query.filter:
                if query.filter.document_id:
                    params["in_document_id"] = query.filter.document_id
                if query.filter.source:
                    params["in_source"] = query.filter.source.value
                if query.filter.source_id:
                    params["in_source_id"] = query.filter.source_id
                if query.filter.author:
                    params["in_author"] = query.filter.author
                if query.filter.start_date:
                    params["in_start_date"] = datetime.fromtimestamp(
                        to_unix_timestamp(query.filter.start_date)
                    )
                if query.filter.end_date:
                    params["in_end_date"] = datetime.fromtimestamp(
                        to_unix_timestamp(query.filter.end_date)
                    )
            try:
                data = await self.client.rpc("match_page_sections", params=params)
                results: List[DocumentChunkWithScore] = []
                for row in data:
                    document_chunk = DocumentChunkWithScore(
                        id=row["id"],
                        text=row["content"],
                        # TODO: add embedding to the response ?
                        # embedding=row["embedding"],
                        score=float(row["similarity"]),
                        metadata=DocumentChunkMetadata(
                            source=row["source"],
                            source_id=row["source_id"],
                            document_id=row["document_id"],
                            url=row["url"],
                            created_at=row["created_at"],
                            author=row["author"],
                        ),
                    )
                    results.append(document_chunk)
                query_results.append(QueryResult(query=query.query, results=results))
            except Exception as e:
                print("error:", e)
                query_results.append(QueryResult(query=query.query, results=[]))
        return query_results

    async def delete(
        self,
        ids: Optional[List[str]] = None,
        filter: Optional[DocumentMetadataFilter] = None,
        delete_all: Optional[bool] = None,
    ) -> bool:
        """
        Removes vectors by ids, filter, or everything in the datastore.
        Multiple parameters can be used at once.
        Returns whether the operation was successful.
        """
        if delete_all:
            try:
                await self.client.delete_like("documents", "document_id", "%")
            except Exception:
                return False
        elif ids:
            try:
                await self.client.delete_in("documents", "document_id", ids)
            except Exception:
                return False
        elif filter:
            try:
                await self.client.delete_by_filters("documents", filter)
            except Exception:
                return False
        return True
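
To illustrate how a concrete provider is expected to plug into `PgVectorDataStore`, here is a hypothetical stub that implements `PGClient` and returns it from `create_db_client()`. It only logs calls instead of talking to Postgres; the real Supabase and Postgres clients live in their own provider modules.

```python
from typing import Any, List

from datastore.providers.pgvector_datastore import PGClient, PgVectorDataStore
from models.models import DocumentMetadataFilter


class StubPGClient(PGClient):
    """Hypothetical client that records calls instead of querying Postgres."""

    async def upsert(self, table: str, json: dict[str, Any]) -> None:
        print(f"upsert into {table}: {json.get('id')}")

    async def rpc(self, function_name: str, params: dict[str, Any]) -> Any:
        print(f"call {function_name} with {sorted(params)}")
        return []  # a real client would return matching rows

    async def delete_like(self, table: str, column: str, pattern: str) -> None:
        print(f"delete from {table} where {column} like {pattern!r}")

    async def delete_in(self, table: str, column: str, ids: List[str]) -> None:
        print(f"delete from {table} where {column} in {ids}")

    async def delete_by_filters(
        self, table: str, filter: DocumentMetadataFilter
    ) -> None:
        print(f"delete from {table} matching {filter}")


# A concrete datastore only has to supply the client.
class StubPgVectorDataStore(PgVectorDataStore):
    def create_db_client(self) -> PGClient:
        return StubPGClient()
```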
