RebelRaider/ClickhouseRAG

ClickhouseRAG

ClickhouseRAG is a Python package designed for efficient data access and management in Clickhouse. It provides an easy-to-use interface for connecting to Clickhouse, executing queries, and managing tables with support for Vectorizers and Retrieval-Augmented Generation (RAG) operations.

Features

  • Easy Clickhouse Connection: Seamlessly connect to your Clickhouse database.
  • Table Management: Effortlessly manage tables with CRUD operations.
  • Vectorization: Integrate with vectorizers for text and data embedding.
  • RAG Operations: Perform Retrieval-Augmented Generation tasks.
  • Backup and Restore: Back up your database to a file and restore it from that file.
  • Cosine Similarity Search: Search data based on cosine similarity.

Installation

You can install ClickhouseRAG via pip:

pip install clickhouserag

Usage

Connecting to Clickhouse

Create a client to connect to your Clickhouse database.

from clickhouserag.data_access.clickhouse_client import ClickhouseConnectClient

client = ClickhouseConnectClient(
    host="localhost",
    port=9000,
    username="default",
    password="",
    database="default"
)
client.connect()

Defining Table Schema

Define the schema for your table in Clickhouse.

table_schema = {
    "id": "UInt32",
    "title": "String",
    "vector": "Array(Float64)"
}
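
To make the role of the schema dictionary concrete, here is a minimal sketch of how such a mapping could be rendered as a CREATE TABLE statement. The helper build_create_table_sql is hypothetical and is not part of ClickhouseRAG; the engine and order_by defaults simply mirror the arguments passed to RAGManager in the next section.

```python
from typing import Dict


def build_create_table_sql(
    table: str,
    schema: Dict[str, str],
    engine: str = "MergeTree",
    order_by: str = "id",
) -> str:
    """Hypothetical helper: render a schema dict as a ClickHouse CREATE TABLE statement."""
    # One "name Type" pair per column, in dict insertion order.
    columns = ", ".join(f"{name} {col_type}" for name, col_type in schema.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({columns}) "
        f"ENGINE = {engine} ORDER BY {order_by}"
    )


table_schema = {"id": "UInt32", "title": "String", "vector": "Array(Float64)"}
ddl = build_create_table_sql("rag_table", table_schema)
print(ddl)
```

The package may build its DDL differently; the sketch only shows which information the schema, engine, and ordering key contribute.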

Managing Tables

Create an instance of RAGManager to manage your table with the specified engine and schema.

from clickhouserag.rag.manager import RAGManager

rag_manager = RAGManager(client, "rag_table", table_schema, engine="MergeTree", order_by="id")

Creating and Adding Vectorizer

Create and add a Transformers vectorizer to the RAGManager.

from typing import Any, List

import torch
from transformers import AutoModel, AutoTokenizer
from clickhouserag.vectorizers.base import VectorizerBase

class TransformersVectorizer(VectorizerBase):
    """Vectorizer that uses a Transformers model to convert text to vectors."""
    
    def __init__(self, model_name: str) -> None:
        """Initialize the TransformersVectorizer."""
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
    
    def vectorize(self, data: Any) -> List[float]:
        """Convert text data into a vector representation using a Transformers model."""
        if not isinstance(data, str):
            raise ValueError("Data should be a string for text vectorization.")
        
        inputs = self.tokenizer(data, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
            vector = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
        
        return vector

    def bulk_vectorize(self, data: Any) -> List[List[float]]:
        """Convert a list of texts into vector representations using a Transformers model."""

        # isinstance() cannot check against List[str], so validate the list and its items separately.
        if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
            raise ValueError("Data should be a list of strings for text vectorization.")

        inputs = self.tokenizer(
            data, return_tensors="pt", truncation=True, padding=True
        )
        with torch.no_grad():
            outputs = self.model(**inputs)
            # Keep the batch dimension (no squeeze) so a one-item list still yields a list of vectors.
            vectors = outputs.last_hidden_state.mean(dim=1).tolist()

        return vectors

transformers_vectorizer = TransformersVectorizer(model_name="distilbert-base-uncased")
rag_manager.add_vectorizer("transformers", transformers_vectorizer)
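
The vectorizer above averages token vectors with mean(dim=1). When several texts are padded to a common length, a plain mean also counts the padding positions. The following dependency-free sketch illustrates the difference between a plain and a masked mean; the function mean_pool and the toy numbers are illustrative assumptions, not code from the package.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average token vectors, counting only positions where mask is 1."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
    if not kept:
        raise ValueError("Mask excludes every token.")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token positions of dimension 2; the last one stands in for padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
unmasked = mean_pool(tokens, [1, 1, 1])  # padding included, like a plain mean over tokens
masked = mean_pool(tokens, [1, 1, 0])    # padding ignored
print(unmasked, masked)
```

With a real model, the same idea would use the attention_mask returned by the tokenizer to weight last_hidden_state before averaging.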

Adding Data with Vectorization

Add individual data records with vectorization through Transformers.

data = {"id": 1, "title": "Sample text data for transformers"}
rag_manager.add_data(data, vectorizer_name="transformers")

Bulk Adding Data with Vectorization

Add multiple data records with vectorization through Transformers.

bulk_data = [
    {"id": 2, "title": "Sample text data 1 for transformers"},
    {"id": 3, "title": "Sample text data 2 for transformers"},
    {"id": 4, "title": "Sample text data 3 for transformers"}
]
rag_manager.add_bulk_data(bulk_data, vectorizer_name="transformers")

Retrieving Data by ID

Retrieve data from the RAG by ID.

data = rag_manager.get_data(1)
print("Data with ID 1:", data)

Updating Data with Vectorization

Update data with vectorization through Transformers.

updated_data = {"id": 1, "title": "Updated text data for transformers"}
rag_manager.update_data(1, updated_data, vectorizer_name="transformers")

Executing Text Search

Perform a text search by passing a SQL query to the RAG.

query = "SELECT * FROM rag_table WHERE title LIKE '%Sample%'"
search_results = rag_manager.search(query)
print("Search results:", search_results)

Executing Cosine Similarity Search

Perform a cosine similarity search on the RAG.

import numpy as np

embedding = np.random.rand(768)  # Random example vector; 768 matches the hidden size of distilbert-base-uncased
similarity_results = rag_manager.similarity_search(embedding, top_k=2, columns=["id", "title"])
print("Similarity search results:", similarity_results)
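
For readers who want to see what a cosine similarity ranking computes, here is a small brute-force sketch in plain Python. It is an illustration under assumptions: the functions cosine_similarity and top_k and the in-memory rows dictionary are hypothetical, and the package itself may delegate this work to Clickhouse.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return dot(a, b) / (|a| * |b|), or 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], rows: Dict[int, List[float]], k: int) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and keep the k most similar."""
    scored = [(row_id, cosine_similarity(query, vec)) for row_id, vec in rows.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors keyed by row id.
rows = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
results = top_k([1.0, 0.0], rows, k=2)
print(results)
```

Row 1 points in the same direction as the query and ranks first, row 3 is at 45 degrees and ranks second, and the orthogonal row 2 is dropped.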

Deleting Data

Delete data from the RAG by ID.

rag_manager.delete_data(1)

Backing Up the Database

Back up the database to a JSON file.

rag_manager.backup_database("backup.json")
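
As a rough picture of what a JSON backup involves, the sketch below writes rows to a file and reads them back. The functions dump_rows and load_rows and the row layout are assumptions made for illustration; the actual file format written by backup_database is not documented here.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def dump_rows(rows: List[Dict[str, Any]], path: str) -> None:
    """Write rows (a list of column-name to value dicts) to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh)


def load_rows(path: str) -> List[Dict[str, Any]]:
    """Read rows back from a JSON file written by dump_rows."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


rows = [{"id": 1, "title": "Sample", "vector": [0.1, 0.2]}]
path = os.path.join(tempfile.mkdtemp(), "backup.json")
dump_rows(rows, path)
restored = load_rows(path)
print(restored == rows)
```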

Resetting and Restoring the Database

Reset and restore the database from a backup file.

rag_manager.reset_database()
rag_manager.restore_database("backup.json", table_schema=table_schema)

Closing the Database Connection

Close the connection to the Clickhouse database.

client.close()
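
If an exception is raised before close() is reached, the connection can stay open. One common Python pattern for this is contextlib.closing, which calls close() on exit from the with block. The sketch uses a stand-in class, DummyClient, which is hypothetical; it assumes only that the real client exposes a close() method, as shown above.

```python
from contextlib import closing


class DummyClient:
    """Stand-in for a database client; only tracks whether close() was called."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


dummy = DummyClient()
try:
    # closing() calls dummy.close() when the block exits, even on an exception.
    with closing(dummy):
        raise RuntimeError("simulated query failure")
except RuntimeError:
    pass

print(dummy.closed)
```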

Contributing

Contributions are welcome! Please read the contribution guidelines first.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any questions or inquiries, please contact Leonid Chesnikov at [email protected].

Project Structure

  • clickhouserag.data_access: Contains modules for managing Clickhouse connections and tables.
  • clickhouserag.rag: Contains modules for RAG operations and vectorizers.

Requirements

  • clickhouse-driver
  • numpy

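These dependencies are automatically installed when you install the package via pip. The Transformers vectorizer example above also imports torch and transformers, which are not in this list and may need to be installed separately.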

Development

To contribute to this project, follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes and commit them (git commit -am 'Add new feature').
  4. Push to the branch (git push origin feature-branch).
  5. Create a new Pull Request.

We appreciate your contributions and efforts in improving this project!

Keywords

  • Clickhouse
  • Data Access
  • Table Management
  • Vectorizer
  • RAG (Retrieval-Augmented Generation)

GitHub Repository
