VectorFlow is an open-source, high-throughput, fault-tolerant vector embedding pipeline. With a simple API request, you can send raw data that will be embedded and stored in any vector database or returned to you.
The current version is an MVP and should not be used in production yet. Right now the system only supports uploading a single TXT or PDF file at a time, up to 25 MB.
The best way to run VectorFlow is via `docker compose`.
First create a folder in the root for all the environment variables:
```bash
mkdir env_scripts
cd env_scripts
touch env_vars.env
```
This creates a file called `env_vars.env` in the `env_scripts` folder, where you will add all the environment variables listed below:
```
INTERNAL_API_KEY=your-choice
POSTGRES_USERNAME=postgres
POSTGRES_PASSWORD=your-choice
POSTGRES_DB=vectorflow
POSTGRES_HOST=postgres
RABBITMQ_USERNAME=guest
RABBITMQ_PASSWORD=guest
RABBITMQ_HOST=rabbitmq
EMBEDDING_QUEUE=embeddings
VDB_UPLOAD_QUEUE=vdb-upload
LOCAL_VECTOR_DB=qdrant | milvus | weaviate
```
You can choose any value for `INTERNAL_API_KEY`, `POSTGRES_PASSWORD`, and `POSTGRES_DB`, but they must be set.
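If you want to sanity-check the file before starting the stack, here is a small, hypothetical helper (not part of VectorFlow) that reports any missing variables when run from the repository root:

```python
# check_env.py -- illustrative helper, not part of VectorFlow.
# Reports any variables from the list above that are missing from
# env_scripts/env_vars.env. Run it from the repository root.
REQUIRED = [
    "INTERNAL_API_KEY", "POSTGRES_USERNAME", "POSTGRES_PASSWORD",
    "POSTGRES_DB", "POSTGRES_HOST", "RABBITMQ_USERNAME",
    "RABBITMQ_PASSWORD", "RABBITMQ_HOST", "EMBEDDING_QUEUE",
    "VDB_UPLOAD_QUEUE", "LOCAL_VECTOR_DB",
]

with open("env_scripts/env_vars.env") as f:
    defined = {line.split("=", 1)[0].strip() for line in f if "=" in line}

missing = [name for name in REQUIRED if name not in defined]
print("missing:", ", ".join(missing) if missing else "none")
```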
Make sure you pull the RabbitMQ and Postgres images into your local Docker repository. We also recommend running a vector DB locally, so pull the image of the one you are using:
```bash
docker pull rabbitmq
docker pull postgres
docker pull qdrant/qdrant | docker pull milvusdb/milvus | docker pull semitechnologies/weaviate
```
Then run:
```bash
docker-compose build --no-cache
docker-compose up -d
```
Note that the `db-init` container runs a script that sets up the database schema and will stop after the script completes.
VectorFlow can run any Sentence Transformer model, but the `docker-compose` file will not spin it up automatically. Either run `app.py --model_name your-sentence-transformer-model`, or build and run the Docker image in `src/hugging_face` with:
```bash
docker build --file hugging_face/Dockerfile -t vectorflow_hf:latest .
docker run --network=vectorflow --name=vectorflow_hf -d --env-file=/path/to/.env vectorflow_hf:latest --model_name "your_model_name_here"
```
Note that the Sentence Transformer models can be large and take several minutes to download from Hugging Face. VectorFlow does not provision hardware, so you must ensure your hardware has enough RAM/VRAM for the model. By default, VectorFlow will run models on GPU with CUDA if available.
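Before launching the container, you can confirm what device a model will land on; the sketch below is illustrative only, and `all-MiniLM-L6-v2` is just an example model name:

```python
# Quick check of whether a Sentence Transformer will run on GPU.
# Illustrative only; "all-MiniLM-L6-v2" is just an example model.
import torch
from sentence_transformers import SentenceTransformer

print("CUDA available:", torch.cuda.is_available())

model = SentenceTransformer("all-MiniLM-L6-v2")
print("Model loaded on:", model.device)
```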
To use VectorFlow in a live system, make an HTTP request to your API's URL at port 8000 - for example, `localhost:8000` from your development machine, or `vectorflow_api:8000` from within another Docker container.
All requests require an HTTP header with an `Authorization` key whose value matches the `INTERNAL_API_KEY` env var you defined above. Pass your vector database API key with the HTTP header `X-VectorDB-Key` and your embedding API key with `X-EmbeddingAPI-Key`.
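Assembled as a Python dict for use with the `requests` library, the headers might look like this (all values are placeholders):

```python
# The three headers a request may need. All values are placeholders.
headers = {
    "Authorization": "your-internal-api-key",   # must match INTERNAL_API_KEY
    "X-VectorDB-Key": "your-vector-db-api-key",
    "X-EmbeddingAPI-Key": "your-embedding-api-key",
}
```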
VectorFlow currently supports OpenAI ADA embeddings and the Pinecone, Qdrant, Weaviate, and Milvus vector databases.
To check the status of a job, make a `GET` request to this endpoint: `/jobs/<int:job_id>/status`. The response will be in the form:
```
{
    'JobStatus': job_status.value
}
```
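For example, a minimal status check from Python might look like this (the URL and job ID are placeholders for a local deployment):

```python
import requests

# Illustrative sketch: check the status of job 1 on a local deployment.
response = requests.get(
    "http://localhost:8000/jobs/1/status",
    headers={"Authorization": "your-internal-api-key"},
)
print(response.json())  # {'JobStatus': ...}
```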
To submit a job for embedding, make a `POST` request to this endpoint: `/embed` with the following payload and the `'Content-Type: multipart/form-data'` header:
```
{
    'SourceData=path_to_txt_file'
    'LinesPerBatch=4096'
    'EmbeddingsMetadata={
        "embeddings_type": "OPEN_AI | HUGGING_FACE",
        "chunk_size": 512,
        "chunk_overlap": 128,
        "chunk_strategy": "EXACT | PARAGRAPH | SENTENCE",
        "hugging_face_model_name": "model-name-here"
    }'
    'VectorDBMetadata={
        "vector_db_type": "PINECONE | QDRANT | WEAVIATE | MILVUS",
        "index_name": "index_name",
        "environment": "env_name"
    }'
}
```
You will get the following payload back:
```
{
    'message': f"Successfully added {batch_count} batches to the queue",
    'JobID': job_id
}
```
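As an illustration, the same request can be made from Python with the `requests` library; the field values below mirror the curl example that follows, and all keys are placeholders:

```python
import json
import requests

# Illustrative sketch of the /embed request using the requests library.
# The field values mirror the curl example below; all keys are placeholders.
headers = {
    "Authorization": "your-internal-api-key",
    "X-EmbeddingAPI-Key": "your-openai-key",
    "X-VectorDB-Key": "your-pinecone-key",
}
data = {
    "EmbeddingsMetadata": json.dumps({
        "embeddings_type": "open_ai",
        "chunk_size": 256,
        "chunk_overlap": 128,
    }),
    "VectorDBMetadata": json.dumps({
        "vector_db_type": "pinecone",
        "index_name": "test",
        "environment": "us-east-1-aws",
    }),
}
with open("./src/api/tests/fixtures/test_text.txt", "rb") as f:
    response = requests.post(
        "http://localhost:8000/embed",
        headers=headers,
        data=data,
        files={"SourceData": f},
    )
print(response.json())  # {'message': ..., 'JobID': ...}
```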
The following request will embed a TXT document with OpenAI's ADA model and upload the results to a Pinecone index called `test`. Make sure that your Pinecone index is called `test`. If you run the curl command from the root directory, the path to test_text.txt is `./src/api/tests/fixtures/test_text.txt`; change this if you want to embed another TXT document.
```bash
curl -X POST -H 'Content-Type: multipart/form-data' -H "Authorization: INTERNAL_API_KEY" -H "X-EmbeddingAPI-Key: your-key-here" -H "X-VectorDB-Key: your-key-here" -F 'EmbeddingsMetadata={"embeddings_type": "open_ai", "chunk_size": 256, "chunk_overlap": 128}' -F 'SourceData=@./src/api/tests/fixtures/test_text.txt' -F 'VectorDBMetadata={"vector_db_type": "pinecone", "index_name": "test", "environment": "us-east-1-aws"}' http://localhost:8000/embed
```
To check the status of the job, run:
```bash
curl -X GET -H "Authorization: INTERNAL_API_KEY" http://localhost:8000/jobs/<job_id>/status
```
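If you want to wait for a job to finish programmatically, a simple polling loop works; note that the terminal status strings below are assumptions, so check the values your deployment actually returns:

```python
import time
import requests

job_id = 1  # placeholder: use the JobID returned by /embed
headers = {"Authorization": "your-internal-api-key"}

# Poll until the job leaves its in-progress state. The exact status
# strings are an assumption; verify against your deployment's responses.
while True:
    response = requests.get(
        f"http://localhost:8000/jobs/{job_id}/status", headers=headers)
    status = response.json()["JobStatus"]
    print("job status:", status)
    if status not in ("NOT_STARTED", "IN_PROGRESS"):
        break
    time.sleep(5)
```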
VectorFlow enforces a standardized schema for uploading data to a vector store:
```
id: int
source_data: string
embeddings: float array
```
The `id` can be used for deduplication and idempotency. Please note that for Weaviate, the id is called `vectorflow_id`. We plan to support dynamically detected and/or configurable schemas down the road.
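For illustration, a single record under this schema might look like the following (values are made up; real embedding vectors have the model's full dimensionality):

```python
# An illustrative record under the standardized schema. Values are
# made up; real embedding vectors have the model's full dimensionality.
record = {
    "id": 42,                              # used for deduplication and idempotency
    "source_data": "the chunk of raw text that was embedded",
    "embeddings": [0.012, -0.034, 0.056],  # truncated for illustration
}
# Note: for Weaviate the id field is named vectorflow_id instead of id.
```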
VectorFlow is integrated with AWS S3. You can pass a pre-signed S3 URL in the body of the HTTP request instead of a file. Use the form field `PreSignedURL` and hit the endpoint `/s3`. This endpoint has the same configuration and restrictions as the `/embed` endpoint.
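A hypothetical `/s3` request from Python, assuming the same headers and metadata fields as `/embed` (the pre-signed URL is a placeholder):

```python
import json
import requests

# Hypothetical /s3 request: same fields as /embed, but with a
# PreSignedURL form field instead of an uploaded file.
headers = {
    "Authorization": "your-internal-api-key",
    "X-EmbeddingAPI-Key": "your-openai-key",
    "X-VectorDB-Key": "your-pinecone-key",
}
data = {
    "PreSignedURL": "https://example-bucket.s3.amazonaws.com/test_text.txt",  # placeholder
    "EmbeddingsMetadata": json.dumps({
        "embeddings_type": "open_ai",
        "chunk_size": 256,
        "chunk_overlap": 128,
    }),
    "VectorDBMetadata": json.dumps({
        "vector_db_type": "pinecone",
        "index_name": "test",
        "environment": "us-east-1-aws",
    }),
}
response = requests.post("http://localhost:8000/s3", headers=headers, data=data)
print(response.json())
```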
We love feedback from the community. If you have an idea of how to make this project better, we encourage you to open an issue or join our Discord. Please tag `dgarnitz` and `danmeier2`.
Our roadmap is outlined in the section below and we would love help in building it out. Our open issues are a great place to start and can be viewed here. If you want to work on something not listed there, we recommend you open an issue with a proposed approach in mind before submitting a PR.
Please tag `dgarnitz` on all PRs.
When submitting a PR, please add unit tests to cover the functionality you have added. Please re-run existing tests to ensure there are no regressions. Run tests from the `src` directory. To run an individual test, use:
```bash
python -m unittest module.tests.test_file.TestClass.test_method
```
To run all the tests in a file, use:
```bash
python -m unittest module.tests.test_file
```
For end-to-end testing, we recommend building and running with docker-compose, but taking down the container you are altering and running it locally on your development machine. This avoids the need to constantly rebuild the images and re-run the containers. Make sure to change the environment variables in your development machine's terminal to the correct values (i.e. `localhost` instead of `rabbitmq` or `postgres`) so that the Docker containers can communicate with your development machine. Once it works locally, you can perform a final test with everything in docker-compose.
Please verify that all changes work with docker-compose before opening a PR.
We also recommend you add verification evidence, such as screenshots, showing that your code works in an end-to-end flow.
- Connectors to other vector databases
- Support for more file types such as `csv`, `word`, `xls`, etc.
- Support for multi-file, directory data ingestion from sources such as Salesforce, Google Drive, etc.
- Retry mechanism
- Langchain & Llama Index integrations
- Support callbacks for writing object metadata to a separate store
- Dynamically configurable vector DB schemas
- Deduplication capabilities