This AI Starter Kit is primarily intended to show off the speed of Llama 3 8B in Samba-1 Turbo for low-latency agentic workflows. The kit includes:
- A configurable SambaStudio connector. The connector generates answers from a deployed model.
- A configurable integration with a third-party vector database.
- An implementation of a semantic search workflow using numerous chains via LangGraph.
This sample is ready-to-use. We provide:
- Instructions for setup with SambaStudio or Sambaverse.
- Instructions for running the model as is.
You have to set up your environment before you can run or customize the starter kit.
Clone the starter kit repo.
git clone https://github.com/sambanova/ai-starter-kit.git
The next step sets you up to use one of the models available from SambaNova. It depends on whether you're a SambaNova customer who uses SambaStudio or whether you want to use the publicly available Sambaverse.
For this workshop we will be focusing on SambaStudio, since it hosts the Llama 3 8B model that resides within our Samba-1 Turbo Composition of Experts. Skip the Sambaverse setup unless you would like to test our models on your own via our hosted service. Note that performance will not be optimized when using Sambaverse.
- Create a Sambaverse account at Sambaverse and select your model.
- Get your Sambaverse API key (from the user button).
- In the repo root directory, find the config file sn-ai-starter-kit/.env and specify the Sambaverse API key (with no spaces), as in the following example:
SAMBAVERSE_API_KEY="456789ab-cdef-0123-4567-89abcdef0123"
- In the config file, set the api variable to "sambaverse".
To perform this setup, you will be using a hosted endpoint that has been set up for this workshop. In enterprise settings, you must be a SambaNova customer with a SambaStudio account. The endpoint information will be shared in the workshop. For customers:
- Log in to SambaStudio and get your API authorization key. The steps for getting this key are described here.
- Select the LLM you want to use (e.g. Llama 2 70B chat) and deploy an endpoint for inference. See the SambaStudio endpoint documentation.
- Update the sn-ai-starter-kit/.env config file in the root repo directory. Here's an example:
BASE_URL="https://api-stage.sambanova.net"
PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
API_KEY="89abcdef-0123-4567-89ab-cdef01234567"
- Open the config file, set the variable api to "sambastudio", and save the file.
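As a quick sanity check that the .env file is picked up, the following minimal sketch reads the variables above with python-dotenv. The path and variable names match the example; how the kit's own connector loads and combines them may differ.

```python
# Minimal sketch: confirm the SambaStudio settings in .env are readable.
# Variable names follow the example above; the kit's connector assembles the
# actual endpoint URL internally, so treat this only as a sanity check.
import os
from dotenv import load_dotenv

load_dotenv("sn-ai-starter-kit/.env")  # adjust the path if your .env lives elsewhere

for name in ("BASE_URL", "PROJECT_ID", "ENDPOINT_ID", "API_KEY"):
    print(f"{name} is {'set' if os.getenv(name) else 'MISSING'}")
```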
You have these options to specify the embedding API info:
- Option 1: Use a CPU embedding model
In the config file, set the variable embedding_model to "cpu".
- Option 2: Set a SambaStudio embedding model
To increase inference speed, you can use a SambaStudio E5 embedding model endpoint instead of the default (CPU) Hugging Face embeddings. Follow this guide to deploy your SambaStudio embedding model. For the workshop, we will provide the E5 endpoint with batch size 32 for inference.
NOTE: Be sure to set the batch size model parameter to 32.
- Update the API information for the SambaNova embedding endpoint in the sn-ai-starter-kit/.env file in the root repo directory. For example:
- Assume you have an endpoint with the URL "https://api-stage.sambanova.net/api/predict/nlp/12345678-9abc-def0-1234-56789abcdef0/456789ab-cdef-0123-4567-89abcdef0123"
- You can enter the following in the env file (with no spaces):
EMBED_BASE_URL="https://api-stage.sambanova.net"
EMBED_PROJECT_ID="12345678-9abc-def0-1234-56789abcdef0"
EMBED_ENDPOINT_ID="456789ab-cdef-0123-4567-89abcdef0123"
EMBED_API_KEY="89abcdef-0123-4567-89ab-cdef01234567"
- In the config file, set the variable embedding_model to "sambastudio".
NOTE: Using different embedding models (cpu or sambastudio) may change the results, as well as how the embedding model is set and what its parameters are.
You can see the difference in how they are set in the vectordb.py file (load_embedding_model method).
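For illustration only, that branching could look roughly like the sketch below. This is not the kit's implementation: the model name, the query instruction, and the use of langchain_community's SambaStudioEmbeddings (including the batch_size argument) are assumptions.

```python
# Illustrative sketch of branching on the embedding_model config value;
# see vectordb.py's load_embedding_model for the kit's real implementation.
from langchain_community.embeddings import HuggingFaceInstructEmbeddings


def load_embedding_model(embedding_model: str = "cpu"):
    if embedding_model == "cpu":
        # Default: run an instructor embedding model locally on CPU.
        return HuggingFaceInstructEmbeddings(
            model_name="hkunlp/instructor-large",  # assumed model name
            query_instruction="Represent the query for retrieval: ",
        )
    if embedding_model == "sambastudio":
        # Assumption: the SambaStudio E5 endpoint is reached through
        # langchain_community's SambaStudioEmbeddings, configured from the
        # embedding endpoint settings in .env, with batch size 32 as noted above.
        from langchain_community.embeddings.sambanova import SambaStudioEmbeddings

        return SambaStudioEmbeddings(batch_size=32)
    raise ValueError(f"Unknown embedding_model: {embedding_model}")
```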
We recommend that you run the starter kit in a virtual environment or use a container.
If you want to use a virtualenv or conda environment:
- Install and update pip.
- Mac
cd ai_starter_kit/
python3 -m venv complex_rag_env
source complex_rag_env/bin/activate
pip install --upgrade pip
pip install -r complex_rag/requirements.txt
- Windows
cd ai_starter_kit/
python3 -m venv complex_rag_env
complex_rag_env\Scripts\activate
pip install --upgrade pip
pip install -r complex_rag\requirements.txt
- Run the following command:
streamlit run complex_rag/streamlit/app.py --browser.gatherUsageStats false
After you've deployed the GUI, you can use the starter kit. Follow these steps:
- In the Pick a data source pane, drag and drop or browse for files. The data source can be a Chroma vectorstore or a series of PDF files.
- Click Process to process all loaded PDFs. A vectorstore is created in memory. You can store it on disk if you want.
- In the main panel, you can ask questions about the PDF data.
This workflow uses the AI starter kit as is, with an ingestion, retrieval, and response workflow.
This workflow, included with this starter kit, is an example of parsing and indexing data for subsequent Q&A. The steps are:
- Document parsing: The Python packages PyPDF2, fitz, and unstructured are used to extract text from PDF documents. On the LangChain website, multiple integrations for text extraction from PDF are available. Depending on the quality and the format of the PDF files, this step might require customization for different use cases. For TXT file loading, the default TXT loading implementation of LangChain is used. (An end-to-end sketch of this ingestion flow follows the list.)
- Split data: After the data has been parsed and its content extracted, we need to split the data into chunks of text to be embedded and stored in a vector database. The size of the chunks of text depends on the context (sequence) length offered by the model. Generally, larger context lengths result in better performance. The method used to split text has an impact on performance (for instance, making sure there are no word breaks, sentence breaks, etc.). The downloaded data is split using RecursiveCharacterTextSplitter.
- Embed data: For each chunk of text from the previous step, we use an embeddings model to create a vector representation of the text. These embeddings are used in the storage and retrieval of the most relevant content given a user's query. The split text is embedded using HuggingFaceInstructEmbeddings.
NOTE: For more information about what an embedding is, click here.
- Store embeddings: Embeddings for each chunk, along with content and relevant metadata (such as source documents), are stored in a vector database. The embedding acts as the index in the database. In this template, we store information with each entry, which can be modified to suit your needs. There are several vector database options available, each with their own pros and cons. This starter kit is set up to use Chroma as the vector database because it is a free, open-source option with straightforward setup, but it can easily be updated to use another if desired. In terms of metadata, filename and page, extracted during parsing of the PDF documents, are also attached to the embeddings.
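The sketch below ties the four steps above together end to end. File paths, chunk sizes, and the embedding model name are illustrative placeholders rather than the kit's exact values.

```python
# End-to-end ingestion sketch: parse -> split -> embed -> store.
import fitz  # PyMuPDF

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

# 1. Document parsing: extract text page by page, keeping filename/page metadata.
pdf_path = "docs/example.pdf"  # hypothetical input file
docs = []
with fitz.open(pdf_path) as pdf:
    for page_number, page in enumerate(pdf, start=1):
        docs.append(
            Document(
                page_content=page.get_text(),
                metadata={"filename": pdf_path, "page": page_number},
            )
        )

# 2. Split data: chunk sizes are illustrative and should be tuned to the
#    model's context length.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# 3. Embed data: the same model must be used later to embed queries.
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")

# 4. Store embeddings: an in-memory Chroma vectorstore (pass persist_directory
#    to keep it on disk instead).
vectorstore = Chroma.from_documents(chunks, embedding=embeddings)
```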
This workflow is an example of leveraging data stored in a vector database along with a large language model to enable retrieval-based Q&A over your data. The steps are:
- Embed query: The first step is to convert a user-submitted query into a common representation (an embedding) for subsequent use in identifying the most relevant stored content. Use the same embedding model for query parsing as for generating the stored embeddings. In this starter kit, the query text is embedded using HuggingFaceInstructEmbeddings, the same embedding model used in the ingestion workflow.
- Retrieve relevant content: Next, we use the embedding representation of the query to make a retrieval request to the vector database, which in turn returns the relevant entries (content). The vector database therefore also acts as a retriever for fetching relevant information from the database. (A short retrieval sketch follows below.)
More information about embeddings and their retrieval here
Find more information about Retrieval augmented generation with LangChain here
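Continuing the ingestion sketch above, retrieval can be as simple as the following; the query text and the number of retrieved chunks (k) are illustrative.

```python
# Retrieval sketch: embed the query with the same model (handled internally by
# the vectorstore) and fetch the most relevant chunks.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
relevant_docs = retriever.invoke("What does the document say about pricing?")
for doc in relevant_docs:
    print(doc.metadata["filename"], doc.metadata["page"], doc.page_content[:80])
```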
After the relevant information is retrieved, the content is sent to the LangGraph app, which includes numerous Llama 3 8B calls. Calls at conditional nodes reliably output JSON-formatted strings, which are parsed by LangChain's JSON output parser. The value obtained decides the branch to follow in the graph.
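The sketch below shows that conditional-routing pattern in a minimal form. It is not the kit's actual graph: the node names, prompt, and state schema are invented for illustration, and langchain_community's SambaStudio wrapper is assumed to stand in for the kit's own connector.

```python
# Minimal LangGraph routing sketch: a Llama 3 8B call returns JSON, and the
# parsed value selects the next branch via a conditional edge.
from typing import TypedDict

from langchain_community.llms.sambanova import SambaStudio
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langgraph.graph import END, StateGraph

# Assumption: SambaStudio connection settings are available in the environment
# (see the .env setup above); the kit uses its own connector instead.
llm = SambaStudio()


class GraphState(TypedDict):
    question: str
    route: str


# Hypothetical routing prompt; the kit's prompts differ.
router_prompt = PromptTemplate.from_template(
    "Return a JSON object with a single key 'datasource' whose value is either "
    "'vectorstore' or 'generate', for the question: {question}"
)


def route_question(state: GraphState) -> GraphState:
    # The JSON-formatted LLM output is parsed, and its value picks the branch.
    chain = router_prompt | llm | JsonOutputParser()
    decision = chain.invoke({"question": state["question"]})
    return {**state, "route": decision["datasource"]}


workflow = StateGraph(GraphState)
workflow.add_node("route_question", route_question)
workflow.add_node("retrieve", lambda state: state)  # placeholder node
workflow.add_node("generate", lambda state: state)  # placeholder node
workflow.set_entry_point("route_question")
workflow.add_conditional_edges(
    "route_question",
    lambda state: state["route"],  # value taken from the parsed JSON
    {"vectorstore": "retrieve", "generate": "generate"},
)
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)
app = workflow.compile()
```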
- streamlit (version 1.25.0)
- langchain (version 0.2.1)
- langchain-community (version 0.2.1)
- langgraph (version 0.5.5)
- pyppeteer (version 2.0.0)
- datasets (version 2.19.1)
- sentence_transformers (version 2.2.2)
- instructorembedding (version 1.0.1)
- chromadb (version 0.4.24)
- PyPDF2 (version 3.0.1)
- unstructured_inference (version 0.7.27)
- unstructured[pdf] (version 0.13.3)
- PyMuPDF (version 1.23.4)
- python-dotenv (version 1.0.0)
The following work aims to show the power of SambaNova Systems RDU acceleration, using Samba-1 Turbo. The work herein has been leveraged and adapted from the great folks at LangGraph. Some of the adaptations of the original work also demonstrate how to modularize different components of the LangGraph setup and implement them in Streamlit for rapid, early development. The original tutorial can be found here: