In this article, we'll guide you through setting up your own Retrieval-Augmented Generation (RAG) system. This system lets you upload your PDFs and ask a Large Language Model (LLM) questions about the information they contain. The focus of this tutorial is the blue section of the diagram, i.e., the core RAG pipeline; we won't integrate Gradio at this stage. The tech stack includes:
- LLM: Llama2
- LLM API: llama.cpp service
- Framework: LangChain
- Vector DB: ChromaDB
- Embedding: Sentence-Transformers
The cornerstone of this setup is LangChain, a framework for developing applications powered by language models. LangChain acts as the glue, offering interfaces that connect LLMs with other tools and data sources. However, it's worth noting that LangChain is evolving rapidly, with frequent documentation and API changes. Below, we demonstrate the simplest way to set this up.
Set up your Python environment. In this tutorial, I used conda to create the environment and installed the following libraries in a Jupyter environment.
pip install -r requirements.txt
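For reference, a minimal requirements.txt for this setup might look like the following; the exact package list is an assumption on my part and may differ for your environment (llama-cpp-python is only needed for the LlamaCpp method, and openai only for the ChatOpenAI method shown later).
langchain
langchain-community
chromadb
sentence-transformers
pymupdf
llama-cpp-python
openai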
We start by processing external information and storing it in the database for future knowledge queries. This step corresponds to the orange section of the diagram, specifically 1. Text Splitter and 2. Embedding.
LangChain offers around 55 types of document loaders, including loaders for Word, CSV, PDF, Google Drive, and YouTube. Here, we use PyMuPDFLoader to read in a resume. Note that the PyMuPDF package must be installed before PyMuPDFLoader can be used.
from langchain.document_loaders import PyMuPDFLoader

# Load the resume; the loader returns one Document per PDF page
loader = PyMuPDFLoader("Virtual_characters.pdf")
PDF_data = loader.load()
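As an optional sanity check, you can inspect what the loader returned; PyMuPDFLoader typically produces one Document per PDF page, each with page_content and metadata.
# One Document per page: check the count, metadata, and a text preview
print(len(PDF_data))
print(PDF_data[0].metadata)
print(PDF_data[0].page_content[:200])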
The text splitter splits documents or text into chunks to avoid exceeding the LLM's token limit. For this, we use RecursiveCharacterTextSplitter or CharacterTextSplitter. The main parameters are chunk_size (the maximum number of characters per chunk) and chunk_overlap (the number of characters shared between consecutive chunks).
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=5)
all_splits = text_splitter.split_documents(PDF_data)
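You can inspect the splitter's output in the same way; the number of chunks depends on the document length and the chunk_size you chose.
# Check how many chunks were produced and what the first one looks like
print(len(all_splits))
print(all_splits[0].page_content)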
We use an embedding model to convert the text chunks from the previous step into vectors. LangChain provides interfaces to many embedding models.
from langchain.embeddings import HuggingFaceEmbeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}
embedding = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
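To see what the embedding model produces, you can embed a test string directly; all-MiniLM-L6-v2 returns 384-dimensional vectors. The query text below is just an example.
# Embed a sample query and check the vector dimensionality
sample_vector = embedding.embed_query("What is this resume about?")
print(len(sample_vector))  # 384 for all-MiniLM-L6-v2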
We store the embedding results in a vector DB, using Chroma as the implementation.
# Embed and store the texts
from langchain.vectorstores import Chroma

persist_directory = 'db'
vectordb = Chroma.from_documents(documents=all_splits, embedding=embedding, persist_directory=persist_directory)
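Before connecting the LLM, it's worth confirming that retrieval works on its own; a plain similarity search against the vector store should return the most relevant chunks (the query below is just an example).
# Retrieve the chunks most similar to a test query
docs = vectordb.similarity_search("Who is Alison Hawk?", k=3)
for doc in docs:
    print(doc.page_content)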
There are two ways to initialize your LLM and connect it to LangChain: using LangChain's LlamaCpp interface, or setting up a Llama2 API service by another means, such as llama.cpp's server. Both methods are demonstrated below.
The first method is simpler: use LangChain's LlamaCpp interface to load the model and run the Llama service directly.
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp

model_path = "llama.cpp/models/llama-2-7b-chat/llama-2_q4.gguf"
llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=100,  # number of layers to offload to the GPU
    n_batch=512,       # number of tokens processed per batch
    n_ctx=2048,        # context window size
    f16_kv=True,       # use half-precision for the key/value cache
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),  # stream tokens to stdout
    verbose=True,
)
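Assuming the GGUF model loaded successfully, a quick test prompt confirms that the model responds; the prompt itself is arbitrary.
# Smoke test: the callback handler streams the generated tokens to stdout
llm.invoke("Briefly introduce yourself.")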
If you already have an LLM API service running, such as llama.cpp's built-in server with its OpenAI-compatible endpoint, you can connect to it through LangChain's ChatOpenAI interface.
from langchain.chat_models import ChatOpenAI

# Point ChatOpenAI at the local OpenAI-compatible endpoint; the API key is just a placeholder
llm = ChatOpenAI(openai_api_key='None', openai_api_base='http://127.0.0.1:8080/v1')
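Assuming the llama.cpp server is already running at 127.0.0.1:8080, a quick test call confirms the connection; note that ChatOpenAI returns a message object rather than a plain string.
# Smoke test against the local API service
response = llm.invoke("Say hello in one sentence.")
print(response.content)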
Some LLMs expect model-specific prompt formats. Here, we use ConditionalPromptSelector to choose the prompt based on the model type.
from langchain.chains import LLMChain
from langchain.chains.prompt_selector import ConditionalPromptSelector
from langchain.prompts import PromptTemplate
DEFAULT_LLAMA_SEARCH_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""<<SYS>> \n You are an assistant tasked with improving Google search \
results. \n <</SYS>> \n\n [INST] Generate THREE Google search queries that \
are similar to this question. The output should be a numbered list of questions \
and each should have a question mark at the end: \n\n {question} [/INST]""",
)

DEFAULT_SEARCH_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an assistant tasked with improving Google search \
results. Generate THREE Google search queries that are similar to \
this question. The output should be a numbered list of questions and each \
should have a question mark at the end: {question}""",
)

QUESTION_PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT,
    conditionals=[(lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)],
)
prompt = QUESTION_PROMPT_SELECTOR.get_prompt(llm)
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What is Taiwan known for?"
llm_chain.invoke({"question": question})
We've stored the PDF information in the database and started the LLM service. Now we connect the whole RAG flow:
- The user sends a question.
- Relevant text is retrieved from the vector database.
- The question and the retrieved text are combined and sent to the LLM.
- The LLM answers based on the retrieved information.

First, create a retriever that returns the relevant documents for an unstructured query. Then combine the retriever and the LLM using RetrievalQA; the user's question is passed in at query time.
from langchain.chains import RetrievalQA

retriever = vectordb.as_retriever()
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)
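A note on the parameters above: chain_type="stuff" simply stuffs all retrieved chunks into a single prompt, and the retriever returns the top-k most similar chunks (typically 4 by default). Since our chunk_size of 100 characters is quite small, you may want to retrieve more chunks per question; a sketch:
# Optional: retrieve more chunks per query to compensate for the small chunk size
retriever = vectordb.as_retriever(search_kwargs={"k": 6})
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, verbose=True)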
With the entire RAG pipeline connected, let's ask about the virtual character Alison Hawk described in the PDF.
query = "Tell me about Alison Hawk's career and age"
qa.invoke(query)
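If you also want to see which chunks the answer was based on, RetrievalQA can return the retrieved source documents; a minimal sketch:
# Rebuild the chain so it also returns the retrieved chunks
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
result = qa.invoke(query)
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata)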