Merge pull request PromtEngineer#356 from PromtEngineer/api_update
API Update
PromtEngineer authored Aug 11, 2023
2 parents cc0cd4d + 13989b7 commit a3ba240
Showing 6 changed files with 188 additions and 99 deletions.
53 changes: 19 additions & 34 deletions README.md
@@ -2,13 +2,16 @@

This project was inspired by the original [privateGPT](https://github.com/imartinez/privateGPT). Most of the description here is inspired by the original privateGPT.

For detailed overview of the project, Watch this [Youtube Video](https://youtu.be/MlyoObdIHyo).
For a detailed overview of the project, watch these videos:
- [Detailed code-walkthrough](https://youtu.be/MlyoObdIHyo)
- [Llama-2 with LocalGPT](https://youtu.be/lbFmceo4D5E)
- [Adding Chat History](https://youtu.be/d7otIM_MCZs)

In this project, I have replaced the GPT4ALL model with the Vicuna-7B model, and we use InstructorEmbeddings instead of the LlamaEmbeddings used in the original privateGPT. Both the embeddings and the LLM run on GPU instead of CPU; CPU support is also available if you do not have a GPU (see below for instructions).

Ask questions to your documents without an internet connection, using the power of LLMs. 100% private, no data leaves your execution environment at any point. You can ingest documents and ask questions without an internet connection!

Built with [LangChain](https://github.com/hwchase17/langchain) and [Vicuna-7B](https://huggingface.co/TheBloke/vicuna-7B-1.1-HF) and [InstructorEmbeddings](https://instructor-embedding.github.io/)
Built with [LangChain](https://github.com/hwchase17/langchain) and [Vicuna-7B](https://huggingface.co/TheBloke/vicuna-7B-1.1-HF) (+ a lot more!) and [InstructorEmbeddings](https://instructor-embedding.github.io/)

# Environment Setup

@@ -148,27 +151,13 @@ CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no

# Run the UI

1. Start by opening up `run_localGPT_API.py` in a code editor of your choice. If you are using gpu skip to step 3.

2. If you are running on cpu change `DEVICE_TYPE = 'cuda'` to `DEVICE_TYPE = 'cpu'`.

- Comment out the following:

```shell
model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"
model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"
LLM = load_model(device_type=DEVICE_TYPE, model_id=model_id, model_basename = model_basename)
```

- Uncomment:
1. Open `constants.py` in an editor of your choice and set the LLM you want to use. By default, the following model will be used:

```shell
model_id = "TheBloke/guanaco-7B-HF" # or some other -HF or .bin model
LLM = load_model(device_type=DEVICE_TYPE, model_id=model_id)
MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"
```

- If you are running gpu there should be nothing to change. Save and close `run_localGPT_API.py`.

3. Open up a terminal and activate your Python environment that contains the dependencies installed from requirements.txt.

4. Navigate to the `/LOCALGPT` directory.
@@ -190,39 +179,35 @@ CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no
By selecting the right local models and leveraging the power of `LangChain`, you can run the entire pipeline locally, without any data leaving your environment, and with reasonable performance.

- `ingest.py` uses `LangChain` tools to parse the document and create embeddings locally using `InstructorEmbeddings`. It then stores the result in a local vector database using the `Chroma` vector store.
- `run_localGPT.py` uses a local LLM (Vicuna-7B in this case) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
- `run_localGPT.py` uses a local LLM to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
- You can replace this local LLM with any other LLM from HuggingFace. Make sure whatever LLM you select is in the HF format.
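
To make this flow concrete, here is a minimal sketch of the query side, assembled from the calls used in `localGPT_UI.py` and `run_localGPT.py` below; it assumes `ingest.py` has already populated the `Chroma` index in `PERSIST_DIRECTORY`:

```python
# Minimal query-side sketch; assumes ingest.py has already built the index.
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, MODEL_ID, MODEL_BASENAME
from run_localGPT import load_model

# Recreate the same embeddings used at ingestion time.
embeddings = HuggingFaceInstructEmbeddings(model_name=EMBEDDING_MODEL_NAME)

# Open the persisted Chroma vector store.
db = Chroma(
    persist_directory=PERSIST_DIRECTORY,
    embedding_function=embeddings,
    client_settings=CHROMA_SETTINGS,
)

# Load the local LLM configured in constants.py.
llm = load_model(device_type="cuda", model_id=MODEL_ID, model_basename=MODEL_BASENAME)

# Wire retriever + LLM into a RetrievalQA chain and ask a question.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
    return_source_documents=True,
)
response = qa("What is this document about?")
print(response["result"])
```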

# How to select different LLM models?

The following instructions explain how to select a different LLM model for generating your responses:

1. Open up `run_localGPT.py`
2. Go to `def main(device_type, show_sources)`
3. Go to the comment where it says `# load the LLM for generating Natural Language responses`
4. Below it, it details a bunch of examples on models from HuggingFace that have already been tested to be run with the original trained model (ending with HF or have a .bin in its "Files and versions"), and quantized models (ending with GPTQ or have a .no-act-order or .safetensors in its "Files and versions").
5. For models that end with HF or have a .bin inside its "Files and versions" on its HuggingFace page.
1. Open up `constants.py` in the editor of your choice.
2. Change the `MODEL_ID` and `MODEL_BASENAME`. If you are using a quantized model (`GGML`, `GPTQ`), you will need to provide `MODEL_BASENAME`. For unquantized models, set `MODEL_BASENAME` to `None` (see the sketch after this list).
5. There are a number of example models from HuggingFace that have already been tested: original trained models (ending with HF, or with a .bin file in their "Files and versions"), and quantized models (ending with GPTQ, or with .no-act-order or .safetensors files in their "Files and versions").
6. For models that end with HF or have a .bin file in "Files and versions" on their HuggingFace page:

- Make sure you have a model_id selected. For example -> `model_id = "TheBloke/guanaco-7B-HF"`
- Make sure you have a `MODEL_ID` selected. For example -> `MODEL_ID = "TheBloke/guanaco-7B-HF"`
- If you go to its HuggingFace [repo](https://huggingface.co/TheBloke/guanaco-7B-HF) and go to "Files and versions" you will notice model files that end with a .bin extension.
- Any model files that contain .bin extensions will be run with the following code where the `# load the LLM for generating Natural Language responses` comment is found.
- `model_id = "TheBloke/guanaco-7B-HF"`

`llm = load_model(device_type, model_id=model_id)`
- `MODEL_ID = "TheBloke/guanaco-7B-HF"`

6. For models that contain GPTQ in its name and or have a .no-act-order or .safetensors extension inside its "Files and versions on its HuggingFace page.
7. For models that contain GPTQ in the name and/or have a .no-act-order or .safetensors extension in "Files and versions" on their HuggingFace page:

- Make sure you have a `MODEL_ID` selected. For example -> `MODEL_ID = "TheBloke/wizardLM-7B-GPTQ"`
- You will also need its model basename file selected. For example -> `MODEL_BASENAME = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"`
- If you go to its HuggingFace [repo](https://huggingface.co/TheBloke/wizardLM-7B-GPTQ) and go to "Files and versions" you will notice a model file that ends with a .safetensors extension.
- Any model files that contain no-act-order or .safetensors extensions will be run with the following code where the `# load the LLM for generating Natural Language responses` comment is found.
- `model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"`
- `MODEL_ID = "TheBloke/WizardLM-7B-uncensored-GPTQ"`

`model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"`
`MODEL_BASENAME = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"`

`llm = load_model(device_type, model_id=model_id, model_basename = model_basename)`

7. Comment out all other instances of `model_id="other model names"`, `model_basename=other base model names`, and `llm = load_model(args*)`
8. Comment out all other instances of `MODEL_ID="other model names"`, `MODEL_BASENAME="other base model names"`, and `llm = load_model(args*)`.
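
For reference, here is a sketch of the relevant `constants.py` settings for each model family, using only model IDs already listed in `constants.py` below; keep exactly one `MODEL_ID`/`MODEL_BASENAME` pair active and comment out the rest:

```python
# Quantized GGML model (the default in this commit) - runs via llama.cpp:
MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"

# Unquantized HF model - no basename needed:
# MODEL_ID = "TheBloke/guanaco-7B-HF"
# MODEL_BASENAME = None

# Quantized GPTQ model - basename points at the .safetensors file:
# MODEL_ID = "TheBloke/WizardLM-7B-uncensored-GPTQ"
# MODEL_BASENAME = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"
```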

# System Requirements

36 changes: 36 additions & 0 deletions constants.py
@@ -40,3 +40,39 @@
# You can also choose a smaller model, don't forget to change HuggingFaceInstructEmbeddings
# to HuggingFaceEmbeddings in both ingest.py and run_localGPT.py
# EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"

# Select the Model ID and model_basename
# load the LLM for generating Natural Language responses

MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"

# for HF models
# MODEL_ID = "TheBloke/vicuna-7B-1.1-HF"
# MODEL_BASENAME = None
# MODEL_ID = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"
# MODEL_ID = "TheBloke/guanaco-7B-HF"
# MODEL_ID = 'NousResearch/Nous-Hermes-13b' # Requires ~ 23GB VRAM. Using STransformers
# alongside will 100% create OOM on 24GB cards.
# llm = load_model(device_type, model_id=model_id)

# for GPTQ (quantized) models
# MODEL_ID = "TheBloke/Nous-Hermes-13B-GPTQ"
# MODEL_BASENAME = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"
# MODEL_ID = "TheBloke/WizardLM-30B-Uncensored-GPTQ"
# MODEL_BASENAME = "WizardLM-30B-Uncensored-GPTQ-4bit.act-order.safetensors" # Requires
# ~21GB VRAM. Using STransformers alongside can potentially create OOM on 24GB cards.
# MODEL_ID = "TheBloke/wizardLM-7B-GPTQ"
# MODEL_BASENAME = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"
# MODEL_ID = "TheBloke/WizardLM-7B-uncensored-GPTQ"
# MODEL_BASENAME = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"

# for GGML (quantized cpu+gpu+mps) models - check if they support llama.cpp
# MODEL_ID = "TheBloke/wizard-vicuna-13B-GGML"
# MODEL_BASENAME = "wizard-vicuna-13B.ggmlv3.q4_0.bin"
# MODEL_BASENAME = "wizard-vicuna-13B.ggmlv3.q6_K.bin"
# MODEL_BASENAME = "wizard-vicuna-13B.ggmlv3.q2_K.bin"
# MODEL_ID = "TheBloke/orca_mini_3B-GGML"
# MODEL_BASENAME = "orca-mini-3b.ggmlv3.q4_0.bin"


119 changes: 119 additions & 0 deletions localGPT_UI.py
@@ -0,0 +1,119 @@
import torch
import subprocess
import streamlit as st
from run_localGPT import load_model
from langchain.vectorstores import Chroma
from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, MODEL_ID, MODEL_BASENAME
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.chains import RetrievalQA
from streamlit_extras.add_vertical_space import add_vertical_space
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory



def model_memory():
    # Adding history to the model.
    template = """Use the following pieces of context to answer the question at the end. If you don't know the answer,\
just say that you don't know, don't try to make up an answer.
{context}
{history}
Question: {question}
Helpful Answer:"""

    prompt = PromptTemplate(input_variables=["history", "context", "question"], template=template)
    memory = ConversationBufferMemory(input_key="question", memory_key="history")

    return prompt, memory
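
# The (prompt, memory) pair returned above is consumed further down by
# RetrievalQA.from_chain_type(..., chain_type_kwargs={"prompt": prompt, "memory": memory});
# ConversationBufferMemory accumulates prior turns under the "history" key
# that the template interpolates on each call.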

# Sidebar contents
with st.sidebar:
    st.title('🤗💬 Converse with your Data')
    st.markdown('''
    ## About
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [LocalGPT](https://github.com/PromtEngineer/localGPT)
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by [Prompt Engineer](https://youtube.com/@engineerprompt)')


DEVICE_TYPE = "cuda" if torch.cuda.is_available() else "cpu"



if "result" not in st.session_state:
# Run the document ingestion process.
run_langest_commands = ["python", "ingest.py"]
run_langest_commands.append("--device_type")
run_langest_commands.append(DEVICE_TYPE)

result = subprocess.run(run_langest_commands, capture_output=True)
st.session_state.result = result
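
# Storing the ingestion result in st.session_state keeps Streamlit's
# rerun-on-every-interaction model from re-running ingest.py each time
# a widget changes.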

# Define the retriever
# Load the vectorstore
if "EMBEDDINGS" not in st.session_state:
EMBEDDINGS = HuggingFaceInstructEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs={"device": DEVICE_TYPE})
st.session_state.EMBEDDINGS = EMBEDDINGS

if "DB" not in st.session_state:
DB = Chroma(
persist_directory=PERSIST_DIRECTORY,
embedding_function=st.session_state.EMBEDDINGS,
client_settings=CHROMA_SETTINGS,
)
st.session_state.DB = DB

if "RETRIEVER" not in st.session_state:
RETRIEVER = DB.as_retriever()
st.session_state.RETRIEVER = RETRIEVER

if "LLM" not in st.session_state:
LLM = load_model(device_type=DEVICE_TYPE, model_id=MODEL_ID, model_basename=MODEL_BASENAME)
st.session_state["LLM"] = LLM




if "QA" not in st.session_state:

prompt, memory = model_memory()

QA = RetrievalQA.from_chain_type(
llm=LLM,
chain_type="stuff",
retriever=RETRIEVER,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt, "memory": memory},
)
st.session_state["QA"] = QA

st.title('LocalGPT App 💬')
# Create a text input box for the user
prompt = st.text_input('Input your prompt here')

# If the user hits enter
if prompt:
    # Then pass the prompt to the LLM
    response = st.session_state["QA"](prompt)
    answer, docs = response["result"], response["source_documents"]
    # ...and write it out to the screen
    st.write(answer)

    # With a streamlit expander
    with st.expander('Document Similarity Search'):
        # Find the relevant pages
        search = st.session_state.DB.similarity_search_with_score(prompt)
        # Write out the source file and content for each hit
        for i, doc in enumerate(search):
            st.write(f"Source Document # {i+1} : {doc[0].metadata['source'].split('/')[-1]}")
            st.write(doc[0].page_content)
            st.write("--------------------------------")
4 changes: 4 additions & 0 deletions requirements.txt
@@ -24,5 +24,9 @@ click
flask
requests

# Streamlit related
streamlit
streamlit-extras

# Excel File Manipulation
openpyxl
48 changes: 8 additions & 40 deletions run_localGPT.py
@@ -21,7 +21,7 @@
pipeline,
)

from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY
from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, MODEL_ID, MODEL_BASENAME


def load_model(device_type, model_id, model_basename=None):
@@ -192,53 +192,21 @@ def main(device_type, show_sources):
client_settings=CHROMA_SETTINGS,
)
retriever = db.as_retriever()

# load the LLM for generating Natural Language responses

# for HF models
# model_id = "TheBloke/vicuna-7B-1.1-HF"
# model_basename = None
# model_id = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"
# model_id = "TheBloke/guanaco-7B-HF"
# model_id = 'NousResearch/Nous-Hermes-13b' # Requires ~ 23GB VRAM. Using STransformers
# alongside will 100% create OOM on 24GB cards.
# llm = load_model(device_type, model_id=model_id)

# for GPTQ (quantized) models
# model_id = "TheBloke/Nous-Hermes-13B-GPTQ"
# model_basename = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"
# model_id = "TheBloke/WizardLM-30B-Uncensored-GPTQ"
# model_basename = "WizardLM-30B-Uncensored-GPTQ-4bit.act-order.safetensors" # Requires
# ~21GB VRAM. Using STransformers alongside can potentially create OOM on 24GB cards.
# model_id = "TheBloke/wizardLM-7B-GPTQ"
# model_basename = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"
# model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"
# model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"

# for GGML (quantized cpu+gpu+mps) models - check if they support llama.cpp
# model_id = "TheBloke/wizard-vicuna-13B-GGML"
# model_basename = "wizard-vicuna-13B.ggmlv3.q4_0.bin"
# model_basename = "wizard-vicuna-13B.ggmlv3.q6_K.bin"
# model_basename = "wizard-vicuna-13B.ggmlv3.q2_K.bin"
# model_id = "TheBloke/orca_mini_3B-GGML"
# model_basename = "orca-mini-3b.ggmlv3.q4_0.bin"

model_id = "TheBloke/Llama-2-7B-Chat-GGML"
model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"


template = """Use the following pieces of context to answer the question at the end. If you don't know the answer,\
just say that you don't know, don't try to make up an answer.
{context}
{history}
Question: {question}
Helpful Answer:"""

prompt = PromptTemplate(input_variables=["history", "context", "question"], template=template)
memory = ConversationBufferMemory(input_key="question", memory_key="history")

llm = load_model(device_type, model_id=model_id, model_basename=model_basename)
llm = load_model(device_type, model_id=MODEL_ID, model_basename=MODEL_BASENAME)

qa = RetrievalQA.from_chain_type(
llm=llm,
27 changes: 2 additions & 25 deletions run_localGPT_API.py
@@ -25,7 +25,7 @@
)
from werkzeug.utils import secure_filename

from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY
from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, MODEL_ID, MODEL_BASENAME

DEVICE_TYPE = "cuda" if torch.cuda.is_available() else "cpu"
SHOW_SOURCES = True
@@ -64,30 +64,7 @@

RETRIEVER = DB.as_retriever()

# for HF models
# model_id = "TheBloke/vicuna-7B-1.1-HF"
# model_id = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"
# model_id = "TheBloke/guanaco-7B-HF"
# model_id = 'NousResearch/Nous-Hermes-13b' # Requires ~ 23GB VRAM.
# Using STransformers alongside will 100% create OOM on 24GB cards.
# LLM = load_model(device_type=DEVICE_TYPE, model_id=model_id)

# for GPTQ (quantized) models
# model_id = "TheBloke/Nous-Hermes-13B-GPTQ"
# model_basename = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"
# model_id = "TheBloke/WizardLM-30B-Uncensored-GPTQ"
# model_basename = "WizardLM-30B-Uncensored-GPTQ-4bit.act-order.safetensors"
# Requires ~21GB VRAM. Using STransformers alongside can potentially create OOM on 24GB cards.
# model_id = "TheBloke/wizardLM-7B-GPTQ"
# model_basename = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"

# model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"
# model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"

model_id = "TheBloke/Llama-2-7B-Chat-GGML"
model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"

LLM = load_model(device_type=DEVICE_TYPE, model_id=model_id, model_basename=model_basename)
LLM = load_model(device_type=DEVICE_TYPE, model_id=MODEL_ID, model_basename=MODEL_BASENAME)

QA = RetrievalQA.from_chain_type(
llm=LLM, chain_type="stuff", retriever=RETRIEVER, return_source_documents=SHOW_SOURCES