Merge pull request PromtEngineer#356 from PromtEngineer/api_update
API Update
PromtEngineer authored Aug 11, 2023
2 parents cc0cd4d + 13989b7 commit a3ba240
Showing 6 changed files with 188 additions and 99 deletions.
53 changes: 19 additions & 34 deletions README.md
@@ -2,13 +2,16 @@

This project was inspired by the original [privateGPT](https://github.com/imartinez/privateGPT). Most of the description here is inspired by the original privateGPT.

For detailed overview of the project, Watch this [Youtube Video](https://youtu.be/MlyoObdIHyo).
For a detailed overview of the project, watch these videos:
- [Detailed code-walkthrough](https://youtu.be/MlyoObdIHyo)
- [Llama-2 with LocalGPT](https://youtu.be/lbFmceo4D5E)
- [Adding Chat History](https://youtu.be/d7otIM_MCZs)

In this project, I have replaced the GPT4ALL model with the Vicuna-7B model, and we use InstructorEmbeddings instead of the LlamaEmbeddings used in the original privateGPT. Both the embeddings and the LLM run on GPU instead of CPU; CPU support is also available if you do not have a GPU (see below for instructions).

Ask questions to your documents without an internet connection, using the power of LLMs. 100% private, no data leaves your execution environment at any point. You can ingest documents and ask questions without an internet connection!

Built with [LangChain](https://github.com/hwchase17/langchain) and [Vicuna-7B](https://huggingface.co/TheBloke/vicuna-7B-1.1-HF) and [InstructorEmbeddings](https://instructor-embedding.github.io/)
Built with [LangChain](https://github.com/hwchase17/langchain) and [Vicuna-7B](https://huggingface.co/TheBloke/vicuna-7B-1.1-HF) (+ a lot more!) and [InstructorEmbeddings](https://instructor-embedding.github.io/)

# Environment Setup

@@ -148,27 +151,13 @@ CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no

# Run the UI

1. Start by opening up `run_localGPT_API.py` in a code editor of your choice. If you are using gpu skip to step 3.

2. If you are running on cpu change `DEVICE_TYPE = 'cuda'` to `DEVICE_TYPE = 'cpu'`.

- Comment out the following:

```shell
model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"
model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"
LLM = load_model(device_type=DEVICE_TYPE, model_id=model_id, model_basename = model_basename)
```

- Uncomment:
1. Open `constants.py` in an editor of your choice and set the LLM you want to use. By default, the following model will be used:

```shell
model_id = "TheBloke/guanaco-7B-HF" # or some other -HF or .bin model
LLM = load_model(device_type=DEVICE_TYPE, model_id=model_id)
MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"
```

- If you are running gpu there should be nothing to change. Save and close `run_localGPT_API.py`.

3. Open up a terminal and activate your Python environment that contains the dependencies installed from requirements.txt.

4. Navigate to the `/LOCALGPT` directory.
@@ -190,39 +179,35 @@ CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no
By selecting the right local models and leveraging the power of `LangChain`, you can run the entire pipeline locally, without any data leaving your environment, and with reasonable performance.

- `ingest.py` uses `LangChain` tools to parse the document and create embeddings locally using `InstructorEmbeddings`. It then stores the result in a local vector database using the `Chroma` vector store.
- `run_localGPT.py` uses a local LLM (Vicuna-7B in this case) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
- `run_localGPT.py` uses a local LLM to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
- You can replace this local LLM with any other LLM from HuggingFace. Make sure whatever LLM you select is in the HF format.
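
To make this flow concrete, here is a minimal sketch of the query side, assembled from the calls used in `localGPT_UI.py` and `run_localGPT.py` below; it assumes `ingest.py` has already populated the `Chroma` index in `PERSIST_DIRECTORY`:

```python
# Minimal query-side sketch; assumes ingest.py has already built the index.
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, MODEL_ID, MODEL_BASENAME
from run_localGPT import load_model

# Recreate the same embeddings used at ingestion time.
embeddings = HuggingFaceInstructEmbeddings(model_name=EMBEDDING_MODEL_NAME)

# Open the persisted Chroma vector store.
db = Chroma(
    persist_directory=PERSIST_DIRECTORY,
    embedding_function=embeddings,
    client_settings=CHROMA_SETTINGS,
)

# Load the local LLM configured in constants.py.
llm = load_model(device_type="cuda", model_id=MODEL_ID, model_basename=MODEL_BASENAME)

# Wire retriever + LLM into a RetrievalQA chain and ask a question.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
    return_source_documents=True,
)
response = qa("What is this document about?")
print(response["result"])
```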

# How to select different LLM models?

The following instructions explain how to select a different LLM model for generating your responses:

1. Open up `run_localGPT.py`
2. Go to `def main(device_type, show_sources)`
3. Go to the comment where it says `# load the LLM for generating Natural Language responses`
4. Below it, it details a bunch of examples on models from HuggingFace that have already been tested to be run with the original trained model (ending with HF or have a .bin in its "Files and versions"), and quantized models (ending with GPTQ or have a .no-act-order or .safetensors in its "Files and versions").
5. For models that end with HF or have a .bin inside its "Files and versions" on its HuggingFace page.
1. Open up `constants.py` in the editor of your choice.
2. Change the `MODEL_ID` and `MODEL_BASENAME`. If you are using a quantized model (`GGML`, `GPTQ`), you will need to provide `MODEL_BASENAME`. For unquantized models, set `MODEL_BASENAME` to `None` (see the sketch after this list).
5. There are a number of example models from HuggingFace that have already been tested: original trained models (ending with HF, or with a .bin file in their "Files and versions"), and quantized models (ending with GPTQ, or with .no-act-order or .safetensors files in their "Files and versions").
6. For models that end with HF or have a .bin file in "Files and versions" on their HuggingFace page:

- Make sure you have a model_id selected. For example -> `model_id = "TheBloke/guanaco-7B-HF"`
- Make sure you have a `MODEL_ID` selected. For example -> `MODEL_ID = "TheBloke/guanaco-7B-HF"`
- If you go to its HuggingFace [repo](https://huggingface.co/TheBloke/guanaco-7B-HF) and go to "Files and versions" you will notice model files that end with a .bin extension.
- Any model files that contain .bin extensions will be run with the following code where the `# load the LLM for generating Natural Language responses` comment is found.
- `model_id = "TheBloke/guanaco-7B-HF"`

`llm = load_model(device_type, model_id=model_id)`
- `MODEL_ID = "TheBloke/guanaco-7B-HF"`

6. For models that contain GPTQ in its name and or have a .no-act-order or .safetensors extension inside its "Files and versions on its HuggingFace page.
7. For models that contain GPTQ in the name and/or have a .no-act-order or .safetensors extension in "Files and versions" on their HuggingFace page:

- Make sure you have a `MODEL_ID` selected. For example -> `MODEL_ID = "TheBloke/wizardLM-7B-GPTQ"`
- You will also need its model basename file selected. For example -> `MODEL_BASENAME = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"`
- If you go to its HuggingFace [repo](https://huggingface.co/TheBloke/wizardLM-7B-GPTQ) and go to "Files and versions" you will notice a model file that ends with a .safetensors extension.
- Any model files that contain no-act-order or .safetensors extensions will be run with the following code where the `# load the LLM for generating Natural Language responses` comment is found.
- `model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"`
- `MODEL_ID = "TheBloke/WizardLM-7B-uncensored-GPTQ"`

`model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"`
`MODEL_BASENAME = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"`

`llm = load_model(device_type, model_id=model_id, model_basename = model_basename)`

7. Comment out all other instances of `model_id="other model names"`, `model_basename=other base model names`, and `llm = load_model(args*)`
8. Comment out all other instances of `MODEL_ID="other model names"`, `MODEL_BASENAME="other base model names"`, and `llm = load_model(args*)`.
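
For reference, here is a sketch of the relevant `constants.py` settings for each model family, using only model IDs already listed in `constants.py` below; keep exactly one `MODEL_ID`/`MODEL_BASENAME` pair active and comment out the rest:

```python
# Quantized GGML model (the default in this commit) - runs via llama.cpp:
MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"

# Unquantized HF model - no basename needed:
# MODEL_ID = "TheBloke/guanaco-7B-HF"
# MODEL_BASENAME = None

# Quantized GPTQ model - basename points at the .safetensors file:
# MODEL_ID = "TheBloke/WizardLM-7B-uncensored-GPTQ"
# MODEL_BASENAME = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"
```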

# System Requirements

36 changes: 36 additions & 0 deletions constants.py
@@ -40,3 +40,39 @@
# You can also choose a smaller model, don't forget to change HuggingFaceInstructEmbeddings
# to HuggingFaceEmbeddings in both ingest.py and run_localGPT.py
# EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"

# Select the Model ID and model_basename
# load the LLM for generating Natural Language responses

MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"

# for HF models
# MODEL_ID = "TheBloke/vicuna-7B-1.1-HF"
# MODEL_BASENAME = None
# MODEL_ID = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"
# MODEL_ID = "TheBloke/guanaco-7B-HF"
# MODEL_ID = 'NousResearch/Nous-Hermes-13b' # Requires ~ 23GB VRAM. Using STransformers
# alongside will 100% create OOM on 24GB cards.
# llm = load_model(device_type, model_id=model_id)

# for GPTQ (quantized) models
# MODEL_ID = "TheBloke/Nous-Hermes-13B-GPTQ"
# MODEL_BASENAME = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"
# MODEL_ID = "TheBloke/WizardLM-30B-Uncensored-GPTQ"
# MODEL_BASENAME = "WizardLM-30B-Uncensored-GPTQ-4bit.act-order.safetensors" # Requires
# ~21GB VRAM. Using STransformers alongside can potentially create OOM on 24GB cards.
# MODEL_ID = "TheBloke/wizardLM-7B-GPTQ"
# MODEL_BASENAME = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"
# MODEL_ID = "TheBloke/WizardLM-7B-uncensored-GPTQ"
# MODEL_BASENAME = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"

# for GGML (quantized cpu+gpu+mps) models - check if they support llama.cpp
# MODEL_ID = "TheBloke/wizard-vicuna-13B-GGML"
# MODEL_BASENAME = "wizard-vicuna-13B.ggmlv3.q4_0.bin"
# MODEL_BASENAME = "wizard-vicuna-13B.ggmlv3.q6_K.bin"
# MODEL_BASENAME = "wizard-vicuna-13B.ggmlv3.q2_K.bin"
# MODEL_ID = "TheBloke/orca_mini_3B-GGML"
# MODEL_BASENAME = "orca-mini-3b.ggmlv3.q4_0.bin"


119 changes: 119 additions & 0 deletions localGPT_UI.py
@@ -0,0 +1,119 @@
import torch
import subprocess
import streamlit as st
from run_localGPT import load_model
from langchain.vectorstores import Chroma
from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, MODEL_ID, MODEL_BASENAME
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.chains import RetrievalQA
from streamlit_extras.add_vertical_space import add_vertical_space
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory



def model_memory():
    # Adding history to the model.
    template = """Use the following pieces of context to answer the question at the end. If you don't know the answer,\
just say that you don't know, don't try to make up an answer.
{context}
{history}
Question: {question}
Helpful Answer:"""

    prompt = PromptTemplate(input_variables=["history", "context", "question"], template=template)
    memory = ConversationBufferMemory(input_key="question", memory_key="history")

    return prompt, memory
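
# The (prompt, memory) pair returned above is consumed further down by
# RetrievalQA.from_chain_type(..., chain_type_kwargs={"prompt": prompt, "memory": memory});
# ConversationBufferMemory accumulates prior turns under the "history" key
# that the template interpolates on each call.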

# Sidebar contents
with st.sidebar:
    st.title('🤗💬 Converse with your Data')
    st.markdown('''
    ## About
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [LangChain](https://python.langchain.com/)
    - [LocalGPT](https://github.com/PromtEngineer/localGPT)
    ''')
    add_vertical_space(5)
    st.write('Made with ❤️ by [Prompt Engineer](https://youtube.com/@engineerprompt)')


DEVICE_TYPE = "cuda" if torch.cuda.is_available() else "cpu"



if "result" not in st.session_state:
# Run the document ingestion process.
run_langest_commands = ["python", "ingest.py"]
run_langest_commands.append("--device_type")
run_langest_commands.append(DEVICE_TYPE)

result = subprocess.run(run_langest_commands, capture_output=True)
st.session_state.result = result
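
# Storing the ingestion result in st.session_state keeps Streamlit's
# rerun-on-every-interaction model from re-running ingest.py each time
# a widget changes.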

# Define the retriever
# Load the vectorstore
if "EMBEDDINGS" not in st.session_state:
EMBEDDINGS = HuggingFaceInstructEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs={"device": DEVICE_TYPE})
st.session_state.EMBEDDINGS = EMBEDDINGS

if "DB" not in st.session_state:
DB = Chroma(
persist_directory=PERSIST_DIRECTORY,
embedding_function=st.session_state.EMBEDDINGS,
client_settings=CHROMA_SETTINGS,
)
st.session_state.DB = DB

if "RETRIEVER" not in st.session_state:
RETRIEVER = DB.as_retriever()
st.session_state.RETRIEVER = RETRIEVER

if "LLM" not in st.session_state:
LLM = load_model(device_type=DEVICE_TYPE, model_id=MODEL_ID, model_basename=MODEL_BASENAME)
st.session_state["LLM"] = LLM




if "QA" not in st.session_state:

prompt, memory = model_memory()

QA = RetrievalQA.from_chain_type(
llm=LLM,
chain_type="stuff",
retriever=RETRIEVER,
return_source_documents=True,
chain_type_kwargs={"prompt": prompt, "memory": memory},
)
st.session_state["QA"] = QA

st.title('LocalGPT App 💬')
# Create a text input box for the user
prompt = st.text_input('Input your prompt here')

# If the user hits enter
if prompt:
    # Then pass the prompt to the LLM
    response = st.session_state["QA"](prompt)
    answer, docs = response["result"], response["source_documents"]
    # ...and write it out to the screen
    st.write(answer)

    # With a streamlit expander
    with st.expander('Document Similarity Search'):
        # Find the relevant pages
        search = st.session_state.DB.similarity_search_with_score(prompt)
        # Write out the source file and content for each hit
        for i, doc in enumerate(search):
            st.write(f"Source Document # {i+1} : {doc[0].metadata['source'].split('/')[-1]}")
            st.write(doc[0].page_content)
            st.write("--------------------------------")
4 changes: 4 additions & 0 deletions requirements.txt
@@ -24,5 +24,9 @@ click
flask
requests

# Streamlit related
streamlit
streamlit-extras

# Excel File Manipulation
openpyxl
48 changes: 8 additions & 40 deletions run_localGPT.py
@@ -21,7 +21,7 @@
pipeline,
)

from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY
from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, MODEL_ID, MODEL_BASENAME


def load_model(device_type, model_id, model_basename=None):
@@ -192,53 +192,21 @@ def main(device_type, show_sources):
client_settings=CHROMA_SETTINGS,
)
retriever = db.as_retriever()

# load the LLM for generating Natural Language responses

# for HF models
# model_id = "TheBloke/vicuna-7B-1.1-HF"
# model_basename = None
# model_id = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"
# model_id = "TheBloke/guanaco-7B-HF"
# model_id = 'NousResearch/Nous-Hermes-13b' # Requires ~ 23GB VRAM. Using STransformers
# alongside will 100% create OOM on 24GB cards.
# llm = load_model(device_type, model_id=model_id)

# for GPTQ (quantized) models
# model_id = "TheBloke/Nous-Hermes-13B-GPTQ"
# model_basename = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"
# model_id = "TheBloke/WizardLM-30B-Uncensored-GPTQ"
# model_basename = "WizardLM-30B-Uncensored-GPTQ-4bit.act-order.safetensors" # Requires
# ~21GB VRAM. Using STransformers alongside can potentially create OOM on 24GB cards.
# model_id = "TheBloke/wizardLM-7B-GPTQ"
# model_basename = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"
# model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"
# model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"

# for GGML (quantized cpu+gpu+mps) models - check if they support llama.cpp
# model_id = "TheBloke/wizard-vicuna-13B-GGML"
# model_basename = "wizard-vicuna-13B.ggmlv3.q4_0.bin"
# model_basename = "wizard-vicuna-13B.ggmlv3.q6_K.bin"
# model_basename = "wizard-vicuna-13B.ggmlv3.q2_K.bin"
# model_id = "TheBloke/orca_mini_3B-GGML"
# model_basename = "orca-mini-3b.ggmlv3.q4_0.bin"

model_id = "TheBloke/Llama-2-7B-Chat-GGML"
model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"


template = """Use the following pieces of context to answer the question at the end. If you don't know the answer,\
just say that you don't know, don't try to make up an answer.
{context}
{history}
Question: {question}
Helpful Answer:"""

prompt = PromptTemplate(input_variables=["history", "context", "question"], template=template)
memory = ConversationBufferMemory(input_key="question", memory_key="history")

llm = load_model(device_type, model_id=model_id, model_basename=model_basename)
llm = load_model(device_type, model_id=MODEL_ID, model_basename=MODEL_BASENAME)

qa = RetrievalQA.from_chain_type(
llm=llm,
27 changes: 2 additions & 25 deletions run_localGPT_API.py
@@ -25,7 +25,7 @@
)
from werkzeug.utils import secure_filename

from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY
from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, MODEL_ID, MODEL_BASENAME

DEVICE_TYPE = "cuda" if torch.cuda.is_available() else "cpu"
SHOW_SOURCES = True
@@ -64,30 +64,7 @@

RETRIEVER = DB.as_retriever()

# for HF models
# model_id = "TheBloke/vicuna-7B-1.1-HF"
# model_id = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"
# model_id = "TheBloke/guanaco-7B-HF"
# model_id = 'NousResearch/Nous-Hermes-13b' # Requires ~ 23GB VRAM.
# Using STransformers alongside will 100% create OOM on 24GB cards.
# LLM = load_model(device_type=DEVICE_TYPE, model_id=model_id)

# for GPTQ (quantized) models
# model_id = "TheBloke/Nous-Hermes-13B-GPTQ"
# model_basename = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"
# model_id = "TheBloke/WizardLM-30B-Uncensored-GPTQ"
# model_basename = "WizardLM-30B-Uncensored-GPTQ-4bit.act-order.safetensors"
# Requires ~21GB VRAM. Using STransformers alongside can potentially create OOM on 24GB cards.
# model_id = "TheBloke/wizardLM-7B-GPTQ"
# model_basename = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"

# model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"
# model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"

model_id = "TheBloke/Llama-2-7B-Chat-GGML"
model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"

LLM = load_model(device_type=DEVICE_TYPE, model_id=model_id, model_basename=model_basename)
LLM = load_model(device_type=DEVICE_TYPE, model_id=MODEL_ID, model_basename=MODEL_BASENAME)

QA = RetrievalQA.from_chain_type(
llm=LLM, chain_type="stuff", retriever=RETRIEVER, return_source_documents=SHOW_SOURCES