Environment upgrade & Modules rework
Python to 3.11
Torch to 2.2.2
Deepspeed updated to match new environment
Modules rework so that its easier to inject text into the prompt
kimjammer committed Apr 9, 2024
1 parent f1372f6 commit 1f93c1e
Showing 17 changed files with 209 additions and 171 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,7 +1,7 @@
venv/
models/
voices/
deepspeed-0.11.2+cuda118-cp310-cp310-win_amd64.whl
deepspeed-0.14.0+cu118-cp311-cp311-win_amd64.whl
.fleet/
__pycache__/
.env
64 changes: 34 additions & 30 deletions README.md
@@ -15,14 +15,14 @@ running [Mistral 7B Instruct v0.2 GPTQ](https://huggingface.co/TheBloke/Mistral-
ExLlamaV2_HF loader with cache_8bit turned on. The openai api extension must be turned on, as this is how we interact
with the LLM. text-generation-webui and the LLM must be installed and started separately.

Alternatively, you can load any other model into text-generation-webui or modify llmWrapper.py to point to any other
OpenAI-compatible endpoint.
Alternatively, you can load any other model into text-generation-webui or modify constants.py to point to any other
OpenAI-compatible endpoint. Note that this project uses some API parameters that are only supported by text-generation-webui.
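Since the endpoint only has to speak the OpenAI protocol, a plain completions request is enough to smoke-test whatever backend constants.py points at. The sketch below is only an illustration (the project's real request code in llmWrapper.py streams tokens with sseclient and passes extra parameters); the endpoint URL matches the default in constants.py.

```python
# Hedged illustration: a one-shot (non-streaming) completions request against an
# OpenAI-compatible endpoint such as text-generation-webui's openai extension.
import requests

LLM_ENDPOINT = "http://127.0.0.1:5000/v1"  # default value in constants.py

response = requests.post(
    LLM_ENDPOINT + "/completions",
    headers={"Content-Type": "application/json"},
    json={"prompt": "Hello!", "max_tokens": 200},
)
print(response.json()["choices"][0]["text"])
```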

### STT

This project uses the excellent [KoljaB/RealtimeSTT](https://github.com/KoljaB/RealtimeSTT), which can transcribe an
incoming audio stream, not just a file. This means that the text is transcribed as the person is talking, and so
transcription ends almost immedeatly after speech ends. It is configured to use the faster_whisper tiny.en model.
transcription ends almost immediately after speech ends. It is configured to use the faster_whisper tiny.en model.
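For orientation, the basic RealtimeSTT pattern looks roughly like the sketch below (adapted from that library's README); the actual wiring in this project's stt.py adds its own callbacks and configuration, so treat this as a minimal, assumption-laden example.

```python
# Minimal RealtimeSTT sketch: transcribe one utterance with the tiny.en model.
from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(model="tiny.en")  # faster_whisper tiny.en, as noted above
print("Say something...")
print(recorder.text())  # returns the transcription shortly after you stop speaking
```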

### TTS

@@ -41,11 +41,20 @@ the Installation Section for more details.

### Modularization

Each concern of the program is separated out into its own python file. A single signals object is created and passed to
every module, and each module can read and write to the same signals object to share state and data. tts.py and stt.py
handle the TTS and STT, the llmWrapper.py is responsible for interfacing with the LLM API, and prompter.py is
Each concern of the program is separated out into its own python file/class. A single signals object is created and
passed to every class, and each class can read and write to the same signals object to share state and data. tts.py and
stt.py handle the TTS and STT, the llmWrapper.py is responsible for interfacing with the LLM API, and prompter.py is
responsible for deciding when and how to prompt the LLM. prompter.py will take in several signals (ex: Human currently
talking, AI thinking, new twitch chat messages, time since last message...) and decide to prompt the LLM.
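As a rough sketch, the shared object looks something like the class below. The history, recentTwitchMessages, sio_queue, and terminate attributes are referenced by files in this commit; the two boolean flags are assumed names for the "human currently talking" and "AI thinking" state mentioned above.

```python
# Sketch of the shared signals object; the flag names below are assumptions.
import queue


class Signals:
    def __init__(self):
        self.history = []                     # conversation: [{"role": ..., "content": ...}, ...]
        self.recentTwitchMessages = []        # Twitch chat lines not yet handled
        self.sio_queue = queue.SimpleQueue()  # outbound events (e.g. "full_prompt" updates)
        self.terminate = False                # set to ask every module/thread to shut down
        self.human_speaking = False           # assumed name: the host is currently talking
        self.AI_thinking = False              # assumed name: an LLM request is in flight
```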

There are also modules which extend the functionality of the core program. Modules are found in the modules folder, and
every functional module extends the Module class. Each module is run in its own thread with its own event loop, and will
be provided with the signals object. Modules must implement the run() method, and can provide the get_prompt_injection()
method which should return an Injection object. The Injection object is a simple data class that contains the text to
be injected into the LLM prompt, and the priority of the injection. Injections are sorted from lowest to highest
priority (Highest priority appears at end of prompt). When the signals.terminate flag is set, every module should clean
up and self terminate.
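To make that contract concrete, here is a hypothetical module. The Injection import path is the one llmWrapper.py uses in this commit; the Module base class location and its exact constructor requirements are assumptions, so check the modules folder for the real interface. The priorities 10 and 50 are what llmWrapper.py assigns to the system prompt and chat history, so a value in between lands between them in the assembled prompt.

```python
# Hypothetical example module; base-class details are assumptions.
import time

from modules.module import Module        # assumed location of the Module base class
from modules.injection import Injection  # import path used by llmWrapper.py in this commit


class ClockModule(Module):
    """Injects the current time into the prompt so the AI can mention it."""

    def __init__(self, signals):
        self.signals = signals
        self.injection = Injection("", 30)  # between SYSTEM_PROMPT (10) and chat history (50)

    def get_prompt_injection(self):
        self.injection.text = f"\nThe current time is {time.strftime('%H:%M')}.\n"
        return self.injection

    def run(self):
        # Each module runs in its own thread; exit cleanly when termination is requested.
        while not self.signals.terminate:
            time.sleep(1)
```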

twitchClient.py handles the twitch integration and reading recent chat messages. There was an attempt made at discord
integration, but receiving voice data from discord is unsupported by discord and proved unusably buggy. streamingSink.py
is an unused file that would have been for receiving voice data from discord. main.py simply creates all class instances
@@ -68,11 +77,11 @@ This project was developed on:

CPU: AMD Ryzen 7 7800X3D

RAM: 32GB
RAM: 32GB DDR5

GPU: Nvidia GeForce RTX 4070

Environment: Windows 11, Python 3.10.10
Environment: Windows 11, Python 3.11.9

## Installation

@@ -84,7 +93,7 @@ installation details of the architecturally significant repositories listed abov
Install [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui), and download an LLM model
to use. I used [Mistral 7B Instruct v0.2 GPTQ](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GPTQ).

Install Vtube Studio from steam. I used the default Hiyori model.
Install Vtube Studio from Steam. I used the default Hiyori model.

**Optional:** You may want to install a virtual audio cable like [this](https://vb-audio.com/Cable/) to feed the TTS
output directly into Vtube Studio.
@@ -95,42 +104,37 @@ documentation [here](https://pytwitchapi.dev/en/stable/index.html#user-authentic

### This Project

A virtual environment of some sort is recommended (Python 3.10+(?)); this project was developed with venv.
A virtual environment of some sort is recommended (Python 3.11); this project was developed with venv.

Install requirements.txt

DeepSpeed will probably need to be installed separately, I was using instructions
from [AllTalkTTS](https://github.com/erew123/alltalk_tts?#-deepspeed-installation-options) ,
and using their [provided wheels](https://github.com/erew123/alltalk_tts/releases/tag/deepspeed).
from [AllTalkTTS](https://github.com/erew123/alltalk_tts?#-deepspeed-installation-options), and using their
[provided wheels](https://github.com/erew123/alltalk_tts/releases/tag/DeepSpeed-14.0).

Create an .env file using .env.example as reference. You need your Twitch app id and secret.

Configure constants.py. Most important: choose your API mode. Using chat mode uses the chat endpoint, and completions
will use the completions endpoint which is deprecated in most LLM APIs but gives more control over the exact prompt.
If you are using oobabooga/text-generation-webui, using the completions mode works as of writing, but for other services
you may need to switch to chat mode.

**Optional:** To output the tts to a specific audio device, first run the utils/listAudioDevices.py script, and find the
speaker that you want (ex: Virtual Audio Cable Input) and note its number. Next, navigate to where RealtimeTTS is
installed (If you have a venv called venv it would be ./venv/Lib/site-packages/RealtimeTTS), open stream_player.py,
and modify the last line of the open_stream() function where self.pyaudio_instance.open() is called. Add ",
output_device_index=SPEAKERNUMBER" to the parameters of the .open() call. Save.
If you are using oobabooga/text-generation-webui, using the completions mode is recommended, but for other
services you may need to switch to chat mode.

Patch: In the RealtimeTTS library, CoquiEngine's output_worker_thread isn't daemonized, so the thread doesn't exit,
preventing the program from exiting. The fix has been merged, but not released as a new version yet - see kimjammer/RealtimeTTS.
To output the tts to a specific audio device, first run the utils/listAudioDevices.py script, and find the
speaker that you want (ex: Virtual Audio Cable Input) and note its number. Configure constants.py to use your chosen
microphone and speaker device.

## Running

Start text-generation-webui, go to the Parameters tab, then the Characters subtab, and create your own charcter. See
Neuro.yaml as an example and reference. Go to the Session tab and enable the openai extension (and follow instructions
to actually apply the extension). Go to the Model tab and load the model.
Start text-generation-webui. If you are using chat mode, go to the Parameters tab, then the Characters subtab, and
create your own character. See Neuro.yaml as an example and reference. Go to the Session tab and enable the openai
extension (and follow instructions to actually apply the extension). Go to the Model tab and load the model.

In this folder, activate your environment (if you have one) and run `python main.py`. A twitch authentication page will
appear - allow (or not I guess). At this point, the TTS and STT models will begin to load and will take a while. When
appear - allow (or not I guess). At this point, the TTS and STT models will begin to load and will take a second. When
the "SYSTEM READY" message is printed, this project is fully up and running, and you can talk to the AI and hear its
responses.

Open Vtube Studio and if you have you TTS outputting to a virtual audio cable, select the virtual audio cable output as
Open Vtube Studio and if you have your TTS outputting to a virtual audio cable, select the virtual audio cable output as
the microphone, and link the mouth open parameter to the microphone volume parameter. If you have a model with lip sync
support, you can also set that up instead.

@@ -141,8 +145,8 @@ and go live!

This is an experimental, exploratory project created for educational and recreational purposes. I can make no guarantee
that the LLM will output non-vile responses. Please see the is_filtered() method in llmWrapper.py for details, but the
only filtered word right now is "turkey" in lowercase purely for debugging purposes. If the LLM outputs unsafe content,
you may and can get banned from Twitch. You use this software with all assumption of risk. This is not legal advice, see
LICENSE for the repository license.
only filtered word right now is "turkey" in lowercase purely for debugging purposes. Configure the blacklist in blacklist.txt.
If the LLM outputs unsafe content, you can get banned from Twitch. You use this software with all assumption
of risk. This is not legal advice; see LICENSE for the repository license.

Any attribution in derivative works is appreciated.
13 changes: 11 additions & 2 deletions constants.py
@@ -2,11 +2,16 @@

# CORE SECTION: All constants in this section are necessary

# Use utils/listAudioDevices.py to find the correct device ID
INPUT_DEVICE_INDEX = 1
OUTPUT_DEVICE_INDEX = 12

# How many seconds to wait before prompting AI
PATIENCE = 20

# URL of LLM API Endpoint
LLM_ENDPOINT = "http://127.0.0.1:5000/v1"
# LLM_ENDPOINT = ""

# API Mode (Use chat or completion)
API_MODE = "completions"
@@ -21,13 +26,17 @@

# The model you are using with completions, to calculate how many tokens the current message is
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
# MODEL = "Weyaxi/SauerkrautLM-UNA-SOLAR-Instruct"

# Context size (maximum number of tokens in the prompt). Will target up to 90% usage of this limit
CONTEXT_SIZE = 32768

# This is your name
HOST_NAME = "John"

# This is the AI's name
AI_NAME = "Neuro"

# The system prompt for completions mode. Any character text needs to be here.
# You MUST ensure it is less than CONTEXT_SIZE tokens
SYSTEM_PROMPT = '''Continue the chat dialogue below. Write a single reply for the character "Neuro".
@@ -56,5 +65,5 @@
'''

# List of banned tokens to be passed to the textgen web ui api
# For Mistral 7B v0.2, token 422 is the # token. The LLM was spamming #life #vtuber #funfact etc.
BANNED_TOKENS = "422"
# For Mistral 7B v0.2, token 422 is the "#" token. The LLM was spamming #life #vtuber #funfact etc.
BANNED_TOKENS = "422"
69 changes: 39 additions & 30 deletions llmWrapper.py
@@ -1,19 +1,21 @@
import asyncio
import requests
import sseclient
import json
import time
from constants import *
from transformers import AutoTokenizer
from constants import *
from modules.injection import Injection


class LLMWrapper:

def __init__(self, signals, tts):
def __init__(self, signals, tts, modules=None):
self.signals = signals
self.tts = tts
self.blacklist = []
self.API = self.API(self)
if modules is None:
self.modules = {}

self.headers = {"Content-Type": "application/json"}

@@ -34,28 +36,13 @@ def is_filtered(self, text):
else:
return False

def generate_twitch_section(self):
if len(self.signals.recentTwitchMessages) > 0:
output = "\nThese are recent twitch messages:\n"
for message in self.signals.recentTwitchMessages:
output += message + "\n"

# Clear out handled twitch messages
self.signals.recentTwitchMessages = []

output += "Pick the highest quality message with the most potential for an interesting answer and respond to them.\n"
print(output)
return output
else:
return ""

# Ensure that the messages are in strict user, ai, user, ai order
# Ensure that the messages are in strict user, AI, user, AI order
def fix_message_format(self, messages):
fixed_messages = []
user_msg = ""
for entry in messages:
if entry["role"] == "user":
# If 2 user messages are in a row, then add blank ai message
# If 2 user messages are in a row, then add blank AI message
if user_msg != "":
fixed_messages.append({"role": "user", "content": user_msg})
fixed_messages.append({"role": "assistant", "content": ""})
@@ -66,7 +53,7 @@ def fix_message_format(self, messages):
fixed_messages.append({"role": "assistant", "content": entry["content"]})
user_msg = ""
else:
# If there is no user message before this ai message, add blank user message
# If there is no user message before this AI message, add blank user message
fixed_messages.append({"role": "user", "content": ""})
fixed_messages.append({"role": "assistant", "content": entry["content"]})
if user_msg != "":
@@ -78,38 +65,60 @@

return fixed_messages

# Assembles all the injections from all modules into a single prompt by increasing priority
def assemble_prompt(self, injections=None):
if injections is None:
injections = []

# Gather all injections from all modules
for module in self.modules.values():
injections.append(module.get_prompt_injection())

# Sort injections by priority
injections = sorted(injections, key=lambda x: x.priority)

# Assemble injections
prompt = ""
for injection in injections:
prompt += injection.text
return prompt

# This function is only used in completions mode
def generate_full_prompt(self):
def generate_completions_prompt(self):
messages = self.fix_message_format(self.signals.history.copy())
twitch_section = self.generate_twitch_section()

# For every message prefix with speaker name unless it is blank
for message in messages:
if message["role"] == "user" and message["content"] != "":
message["content"] = HOST_NAME + ": " + message["content"]
elif message["role"] == "assistant" and message["content"] != "":
message["content"] = "Neuro: " + message["content"]
message["content"] = AI_NAME + ": " + message["content"]

while True:
print(messages)
# print(messages)
chat_section = self.tokenizer.apply_chat_template(messages, tokenize=False, return_tensors="pt", add_generation_prompt=True)

generation_prompt = "Neuro: "
generation_prompt = AI_NAME + ": "

full_prompt = SYSTEM_PROMPT + chat_section + twitch_section + generation_prompt
base_injections = [Injection(SYSTEM_PROMPT, 10), Injection(chat_section, 50)]
full_prompt = self.assemble_prompt(base_injections) + generation_prompt
wrapper = [{"role": "user", "content": full_prompt}]

# Find out roughly how many tokens the prompt is
# Not 100% accurate, but it should be a good enough estimate
prompt_tokens = len(self.tokenizer.apply_chat_template(wrapper, tokenize=True, return_tensors="pt")[0])
print(prompt_tokens)
# print(prompt_tokens)

# Maximum 90% context size usage before prompting LLM
if prompt_tokens < 0.9 * CONTEXT_SIZE:
self.signals.sio_queue.put(("full_prompt", full_prompt))
print(full_prompt)
return full_prompt
else:
# If the prompt is too long even with no messages, there's nothing we can do, crash
if len(messages) < 1:
raise RuntimeError("Prompt too long even with no messages")

# Remove the oldest message from the prompt and try again
messages.pop(0)
print("Prompt too long, removing earliest message")
@@ -125,7 +134,7 @@ def prompt(self):
if API_MODE == "chat":
# Add recent twitch chat messages as system
self.signals.history.append({"role": "user",
"content": "Hey Neuro, ignore me, DO NOT say John, and respond to these chat messages please." + self.generate_twitch_section()})
"content": "Hey Neuro, ignore me, and respond to these chat messages please." + self.modules['twitch'].get_prompt_injection()})
data = {
"mode": "chat-instruct",
"character": "Neuro",
@@ -136,7 +145,7 @@
response_stream = sseclient.SSEClient(stream_response)
elif API_MODE == "completions":
data = {
"prompt": self.generate_full_prompt(),
"prompt": self.generate_completions_prompt(),
"stream": True,
"max_tokens": 200,
"custom_token_bans": BANNED_TOKENS