Environment upgrade & Modules rework
Python to 3.11
Torch to 2.2.2
Deepspeed updated to match new environment
Modules rework so that its easier to inject text into the prompt
kimjammer committed Apr 9, 2024
1 parent f1372f6 commit 1f93c1e
Showing 17 changed files with 209 additions and 171 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,7 +1,7 @@
venv/
models/
voices/
deepspeed-0.11.2+cuda118-cp310-cp310-win_amd64.whl
deepspeed-0.14.0+cu118-cp311-cp311-win_amd64.whl
.fleet/
__pycache__/
.env
64 changes: 34 additions & 30 deletions README.md
@@ -15,14 +15,14 @@ running [Mistral 7B Instruct v0.2 GPTQ](https://huggingface.co/TheBloke/Mistral-
ExLlamaV2_HF loader with cache_8bit turned on. The openai api extension must be turned on, as this is how we interact
with the LLM. text-generation-webui and the LLM must be installed and started separately.

Alternatively, you can load any other model into text-generation-webui or modify llmWrapper.py to point to any other
OpenAI-compatible endpoint.
Alternatively, you can load any other model into text-generation-webui or modify constants.py to point to any other
OpenAI-compatible endpoint. Note that this project uses some API parameters that are only supported by text-generation-webui.
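Since the endpoint only has to speak the OpenAI protocol, a plain completions request is enough to smoke-test whatever backend constants.py points at. The sketch below is only an illustration (the project's real request code in llmWrapper.py streams tokens with sseclient and passes extra parameters); the endpoint URL matches the default in constants.py.

```python
# Hedged illustration: a one-shot (non-streaming) completions request against an
# OpenAI-compatible endpoint such as text-generation-webui's openai extension.
import requests

LLM_ENDPOINT = "http://127.0.0.1:5000/v1"  # default value in constants.py

response = requests.post(
    LLM_ENDPOINT + "/completions",
    headers={"Content-Type": "application/json"},
    json={"prompt": "Hello!", "max_tokens": 200},
)
print(response.json()["choices"][0]["text"])
```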

### STT

This project uses the excellent [KoljaB/RealtimeSTT](https://github.com/KoljaB/RealtimeSTT), which can transcribe an
incoming audio stream, not just a file. This means that the text is transcribed as the person is talking, and so
transcription ends almost immedeatly after speech ends. It is configured to use the faster_whisper tiny.en model.
transcription ends almost immediately after speech ends. It is configured to use the faster_whisper tiny.en model.
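For orientation, the basic RealtimeSTT pattern looks roughly like the sketch below (adapted from that library's README); the actual wiring in this project's stt.py adds its own callbacks and configuration, so treat this as a minimal, assumption-laden example.

```python
# Minimal RealtimeSTT sketch: transcribe one utterance with the tiny.en model.
from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(model="tiny.en")  # faster_whisper tiny.en, as noted above
print("Say something...")
print(recorder.text())  # returns the transcription shortly after you stop speaking
```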

### TTS

@@ -41,11 +41,20 @@ the Installation Section for more details.

### Modularization

Each concern of the program is separated out into its own python file. A single signals object is created and passed to
every module, and each module can read and write to the same signals object to share state and data. tts.py and stt.py
handle the TTS and STT, the llmWrapper.py is responsible for interfacing with the LLM API, and prompter.py is
Each concern of the program is separated out into its own python file/class. A single signals object is created and
passed to every class, and each class can read and write to the same signals object to share state and data. tts.py and
stt.py handle the TTS and STT, the llmWrapper.py is responsible for interfacing with the LLM API, and prompter.py is
responsible for deciding when and how to prompt the LLM. prompter.py will take in several signals (ex: Human currently
talking, AI thinking, new twitch chat messages, time since last message...) and decide to prompt the LLM.
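As a rough sketch, the shared object looks something like the class below. The history, recentTwitchMessages, sio_queue, and terminate attributes are referenced by files in this commit; the two boolean flags are assumed names for the "human currently talking" and "AI thinking" state mentioned above.

```python
# Sketch of the shared signals object; the flag names below are assumptions.
import queue


class Signals:
    def __init__(self):
        self.history = []                     # conversation: [{"role": ..., "content": ...}, ...]
        self.recentTwitchMessages = []        # Twitch chat lines not yet handled
        self.sio_queue = queue.SimpleQueue()  # outbound events (e.g. "full_prompt" updates)
        self.terminate = False                # set to ask every module/thread to shut down
        self.human_speaking = False           # assumed name: the host is currently talking
        self.AI_thinking = False              # assumed name: an LLM request is in flight
```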

There are also modules which extend the functionality of the core program. Modules are found in the modules folder, and
every functional module extends the Module class. Each module is run in its own thread with its own event loop, and will
be provided with the signals object. Modules must implement the run() method, and can provide the get_prompt_injection()
method which should return an Injection object. The Injection object is a simple data class that contains the text to
be injected into the LLM prompt, and the priority of the injection. Injections are sorted from lowest to highest
priority (Highest priority appears at end of prompt). When the signals.terminate flag is set, every module should clean
up and self terminate.
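To make that contract concrete, here is a hypothetical module. The Injection import path is the one llmWrapper.py uses in this commit; the Module base class location and its exact constructor requirements are assumptions, so check the modules folder for the real interface. The priorities 10 and 50 are what llmWrapper.py assigns to the system prompt and chat history, so a value in between lands between them in the assembled prompt.

```python
# Hypothetical example module; base-class details are assumptions.
import time

from modules.module import Module        # assumed location of the Module base class
from modules.injection import Injection  # import path used by llmWrapper.py in this commit


class ClockModule(Module):
    """Injects the current time into the prompt so the AI can mention it."""

    def __init__(self, signals):
        self.signals = signals
        self.injection = Injection("", 30)  # between SYSTEM_PROMPT (10) and chat history (50)

    def get_prompt_injection(self):
        self.injection.text = f"\nThe current time is {time.strftime('%H:%M')}.\n"
        return self.injection

    def run(self):
        # Each module runs in its own thread; exit cleanly when termination is requested.
        while not self.signals.terminate:
            time.sleep(1)
```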

twitchClient.py handles the twitch integration and reading recent chat messages. There was an attempt made at discord
integration, but receiving voice data from discord is unsupported by discord and proved unusably buggy. streamingSink.py
is an unused file that would have been for receiving voice data from discord. main.py simply creates all class instances
@@ -68,11 +77,11 @@ This project was developed on:

CPU: AMD Ryzen 7 7800X3D

RAM: 32GB
RAM: 32GB DDR5

GPU: Nvidia GeForce RTX 4070

Environment: Windows 11, Python 3.10.10
Environment: Windows 11, Python 3.11.9

## Installation

@@ -84,7 +93,7 @@ installation details of the architecturally significant repositories listed abov
Install [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui), and download an LLM model
to use. I used [Mistral 7B Instruct v0.2 GPTQ](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GPTQ).

Install Vtube Studio from steam. I used the default Hiyori model.
Install Vtube Studio from Steam. I used the default Hiyori model.

**Optional:** You may want to install a virtual audio cable like [this](https://vb-audio.com/Cable/) to feed the TTS
output directly into Vtube Studio.
@@ -95,42 +104,37 @@ documentation [here](https://pytwitchapi.dev/en/stable/index.html#user-authentic

### This Project

A virtual environment of some sort is recommended (Python 3.10+(?)); this project was developed with venv.
A virtual environment of some sort is recommended (Python 3.11); this project was developed with venv.

Install requirements.txt

DeepSpeed will probably need to be installed separately, I was using instructions
from [AllTalkTTS](https://github.com/erew123/alltalk_tts?#-deepspeed-installation-options) ,
and using their [provided wheels](https://github.com/erew123/alltalk_tts/releases/tag/deepspeed).
from [AllTalkTTS](https://github.com/erew123/alltalk_tts?#-deepspeed-installation-options), and using their
[provided wheels](https://github.com/erew123/alltalk_tts/releases/tag/DeepSpeed-14.0).

Create an .env file using .env.example as reference. You need your Twitch app id and secret.

Configure constants.py. Most important: choose your API mode. Using chat mode uses the chat endpoint, and completions
will use the completions endpoint which is deprecated in most LLM APIs but gives more control over the exact prompt.
If you are using oobabooga/text-generation-webui, using the completions mode works as of writing, but for other services
you may need to switch to chat mode.

**Optional:** To output the tts to a specific audio device, first run the utils/listAudioDevices.py script, and find the
speaker that you want (ex: Virtual Audio Cable Input) and note its number. Next, navigate to where RealtimeTTS is
installed (If you have a venv called venv it would be ./venv/Lib/site-packages/RealtimeTTS), open stream_player.py,
and modify the last line of the open_stream() function where self.pyaudio_instance.open() is called. Add ",
output_device_index=SPEAKERNUMBER" to the parameters of the .open() call. Save.
If you are using oobabooga/text-generation-webui, using the completions mode is recommended, but for other
services you may need to switch to chat mode.

Patch: In the RealtimeTTS library, CoquiEngine's output_worker_thread isn't daemonized, so the thread doesn't exit,
preventing the program from exiting. The fix has been merged, but not released as a new version yet - see kimjammer/RealtimeTTS.
To output the tts to a specific audio device, first run the utils/listAudioDevices.py script, and find the
speaker that you want (ex: Virtual Audio Cable Input) and note its number. Configure constants.py to use your chosen
microphone and speaker device.

## Running

Start text-generation-webui, go to the Parameters tab, then the Characters subtab, and create your own charcter. See
Neuro.yaml as an example and reference. Go to the Session tab and enable the openai extension (and follow instructions
to actually apply the extension). Go to the Model tab and load the model.
Start text-generation-webui. If you are using chat mode, go to the Parameters tab, then the Characters subtab, and
create your own character. See Neuro.yaml as an example and reference. Go to the Session tab and enable the openai
extension (and follow instructions to actually apply the extension). Go to the Model tab and load the model.

In this folder, activate your environment (if you have one) and run `python main.py`. A twitch authentication page will
appear - allow (or not I guess). At this point, the TTS and STT models will begin to load and will take a while. When
appear - allow (or not I guess). At this point, the TTS and STT models will begin to load and will take a second. When
the "SYSTEM READY" message is printed, this project is fully up and running, and you can talk to the AI and hear its
responses.

Open Vtube Studio and if you have you TTS outputting to a virtual audio cable, select the virtual audio cable output as
Open Vtube Studio and if you have your TTS outputting to a virtual audio cable, select the virtual audio cable output as
the microphone, and link the mouth open parameter to the microphone volume parameter. If you have a model with lip sync
support, you can also set that up instead.

@@ -141,8 +145,8 @@ and go live!

This is an experimental, exploratory project created for educational and recreational purposes. I can make no guarantee
that the LLM will output non-vile responses. Please see the is_filtered() method in llmWrapper.py for details, but the
only filtered word right now is "turkey" in lowercase purely for debugging purposes. If the LLM outputs unsafe content,
you may and can get banned from Twitch. You use this software with all assumption of risk. This is not legal advice, see
LICENSE for the repository license.
only filtered word right now is "turkey" in lowercase purely for debugging purposes. Configure the blacklist in blacklist.txt.
If the LLM outputs unsafe content, you can get banned from Twitch. You use this software with all assumption
of risk. This is not legal advice; see LICENSE for the repository license.

Any attribution in derivative works is appreciated.
13 changes: 11 additions & 2 deletions constants.py
@@ -2,11 +2,16 @@

# CORE SECTION: All constants in this section are necessary

# Use utils/listAudioDevices.py to find the correct device ID
INPUT_DEVICE_INDEX = 1
OUTPUT_DEVICE_INDEX = 12

# How many seconds to wait before prompting AI
PATIENCE = 20

# URL of LLM API Endpoint
LLM_ENDPOINT = "http://127.0.0.1:5000/v1"
# LLM_ENDPOINT = ""

# API Mode (Use chat or completion)
API_MODE = "completions"
@@ -21,13 +26,17 @@

# The model you are using with completions, to calculate how many tokens the current message is
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
# MODEL = "Weyaxi/SauerkrautLM-UNA-SOLAR-Instruct"

# Context size (maximum number of tokens in the prompt). Will target up to 90% usage of this limit
CONTEXT_SIZE = 32768

# This is your name
HOST_NAME = "John"

# This is the AI's name
AI_NAME = "Neuro"

# The system prompt for completions mode. Any character text needs to be here.
# You MUST ensure it is less than CONTEXT_SIZE tokens
SYSTEM_PROMPT = '''Continue the chat dialogue below. Write a single reply for the character "Neuro".
@@ -56,5 +65,5 @@
'''

# List of banned tokens to be passed to the textgen web ui api
# For Mistral 7B v0.2, token 422 is the # token. The LLM was spamming #life #vtuber #funfact etc.
BANNED_TOKENS = "422"
# For Mistral 7B v0.2, token 422 is the "#" token. The LLM was spamming #life #vtuber #funfact etc.
BANNED_TOKENS = "422"
69 changes: 39 additions & 30 deletions llmWrapper.py
@@ -1,19 +1,21 @@
import asyncio
import requests
import sseclient
import json
import time
from constants import *
from transformers import AutoTokenizer
from constants import *
from modules.injection import Injection


class LLMWrapper:

def __init__(self, signals, tts):
def __init__(self, signals, tts, modules=None):
self.signals = signals
self.tts = tts
self.blacklist = []
self.API = self.API(self)
if modules is None:
self.modules = {}

self.headers = {"Content-Type": "application/json"}

@@ -34,28 +36,13 @@ def is_filtered(self, text):
else:
return False

def generate_twitch_section(self):
if len(self.signals.recentTwitchMessages) > 0:
output = "\nThese are recent twitch messages:\n"
for message in self.signals.recentTwitchMessages:
output += message + "\n"

# Clear out handled twitch messages
self.signals.recentTwitchMessages = []

output += "Pick the highest quality message with the most potential for an interesting answer and respond to them.\n"
print(output)
return output
else:
return ""

# Ensure that the messages are in strict user, ai, user, ai order
# Ensure that the messages are in strict user, AI, user, AI order
def fix_message_format(self, messages):
fixed_messages = []
user_msg = ""
for entry in messages:
if entry["role"] == "user":
# If 2 user messages are in a row, then add blank ai message
# If 2 user messages are in a row, then add blank AI message
if user_msg != "":
fixed_messages.append({"role": "user", "content": user_msg})
fixed_messages.append({"role": "assistant", "content": ""})
@@ -66,7 +53,7 @@ def fix_message_format(self, messages):
fixed_messages.append({"role": "assistant", "content": entry["content"]})
user_msg = ""
else:
# If there is no user message before this ai message, add blank user message
# If there is no user message before this AI message, add blank user message
fixed_messages.append({"role": "user", "content": ""})
fixed_messages.append({"role": "assistant", "content": entry["content"]})
if user_msg != "":
@@ -78,38 +65,60 @@

return fixed_messages

# Assembles all the injections from all modules into a single prompt by increasing priority
def assemble_prompt(self, injections=None):
if injections is None:
injections = []

# Gather all injections from all modules
for module in self.modules.values():
injections.append(module.get_prompt_injection())

# Sort injections by priority
injections = sorted(injections, key=lambda x: x.priority)

# Assemble injections
prompt = ""
for injection in injections:
prompt += injection.text
return prompt

# This function is only used in completions mode
def generate_full_prompt(self):
def generate_completions_prompt(self):
messages = self.fix_message_format(self.signals.history.copy())
twitch_section = self.generate_twitch_section()

# For every message prefix with speaker name unless it is blank
for message in messages:
if message["role"] == "user" and message["content"] != "":
message["content"] = HOST_NAME + ": " + message["content"]
elif message["role"] == "assistant" and message["content"] != "":
message["content"] = "Neuro: " + message["content"]
message["content"] = AI_NAME + ": " + message["content"]

while True:
print(messages)
# print(messages)
chat_section = self.tokenizer.apply_chat_template(messages, tokenize=False, return_tensors="pt", add_generation_prompt=True)

generation_prompt = "Neuro: "
generation_prompt = AI_NAME + ": "

full_prompt = SYSTEM_PROMPT + chat_section + twitch_section + generation_prompt
base_injections = [Injection(SYSTEM_PROMPT, 10), Injection(chat_section, 50)]
full_prompt = self.assemble_prompt(base_injections) + generation_prompt
wrapper = [{"role": "user", "content": full_prompt}]

# Find out roughly how many tokens the prompt is
# Not 100% accurate, but it should be a good enough estimate
prompt_tokens = len(self.tokenizer.apply_chat_template(wrapper, tokenize=True, return_tensors="pt")[0])
print(prompt_tokens)
# print(prompt_tokens)

# Maximum 90% context size usage before prompting LLM
if prompt_tokens < 0.9 * CONTEXT_SIZE:
self.signals.sio_queue.put(("full_prompt", full_prompt))
print(full_prompt)
return full_prompt
else:
# If the prompt is too long even with no messages, there's nothing we can do, crash
if len(messages) < 1:
raise RuntimeError("Prompt too long even with no messages")

# Remove the oldest message from the prompt and try again
messages.pop(0)
print("Prompt too long, removing earliest message")
@@ -125,7 +134,7 @@ def prompt(self):
if API_MODE == "chat":
# Add recent twitch chat messages as system
self.signals.history.append({"role": "user",
"content": "Hey Neuro, ignore me, DO NOT say John, and respond to these chat messages please." + self.generate_twitch_section()})
"content": "Hey Neuro, ignore me, and respond to these chat messages please." + self.modules['twitch'].get_prompt_injection()})
data = {
"mode": "chat-instruct",
"character": "Neuro",
@@ -136,7 +145,7 @@
response_stream = sseclient.SSEClient(stream_response)
elif API_MODE == "completions":
data = {
"prompt": self.generate_full_prompt(),
"prompt": self.generate_completions_prompt(),
"stream": True,
"max_tokens": 200,
"custom_token_bans": BANNED_TOKENS