feat: add vllm multi-gpu inference support (agiresearch#146)
* feat: add function calling for open-sourced models (native huggingface, vllm, ollama)

feat: add function calling for open-sourced models (native huggingface, vllm, ollama) (agiresearch#144)

* fix: add parser for native huggingface llm function calling

* fix: address issues of short generation outputs of vllm

* feat: add vllm multi-gpu inference support

* docs: update README.md
dongyuanjushi authored Jun 26, 2024
1 parent ab166a7 commit 0922d94
Showing 2 changed files with 24 additions and 23 deletions.
36 changes: 19 additions & 17 deletions README.md
@@ -18,13 +18,12 @@ The objective of AIOS is to provide the LLM kernel which will be an abstraction
At the present moment, AIOS is a userspace wrapper around the current kernel. However, this is subject to change as outlined in the [Q4 Goals and Objectives](https://github.com/agiresearch/AIOS/issues/127).

## 📰 News
- **[2024-06-20]** 🔥 Function calling for open-sourced LLMs (native huggingface, vllm, ollama) is supported.
- **[2024-05-20]** 🚀 More agents with ChatGPT-based tool calling are added (i.e., MathAgent, RecAgent, TravelAgent, AcademicAgent and CreationAgent); their profiles and workflows can be found in [OpenAGI](https://github.com/agiresearch/OpenAGI).
- **[2024-05-13]** 🛠️ Local models (diffusion models) as tools from HuggingFace are integrated.
- **[2024-05-01]** 🛠️ The agent creation in AIOS is refactored, which can be found in our [OpenAGI](https://github.com/agiresearch/OpenAGI) package.
- **[2024-04-29]** 📊 The evaluation mode of AIOS is added, which supports customizable agent types and the number of agent instances per type.
- **[2024-04-14]** 🚀 AIOS currently supports generation interrupt (for open-sourced LLMs from huggingface) and customized console loggers.
- **[2024-04-05]** 🛠️ AIOS codebase has been updated to add shell simulator, rapid API calls, and pre-commit test cases. Please see [CONTRIBUTE](https://github.com/agiresearch/AIOS/blob/main/CONTRIBUTE.md) for how to test your contributions and create pull requests.
- **[2024-04-02]** 🤝 AIOS [Discord Community](https://discord.gg/B2HFxEgTJX) is up. Welcome to join the community for discussions, brainstorming, development, or just random chats!
- **[2024-04-05]** 🛠️ AIOS currently supports external tool callings (google search, wolframalpha, rapid API, etc).
- **[2024-04-02]** 🤝 AIOS [Discord Community](https://discord.gg/B2HFxEgTJX) is up. Welcome to join the community for discussions, brainstorming, development, or just random chats! For how to contribute to AIOS, please see [CONTRIBUTE](https://github.com/agiresearch/AIOS/blob/main/CONTRIBUTE.md).
- **[2024-03-25]** ✈️ Our paper [AIOS: LLM Agent Operating System](https://arxiv.org/abs/2403.16971) is released and AIOS repository is officially launched!
- **[2023-12-06]** 📋 After several months of work, our perspective paper [LLM as OS, Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem](https://arxiv.org/abs/2312.03815) is officially released.

@@ -40,8 +39,6 @@ At the present moment, AIOS is a userspace wrapper around the current kernel. Ho
At a minimum, we recommend an Nvidia GPU with 4 GB of memory or an ARM-based MacBook. AIOS should be able to run on machines with less capable hardware, but task completion time will increase greatly. If you notice a large delay in execution, you can try an API-based model, such as gpt (paid) or gemini (free).

### Installation
To run AIOS, you will need to install our agent creation package, [OpenAGI](https://github.com/agiresearch/OpenAGI).

**Git clone AIOS**
```bash
git clone https://github.com/agiresearch/AIOS.git
@@ -92,8 +89,13 @@ For open-sourced LLMs, you need to setup the name of the LLM you would like to u
```bash
# For open-sourced LLMs
python main.py --llm_name <llm_name> --max_gpu_memory <max_gpu_memory> --eval_device <eval_device> --max_new_tokens <max_new_tokens>
## Use google/gemma-1.1-2b-it for example
python main.py --llm_name google/gemma-1.1-2b-it --max_gpu_memory '{"0": "24GB"}' --eval_device "cuda:0" --max_new_tokens 256
## Use meta-llama/Meta-Llama-3-8B-Instruct for example
python main.py --llm_name meta-llama/Meta-Llama-3-8B-Instruct --max_gpu_memory '{"0": "48GB"}' --eval_device "cuda:0" --max_new_tokens 256
```
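Here, `--max_gpu_memory` is a JSON map from GPU index to the memory budget on that GPU, and `--eval_device` selects where inference runs. The following is a hypothetical sketch of how such flags are commonly wired into a native HuggingFace loader; the actual AIOS backend lives in `src/llm_kernel/llm_classes/` and may differ:
```python
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical wiring of the CLI flags into transformers; the variable names
# below mirror the flags, not AIOS internals.
llm_name = "meta-llama/Meta-Llama-3-8B-Instruct"
max_gpu_memory = {int(k): v for k, v in json.loads('{"0": "48GB"}').items()}
eval_device = "cuda:0"

model = AutoModelForCausalLM.from_pretrained(
    llm_name,
    device_map="auto",          # let transformers place layers on the listed GPUs
    max_memory=max_gpu_memory,  # per-GPU budget from --max_gpu_memory
)
tokenizer = AutoTokenizer.from_pretrained(llm_name)

inputs = tokenizer("What is 2 + 2?", return_tensors="pt").to(eval_device)
outputs = model.generate(**inputs, max_new_tokens=256)  # --max_new_tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```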
For inference acceleration, you can also use vLLM as the backend.
```bash
## Use meta-llama/Meta-Llama-3-8B-Instruct for example
CUDA_VISIBLE_DEVICES=0,1 python main.py --llm_name meta-llama/Meta-Llama-3-8B-Instruct --use_backend vllm --max_gpu_memory '{"0": "24GB", "1": "24GB"}' --eval_device "cuda:0" --max_new_tokens 256
```
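The number of GPUs listed in `--max_gpu_memory` is what drives multi-GPU inference here: as the `vllm.py` change below shows, the backend takes the keys of that mapping and passes their count to vLLM as `tensor_parallel_size`. A minimal sketch of that derivation, assuming the CLI value is parsed as plain JSON (the backend uses its own `convert_map` helper):
```python
import json

# Sketch of the GPU-count derivation added in this commit (see vllm.py below):
# every GPU listed in --max_gpu_memory becomes one tensor-parallel shard.
max_gpu_memory = json.loads('{"0": "24GB", "1": "24GB"}')  # value from the CLI
available_gpus = list(max_gpu_memory.keys())               # ["0", "1"]
tensor_parallel_size = len(available_gpus)                 # split the model across 2 GPUs
```
Note that `CUDA_VISIBLE_DEVICES` must expose at least as many GPUs as `--max_gpu_memory` lists, since vLLM needs one visible device per tensor-parallel shard.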
For close-sourced LLMs, you just need to specify the name of the LLM.
```bash
@@ -108,20 +110,20 @@ bash scripts/run/gpt4.sh
```
You can use an open-source model on an Apple MacBook with Ollama. First, you will need to pull the model. Let's use llama3 as an example:
```bash
ollama pull llama3
ollama pull llama3:8b
```
Then, you can run the Python script with the `--llm_name` parameter to start using AIOS with Llama 3 and Ollama on your MacBook:
```bash
python main.py --llm_name ollama/llama3
python main.py --llm_name ollama/llama3:8b
```
#### Deployment Mode
In the deployment mode, the outputs of running agents are stored in files, and you are provided with multiple commands to run agents and view their resource usage (e.g., `run <xxxAgent>: <YOUR TASK>`, `print agent`).
Different from the interactive mode, you need to set all the default loggers to file loggers.
```bash
# For open-sourced LLMs
python simulator.py --llm_name <llm_name> --max_gpu_memory <max_gpu_memory> --eval_device <eval_device> --max_new_tokens <max_new_tokens> --scheduler_log_mode file --agent_log_mode file --llm_kernel_log_mode file
## Use google/gemma-1.1-2b-it for example
python simulator.py --llm_name google/gemma-1.1-2b-it --max_gpu_memory '{"0": "24GB"}' --eval_device "cuda:0" --max_new_tokens 256 --scheduler_log_mode file --agent_log_mode file --llm_kernel_log_mode file
## Use meta-llama/Meta-Llama-3-8B-Instruct for example
python simulator.py --llm_name meta-llama/Meta-Llama-3-8B-Instruct --max_gpu_memory '{"0": "24GB"}' --eval_device "cuda:0" --max_new_tokens 256 --scheduler_log_mode file --agent_log_mode file --llm_kernel_log_mode file
```
```bash
# For close-sourced LLMs
@@ -162,15 +164,15 @@ python eval.py --llm_name gpt-4 --agents MathAgent:1,TravelAgent:1,RecAgent:1,Ac
You could also run the models locally:
```bash
python eval.py --llm_name google/gemma-1.1-2b-it --max_gpu_memory '{"0": "24GB"}' --eval_device "cuda:0" --max_new_tokens 256 --agents MathAgent:1,TravelAgent:1 --mode concurrent-only
python eval.py --llm_name meta-llama/Meta-Llama-3-8B-Instruct --max_gpu_memory '{"0": "24GB"}' --eval_device "cuda:0" --max_new_tokens 256 --agents MathAgent:1,TravelAgent:1 --mode concurrent-only
```
### Supported LLM backbones
- gpt-3.5-turbo, gpt-4 gpt-4o
- gemini-pro
- ollama models (macbook)
- gpt-3.5-turbo, gpt-4, gpt-4o
- gemini-1.0-pro
- ollama
- claude3
- open-sourced LLM from Huggingface
- open-sourced LLMs from huggingface (native, vllm)
## 🖋️ References
11 changes: 5 additions & 6 deletions src/llm_kernel/llm_classes/vllm.py
@@ -12,6 +12,8 @@
from transformers import AutoTokenizer

import json

import os
class vLLM(BaseLLMKernel):

    def __init__(self, llm_name: str,
@@ -28,10 +30,8 @@ def __init__(self, llm_name: str,

    def load_llm_and_tokenizer(self) -> None:
        """ fetch the model from huggingface and run it """
        self.max_gpu_memory = self.convert_map(self.max_gpu_memory)

        # available_gpu_nums =
        # self.auth_token = get_from_env("HF_AUTH_TOKENS")
        self.available_gpus = list(self.max_gpu_memory.keys())
        self.gpu_nums = len(self.available_gpus)
        try:
            import vllm
        except ImportError:
@@ -44,11 +44,10 @@ def load_llm_and_tokenizer(self) -> None:
        self.model = vllm.LLM(
            model = self.model_name,
            download_dir = get_from_env("HF_HOME"),
            # dtype = 'float32'
            tensor_parallel_size = self.gpu_nums
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            # use_auth_token = self.auth_token
        )
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

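The diff above only covers model loading. The commit message also mentions fixing short generation outputs from vLLM, which is typically controlled through `SamplingParams.max_tokens` (vLLM's default is small, so responses get cut off unless it is set explicitly). A hypothetical sketch of how the kernel's generation step could pass `--max_new_tokens` through; the actual generation code is not part of this diff:
```python
import vllm

# Hypothetical generation helper (not shown in this diff): cap the response
# length with SamplingParams.max_tokens so vLLM does not stop after its
# small default number of tokens.
def generate(model: "vllm.LLM", prompt: str, max_new_tokens: int = 256) -> str:
    sampling_params = vllm.SamplingParams(
        temperature=0.0,            # deterministic decoding (illustrative choice)
        max_tokens=max_new_tokens,  # mirrors the --max_new_tokens CLI flag
    )
    outputs = model.generate([prompt], sampling_params)
    return outputs[0].outputs[0].text
```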
