Fixes for running on T4 GPU #3

Open · DmitryKey wants to merge 3 commits into main
Conversation

DmitryKey

Hello,

Thanks for open-sourcing rStar!

I gave it a spin on Google Colab with a T4 GPU (15 GB of GPU memory) and had to make several changes, which are included in this PR. Hopefully they will be useful for anyone else trying to run the code.

The following changes also had to be made to fit within the hardware limits, but I'm not committing them to the code, since specific hardware may vary:

  1. scripts/run_gsm8k_generator.sh: add the parameter --half_precision so that the file looks like this:
CUDA_VISIBLE_DEVICES=0 python run_src/do_generate.py \
    --dataset_name GSM8K \
    --test_json_filename test_all \
    --model_ckpt mistralai/Mistral-7B-v0.1 \
    --note default \
    --num_rollouts 16 \
    --half_precision
  2. In models/vLLM_API.py, change the swap space from 16 to 4 to avoid the "Too large swap space" error: in the LLM constructor, add swap_space=4, so that the full file looks like this (a rough sizing sketch for other machines follows this list):
# Licensed under the MIT license.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import numpy as np
import math


def load_vLLM_model(model_ckpt, seed, tensor_parallel_size=1, half_precision=False, max_num_seqs=256):
    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

    if half_precision:
        llm = LLM(
            model=model_ckpt,
            dtype="half",
            tensor_parallel_size=tensor_parallel_size,
            seed=seed,
            trust_remote_code=True,
            max_num_seqs=max_num_seqs,
            swap_space=4,  # reduced from 16 to avoid "Too large swap space" on the Colab T4 runtime
        )
    else:
        llm = LLM(
            model=model_ckpt,
            tensor_parallel_size=tensor_parallel_size,
            seed=seed,
            trust_remote_code=True,
            max_num_seqs=max_num_seqs,
            swap_space=4,  # reduced from 16 to avoid "Too large swap space" on the Colab T4 runtime
        )

    return tokenizer, llm


def generate_with_vLLM_model(
    model,
    input,
    temperature=0.8,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
    n=1,
    max_tokens=256,
    logprobs=1,
    stop=[],
):
    sampling_params = SamplingParams(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
        n=n,
        logprobs=logprobs,
        max_tokens=max_tokens,
        stop=stop,
    )

    output = model.generate(input, sampling_params, use_tqdm=False)
    return output


if __name__ == "__main__":
    model_ckpt = "mistralai/Mistral-7B-v0.1"
    tokenizer, model = load_vLLM_model(model_ckpt, seed=42, tensor_parallel_size=1, half_precision=False)
    input = "What is the meaning of life?"
    output = generate_with_vLLM_model(model, input)
    breakpoint()
    print(output[0].outputs[0].text)
  3. Export the following variable to help with CUDA running out of memory (the error in the next point suggests the form expandable_segments:True; see also the note after this list on making it take effect in a notebook):
!export PYTORCH_CUDA_ALLOC_CONF=expandable_segments
  4. The full run of the script still fails on the T4 because of the lack of GPU memory (a sketch of further memory-saving knobs follows this list):
Loading model weights took 13.4966 GB
...
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 241.06 MiB is free. Process 122360 has 14.51 GiB memory in use. Of the allocated memory 14.38 GiB is allocated by PyTorch, and 4.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
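A rough aid for point 2 on other machines: the sketch below is my addition, not part of the repo. It uses psutil (already a vLLM dependency) to suggest a swap_space value that stays under the host's currently free RAM; the function name and the 50% safety margin are my own choices.

# Sketch: pick a swap_space value (GiB of CPU swap per GPU) that fits the host's free RAM.
import psutil

def suggest_swap_space(max_gib: int = 16, safety_fraction: float = 0.5) -> int:
    """Largest whole-GiB value that is at most `safety_fraction` of the free host RAM."""
    free_gib = psutil.virtual_memory().available / (1024 ** 3)
    return max(1, min(max_gib, int(free_gib * safety_fraction)))

print(suggest_swap_space())  # on a Colab T4 runtime this tends to land around 4-6 GiB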
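A note on point 3: in a Colab notebook, !export runs in a throwaway child shell, so the variable never reaches the Python process that loads the model. Setting it from Python before torch/vLLM is imported (or via %env) is more reliable, and the error message itself suggests the expandable_segments:True form:

# Set the allocator config before torch or vLLM is imported in the same process.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

When launching the scripts from a shell instead, prefixing the command has the same effect, e.g. PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=0 python run_src/do_generate.py ...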
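On point 4: the model weights alone take ~13.5 GB per the log, which leaves almost nothing of the T4's 14.75 GiB for the KV cache, so the OOM is expected. For anyone who wants to squeeze further, vLLM's LLM constructor also accepts max_model_len and gpu_memory_utilization, which change the KV-cache budget. The values below are illustrative and untested on my side, not a change proposed for the repo:

# Sketch only: extra vLLM knobs that trade context length / batch size for memory.
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    dtype="half",
    swap_space=4,
    seed=42,
    trust_remote_code=True,
    max_num_seqs=64,              # fewer concurrent sequences -> smaller KV cache
    max_model_len=2048,           # shorter context window -> smaller KV cache
    gpu_memory_utilization=0.98,  # let vLLM claim nearly all of the 14.75 GiB
)

Even with these, a 7B model in half precision is a very tight fit on a single T4.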

Commits:
- Added missing mode parameter
- Added requirements.txt as seen on the Google Colab T4 GPU runtime
- Added a list of missing packages
@VictoryLoveJessica

Hello, did you encounter this error message when running the code?
(screenshot of the error message attached)

@DmitryKey (Author)

@VictoryLoveJessica yes, the original code produces this error.
