Llama3.cu - A LLaMA3-8B CUDA Inference Engine

Llama3.cu is a CUDA-native implementation of the LLaMA3 architecture for causal language modeling. Core principles of the transformer architecture from the papers Attention Is All You Need and LLaMA: Open and Efficient Foundation Language Models are implemented as custom CUDA kernels, enabling scalable parallel processing on Nvidia GPUs.
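
To give a flavor of what such custom kernels look like, below is a minimal, self-contained sketch of an RMSNorm kernel of the kind LLaMA3's pre-normalization layers require. It is illustrative only: the kernel name, memory layout, and launch configuration are assumptions, not this repository's actual code.

// Illustrative only: a naive FP16 RMSNorm kernel, one block per token row.
// Names and layout are assumptions, not this repository's actual kernels.
#include <cuda_fp16.h>

__global__ void rmsnorm_kernel(const half *x, const half *weight, half *out,
                               int hidden_dim, float eps) {
    extern __shared__ float partial[];               // blockDim.x floats
    const half *row_in  = x   + (size_t)blockIdx.x * hidden_dim;
    half       *row_out = out + (size_t)blockIdx.x * hidden_dim;

    // Accumulate this row's sum of squares in FP32 for numerical stability.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        float v = __half2float(row_in[i]);
        sum += v * v;
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    float inv_rms = rsqrtf(partial[0] / hidden_dim + eps);

    // Normalize and apply the learned scale.
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        row_out[i] = __float2half(__half2float(row_in[i]) * inv_rms *
                                  __half2float(weight[i]));
    }
}

// Example launch: one block per token, 256 threads, dynamic shared memory.
// rmsnorm_kernel<<<num_tokens, 256, 256 * sizeof(float)>>>(x, w, out, 4096, 1e-5f);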

The model weights are expected to be downloaded from Hugging Face. They are distributed as BF16 parameters in .safetensors files and, while being loaded onto the CUDA device, are converted to FP16 via an FP32 intermediate. Since the 8B parameters alone occupy roughly 16 GB in FP16, before accounting for the KV cache and activations, a CUDA device with at least 24 GB of VRAM is required.
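
For concreteness, the following is a minimal sketch of that conversion path, assuming one thread per weight element. The kernel name and layout are illustrative assumptions, not the project's actual loader.

// Illustrative only: widen BF16 to FP32 by bit placement, then narrow to FP16.
// Kernel name and buffer layout are assumptions, not the project's actual loader.
#include <stdint.h>
#include <cuda_fp16.h>

__global__ void bf16_to_fp16_kernel(const uint16_t *src, half *dst, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint32_t fp32_bits = (uint32_t)src[i] << 16;   // BF16 is the top 16 bits of FP32
        float proxy = __uint_as_float(fp32_bits);      // reinterpret as the FP32 proxy
        dst[i] = __float2half(proxy);                  // narrow FP32 -> FP16
    }
}

// Example launch over a flat tensor of n weights read from the .safetensors file:
// bf16_to_fp16_kernel<<<(n + 255) / 256, 256>>>(bf16_dev, fp16_dev, n);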

Setup and Usage

Minimum Requirements:

- 24GB+ VRAM CUDA Device
- HuggingFace account
- Operating System: UNIX or WSL
- CUDA Toolkit (7.5+)

Run Inference

  1. Run the setup-docker.sh script to set up your virtual or physical machine to run Docker with access to Nvidia GPUs. Once the script has finished executing, log out of the terminal and log back in, then run run-docker.sh:
# Setup Docker
chmod +x setup-docker.sh
./setup-docker.sh
# Restart terminal and run
chmod +x run-docker.sh
./run-docker.sh
  2. For this inference engine to work, the SafeTensors-formatted files of the Llama3-8B model need to be stored in the ./model_weights/ folder. Head to the HuggingFace - meta-llama/Llama-3.1-8B-Instruct repo to request access to the model. Additionally, generate a Hugging Face token so that the next step can successfully download the weight files.

  3. Once the Docker container has started up, run the following command to store the Hugging Face token as an environment variable, replacing <your_token> with the token you generated.

export HF_TOKEN=<your_token>
  4. Next, run the following command to download the model parameters into the target directory.
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./model_weights/ --token $HF_TOKEN
  5. Run make 🎉.
make run

Acknowledgments

A non-exhaustive list of sources:

  1. Attention Is All You Need

  2. LLaMA: Open and Efficient Foundation Language Models

  3. RoFormer: Enhanced Transformer with Rotary Position Embedding

  4. This project makes use of the cJSON library by DaveGamble, which is licensed under the MIT License.
