Llama3.cu - A LLaMA3-8B CUDA Inference Engine

Llama3.cu is a CUDA-native implementation of the LLaMA3 architecture for causal language modeling. Core principles of the transformer architecture from the papers Attention Is All You Need and LLaMA: Open and Efficient Foundation Language Models are implemented as custom CUDA kernels, enabling scalable parallel processing on Nvidia GPUs.
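
To give a flavor of what such custom kernels look like, below is a minimal, self-contained sketch of an RMSNorm kernel of the kind LLaMA3's pre-normalization layers require. It is illustrative only: the kernel name, memory layout, and launch configuration are assumptions, not this repository's actual code.

// Illustrative only: a naive FP16 RMSNorm kernel, one block per token row.
// Names and layout are assumptions, not this repository's actual kernels.
#include <cuda_fp16.h>

__global__ void rmsnorm_kernel(const half *x, const half *weight, half *out,
                               int hidden_dim, float eps) {
    extern __shared__ float partial[];               // blockDim.x floats
    const half *row_in  = x   + (size_t)blockIdx.x * hidden_dim;
    half       *row_out = out + (size_t)blockIdx.x * hidden_dim;

    // Accumulate this row's sum of squares in FP32 for numerical stability.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        float v = __half2float(row_in[i]);
        sum += v * v;
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    float inv_rms = rsqrtf(partial[0] / hidden_dim + eps);

    // Normalize and apply the learned scale.
    for (int i = threadIdx.x; i < hidden_dim; i += blockDim.x) {
        row_out[i] = __float2half(__half2float(row_in[i]) * inv_rms *
                                  __half2float(weight[i]));
    }
}

// Example launch: one block per token, 256 threads, dynamic shared memory.
// rmsnorm_kernel<<<num_tokens, 256, 256 * sizeof(float)>>>(x, w, out, 4096, 1e-5f);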

The model weights are expected to be downloaded from Hugging Face. They are distributed as BF16 parameters in .safetensors files and, while being loaded onto the CUDA device, are converted to FP16 via an FP32 intermediate. Since the 8B parameters alone occupy roughly 16 GB in FP16, before accounting for the KV cache and activations, a CUDA device with at least 24 GB of VRAM is required.
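
For concreteness, the following is a minimal sketch of that conversion path, assuming one thread per weight element. The kernel name and layout are illustrative assumptions, not the project's actual loader.

// Illustrative only: widen BF16 to FP32 by bit placement, then narrow to FP16.
// Kernel name and buffer layout are assumptions, not the project's actual loader.
#include <stdint.h>
#include <cuda_fp16.h>

__global__ void bf16_to_fp16_kernel(const uint16_t *src, half *dst, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint32_t fp32_bits = (uint32_t)src[i] << 16;   // BF16 is the top 16 bits of FP32
        float proxy = __uint_as_float(fp32_bits);      // reinterpret as the FP32 proxy
        dst[i] = __float2half(proxy);                  // narrow FP32 -> FP16
    }
}

// Example launch over a flat tensor of n weights read from the .safetensors file:
// bf16_to_fp16_kernel<<<(n + 255) / 256, 256>>>(bf16_dev, fp16_dev, n);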

Setup and Usage

Minimum Requirements:

- 24GB+ VRAM CUDA Device
- HuggingFace account
- Operating System: UNIX or WSL
- CUDA Toolkit (7.5+)

Run Inference

  1. Run the setup-docker.sh script to set up your virtual or physical machine to run Docker with access to Nvidia GPUs. Once the script has finished executing, log out of the terminal and log back in, then run run-docker.sh:
# Setup Docker
chmod +x setup-docker.sh
./setup-docker.sh
# Restart terminal and run
chmod +x run-docker.sh
./run-docker.sh
  2. For this inference engine to work, the SafeTensors-formatted files of the Llama3-8B model need to be stored in the ./model_weights/ folder. Head to the HuggingFace - meta-llama/Llama-3.1-8B-Instruct repo to request access to the model. Additionally, generate a Hugging Face token so that the next step can successfully download the weight files.

  3. Once the Docker container has started up, run the following command to store the Hugging Face token as an environment variable, replacing <your_token> with the token you generated.

export HF_TOKEN=<your_token>
  4. Next, run the following command to download the model parameters into the target directory.
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./model_weights/ --token $HF_TOKEN
  5. Run make 🎉.
make run

Acknowledgments

A non-exhaustive list of sources:

  1. Attention Is All You Need

  2. LLaMA: Open and Efficient Foundation Language Models

  3. RoFormer: Enhanced Transformer with Rotary Position Embedding

  4. This project makes use of the cJSON library by DaveGamble, which is licensed under the MIT License.
