A port of llama2.c for the ESP32-S3 microcontroller. This project implements a lightweight version of the Llama 2 architecture optimized for embedded systems.
This project aims to create a very simple AI Dreaming Machine. The only thing this code does is generate little AI dreams. But it does it in only 512 KB of memory on a very underpowered microcontroller.
Developed by Massimo Di Leo (NuvolaProject), starting from the wonderful work of A. Karpathy and D. Bennet.
There are some minor improvements over Bennet's implementation of llama2.c on ESP32. I noticed that the original project generated more or less the same story every time, so I tweaked the code to add a bit more randomness to the seed generation. I also changed the model from TinyStories to a custom-trained version called aidreams260K. This model was trained on a dataset of 2000 AI-generated dreams. These dreams were created with llama3-8b, using custom prompts in order to get properly structured AI dreams, not human dreams.
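As a rough sketch of that reseeding idea (not the exact code in this repo), the ESP32's hardware RNG can seed a llama2.c-style xorshift sampler on every boot; `esp_random()` is a real ESP-IDF call, while `rng_state` and `seed_sampler` are illustrative names:

```c
#include <stdint.h>
#include "esp_system.h"  // esp_random(); on ESP-IDF v5.x it lives in "esp_random.h"

// 64-bit state for a llama2.c-style xorshift sampler.
static uint64_t rng_state;

static void seed_sampler(void) {
    // Two draws from the hardware RNG give a fresh 64-bit seed on
    // every boot, instead of a fixed compile-time constant.
    rng_state = ((uint64_t)esp_random() << 32) | esp_random();
    if (rng_state == 0) rng_state = 1;  // xorshift state must be nonzero
}
```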
- Runs on ESP32-S3 with minimal resources
- Custom vocabulary size of 512 tokens
- Optimized model architecture for embedded systems (mapped onto the `Config` struct sketched after this list):
  - Dimension: 64
  - Layers: 4
  - Heads: 4
  - KV Heads: 4
  - Max Sequence Length: 128
  - Multiple of: 4
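For reference, these hyperparameters correspond to the `Config` struct that llama2.c-style checkpoints carry in their header. This is the upstream llama2.c layout, not necessarily verbatim from `src/llm.h`:

```c
// Transformer hyperparameters, in llama2.c checkpoint-header order.
typedef struct {
    int dim;        // transformer dimension (64)
    int hidden_dim; // FFN hidden dimension (derived from dim and multiple_of)
    int n_layers;   // number of layers (4)
    int n_heads;    // number of query heads (4)
    int n_kv_heads; // number of key/value heads (4)
    int vocab_size; // vocabulary size (512)
    int seq_len;    // max sequence length (128)
} Config;
```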
- ESP32-S3 development board
- Minimum 2MB PSRAM
- Minimum 4MB Flash
- ESP-IDF v4.4 or later
- Python 3.7 or later (for training and tokenizer)
- Clone this repository:

```bash
git clone https://github.com/mc9625/llama2-esp32.git
cd llama2-esp32
```

- Set up the ESP-IDF environment:

```bash
. $HOME/esp/esp-idf/export.sh
```

- Configure the project:

```bash
idf.py set-target esp32s3
idf.py menuconfig
```

- Build and flash:

```bash
idf.py build
idf.py -p /dev/ttyUSB0 flash
```
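To watch the generated dreams over serial, the standard ESP-IDF monitor works (adjust the port to match your board):

```bash
idf.py -p /dev/ttyUSB0 monitor
```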
The current model uses these parameters:

```
--vocab_source=custom
--vocab_size=512
--dim=64
--n_layers=4
--n_heads=4
--n_kv_heads=4
--multiple_of=4
--max_seq_len=128
--batch_size=128
```
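Assuming the upstream llama2.c `train.py` (whose configurator accepts these flags as `--key=value` overrides), a full training invocation would look like:

```bash
python train.py \
  --vocab_source=custom --vocab_size=512 \
  --dim=64 --n_layers=4 --n_heads=4 --n_kv_heads=4 \
  --multiple_of=4 --max_seq_len=128 --batch_size=128
```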
- `src/llm.c` - Main LLM implementation
- `src/llm.h` - Header file with data structures and function declarations
- `src/main.c` - ESP32 application entry point
- `components/` - External components and dependencies
Memory usage:
- Flash: ~X MB for model weights
- PSRAM: ~Y KB for runtime buffers
- RAM: ~Z KB for stack and heap
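As an illustrative sketch (buffer names and sizes here are assumptions, not this repo's actual code), large runtime buffers can be pushed into external PSRAM with ESP-IDF's capability-aware allocator so internal RAM stays free for the stack and FreeRTOS:

```c
#include "esp_heap_caps.h"

// Hypothetical key-cache size for the model above:
// n_layers * seq_len * kv_dim = 4 * 128 * 64 floats (128 KB).
#define KV_CACHE_FLOATS (4 * 128 * 64)

float *alloc_key_cache(void) {
    // Prefer external PSRAM for the large cache.
    float *key_cache = heap_caps_malloc(KV_CACHE_FLOATS * sizeof(float),
                                        MALLOC_CAP_SPIRAM);
    if (key_cache == NULL) {
        // Fall back to any byte-addressable heap if PSRAM is exhausted.
        key_cache = heap_caps_malloc(KV_CACHE_FLOATS * sizeof(float),
                                     MALLOC_CAP_8BIT);
    }
    return key_cache;
}
```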
Current performance metrics:
- Inference speed: ~17 tokens/second
- Memory efficiency: optimized data structures, with inference running in FreeRTOS tasks
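As a minimal sketch of that task layout (the actual code in `src/main.c` may differ, and `generate_dream` is a hypothetical stand-in for the repo's generation loop), inference can run in its own FreeRTOS task with a generous stack, pinned to one core so system tasks stay on the other:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

static void inference_task(void *arg) {
    for (;;) {
        // generate_dream();  // hypothetical entry point into src/llm.c
        vTaskDelay(pdMS_TO_TICKS(1000));  // pause between dreams
    }
}

void start_inference(void) {
    // 16 KB stack, low priority, pinned to core 1.
    xTaskCreatePinnedToCore(inference_task, "llm", 16384, NULL,
                            tskIDLE_PRIORITY + 1, NULL, 1);
}
```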
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.