meme

LLM from scratch, no pre-trained models, no HF transformers

This is an implementation of a decoder-only, transformer-based LLM trained with a next-token prediction objective. It uses the HF tokenizers library, GQA (Grouped Query Attention), normalized-GPT, RoPE (Rotary Positional Embedding), and Liger Kernel.
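As a rough sketch of how GQA and RoPE fit together in a decoder-only attention block (module names, dimensions, and the rotate-halves RoPE variant below are illustrative assumptions, not taken from this repo's code):

```python
# Minimal sketch of grouped-query attention (GQA) with RoPE on queries/keys.
# All names and dimensions here are illustrative, not from this repo.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim); rotate channel halves by position-dependent angles
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, device=x.device, dtype=torch.float32) / half)
    angles = torch.arange(t, device=x.device, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GQAttention(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)  # fewer KV heads
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)  # RoPE on queries and keys only, values untouched
        # Each KV head is shared by a group of query heads
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```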

There are 6 versions:

  • Using the AdamW optimizer and a lorem ipsum dataset (broken RoPE) [colab notebook]
  • Using the SOAP optimizer and a lorem ipsum dataset (broken RoPE) [colab notebook]
  • Using the SOAP optimizer, a synthetic number dataset, and a larger parameter count (broken RoPE) [colab notebook]
  • Using the SOAP optimizer, a synthetic number dataset, a smaller parameter count, and more epochs (broken RoPE) [colab notebook]
  • Using the SOAP optimizer, a harder synthetic number dataset, optimized hyperparameters, Liger Kernel, and Fast-FFN (fixed RoPE) [colab notebook]
  • Using a tuned SOAP optimizer, a harder synthetic number dataset, optimized hyperparameters, Liger Kernel, Fast-FFN, and normalized-GPT (fixed RoPE) [colab notebook]

We publish the weights from the latest version on Hugging Face: HF Link

Note: there is a small mistake in the RoPE implementation: RoPE is also applied to the value embeddings, while it should be applied only to the queries and keys. The latest two versions fix this issue.
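For concreteness, a minimal sketch of the intended behavior (helper names are hypothetical; rope_fn stands in for whatever rotary embedding function a given version uses):

```python
# Hypothetical names; a sketch of the intended RoPE usage, not this repo's code.
def apply_rope(q, k, v, rope_fn):
    # RoPE encodes position by rotating query/key channels, so the q·k dot
    # product depends on the relative offset between positions. Rotating v as
    # well (the bug noted above) distorts the values mixed by attention
    # without adding any positional information.
    return rope_fn(q), rope_fn(k), v  # values pass through unrotated
```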