This is an implementation of a decoder-only, transformer-based LLM trained with a next-token prediction objective. It uses the tokenizers library from HF, GQA (grouped-query attention), normalized-GPT, RoPE (rotary positional embeddings), and the Liger Kernel.
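As a rough illustration of how GQA and RoPE fit together inside the attention block, here is a minimal PyTorch sketch. The class name `GQAttention`, the helper names, and the default hyperparameters are assumptions made for illustration, not the repository's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope_cache(seq_len, head_dim, base=10000.0):
    # Precompute RoPE cos/sin tables of shape (seq_len, head_dim).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    freqs = torch.outer(pos, inv_freq)          # (seq_len, head_dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)     # (seq_len, head_dim)
    return emb.cos(), emb.sin()


def rotate_half(x):
    # (x1, x2) -> (-x2, x1) on the last dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(x, cos, sin):
    # RoPE is applied to queries and keys only, never to values.
    return x * cos + rotate_half(x) * sin


class GQAttention(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x, cos, sin):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        # Each key/value head is shared by n_heads // n_kv_heads query heads.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))


# Example: a batch of 2 sequences of length 16 with model dim 512.
x = torch.randn(2, 16, 512)
cos, sin = rope_cache(seq_len=16, head_dim=512 // 8)
print(GQAttention()(x, cos, sin).shape)  # torch.Size([2, 16, 512])
```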
There are 6 versions:
- Using the AdamW optimizer and lorem ipsum datasets (broken RoPE) [colab notebook]
- Using the SOAP optimizer and lorem ipsum datasets (broken RoPE) [colab notebook]
- Using the SOAP optimizer, synthetic number datasets, and a larger parameter count (broken RoPE) [colab notebook]
- Using the SOAP optimizer, synthetic number datasets, a smaller parameter count, and more epochs (broken RoPE) [colab notebook]
- Using the SOAP optimizer, harder synthetic number datasets, optimized hyperparameters, the Liger Kernel, and Fast-FFN (fixed RoPE) [colab notebook]
- Using a tuned SOAP optimizer, harder synthetic number datasets, optimized hyperparameters, the Liger Kernel, Fast-FFN, and normalized-GPT (fixed RoPE) [colab notebook]
We publish the weights from the latest version on HF: Link
Notes: There is a small mistake in the RoPE implementation: RoPE is also applied to the value embeddings, whereas it should be applied only to the queries and keys. The latest two versions fix this issue, as shown in the sketch below.
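A minimal sketch of the difference between the broken and fixed behavior (the helper and function names are assumptions for illustration; the RoPE helpers from the sketch above are repeated here so the snippet is self-contained):

```python
import torch


def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(x, cos, sin):
    return x * cos + rotate_half(x) * sin


def rope_qkv_broken(q, k, v, cos, sin):
    # Earlier versions: RoPE was (incorrectly) also applied to the values.
    return apply_rope(q, cos, sin), apply_rope(k, cos, sin), apply_rope(v, cos, sin)


def rope_qkv_fixed(q, k, v, cos, sin):
    # Latest two versions: RoPE is applied to queries and keys only;
    # the values pass through untouched.
    return apply_rope(q, cos, sin), apply_rope(k, cos, sin), v
```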