meme

LLM from scratch, no pre-trained models, no HF transformers

This is an implementation of a decoder-only, transformer-based LLM trained with a next-token prediction objective. It uses the HF tokenizers library, GQA (Grouped Query Attention), normalized-GPT, RoPE (Rotary Positional Embedding), and Liger Kernel.
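As a rough sketch of how GQA and RoPE fit together in a decoder-only attention block (module names, dimensions, and the rotate-halves RoPE variant below are illustrative assumptions, not taken from this repo's code):

```python
# Minimal sketch of grouped-query attention (GQA) with RoPE on queries/keys.
# All names and dimensions here are illustrative, not from this repo.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim); rotate channel halves by position-dependent angles
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, device=x.device, dtype=torch.float32) / half)
    angles = torch.arange(t, device=x.device, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GQAttention(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)  # fewer KV heads
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)  # RoPE on queries and keys only, values untouched
        # Each KV head is shared by a group of query heads
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```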

There are 6 versions:

  • Using the AdamW optimizer and a lorem ipsum dataset (broken RoPE) [colab notebook]
  • Using the SOAP optimizer and a lorem ipsum dataset (broken RoPE) [colab notebook]
  • Using the SOAP optimizer, a synthetic number dataset, and a larger parameter count (broken RoPE) [colab notebook]
  • Using the SOAP optimizer, a synthetic number dataset, a smaller parameter count, and more epochs (broken RoPE) [colab notebook]
  • Using the SOAP optimizer, a harder synthetic number dataset, optimized hyperparameters, Liger Kernel, and Fast-FFN (fixed RoPE) [colab notebook]
  • Using a tuned SOAP optimizer, a harder synthetic number dataset, optimized hyperparameters, Liger Kernel, Fast-FFN, and normalized-GPT (fixed RoPE) [colab notebook]

We publish the weights from the latest version on Hugging Face: HF Link

Note: there is a small mistake in the RoPE implementation: RoPE is also applied to the value embeddings, while it should be applied only to the queries and keys. The latest two versions fix this issue.
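For concreteness, a minimal sketch of the intended behavior (helper names are hypothetical; rope_fn stands in for whatever rotary embedding function a given version uses):

```python
# Hypothetical names; a sketch of the intended RoPE usage, not this repo's code.
def apply_rope(q, k, v, rope_fn):
    # RoPE encodes position by rotating query/key channels, so the q·k dot
    # product depends on the relative offset between positions. Rotating v as
    # well (the bug noted above) distorts the values mixed by attention
    # without adding any positional information.
    return rope_fn(q), rope_fn(k), v  # values pass through unrotated
```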