[meme image]

LLM from scratch, no pre-trained models, no HF transformers

This is an implementation of a decoder-only, transformer-based LLM trained with a next-token prediction objective. The implementation uses the tokenizers library from HF, GQA (grouped-query attention), normalized-GPT, RoPE (rotary positional embedding), and the Liger Kernel.
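To illustrate the grouped-query attention part, here is a minimal PyTorch sketch (not this repository's code; the shapes and names are assumptions): each key/value head is shared by a group of query heads, which shrinks the KV projections and cache while keeping the full number of query heads.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_kv_heads):
    """Minimal GQA sketch: fewer key/value heads than query heads.

    q:    (batch, n_q_heads,  seq, head_dim)
    k, v: (batch, n_kv_heads, seq, head_dim)
    """
    n_q_heads = q.shape[1]
    group_size = n_q_heads // n_kv_heads
    # Each key/value head serves `group_size` query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example: 8 query heads sharing 2 key/value heads.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v, n_kv_heads=2)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```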

There are 6 versions:

  • Using the AdamW optimizer and lorem ipsum datasets (broken RoPE) [colab notebook]
  • Using the SOAP optimizer and lorem ipsum datasets (broken RoPE) [colab notebook]
  • Using the SOAP optimizer, synthetic number datasets, and larger parameters (broken RoPE) [colab notebook]
  • Using the SOAP optimizer, synthetic number datasets, smaller parameters, and more epochs (broken RoPE) [colab notebook]
  • Using the SOAP optimizer, harder synthetic number datasets, optimized hyperparameters, the Liger Kernel, and Fast-FFN (fixed RoPE) [colab notebook]
  • Using a tuned SOAP optimizer, harder synthetic number datasets, optimized hyperparameters, the Liger Kernel, Fast-FFN, and normalized-GPT (fixed RoPE) [colab notebook]

We publish the weights from the latest version on HF: Link

Notes: There's a small mistake in the RoPE implementation where RoPE is applied to the value embedding (it should be applied only to the query and key). The latest two versions fix this issue.
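For reference, a minimal RoPE sketch (again illustrative, not this repository's implementation) that applies the rotation only to the query and key tensors and leaves the value tensor untouched:

```python
import torch

def rope_rotate(x, theta=10000.0):
    """Apply rotary positional embedding to x of shape (batch, heads, seq, head_dim)."""
    _, _, seq, dim = x.shape
    half = dim // 2
    # Position-dependent rotation angles for each pair of dimensions.
    freqs = 1.0 / (theta ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Correct usage: rotate q and k only; v is passed to attention unrotated.
q, k, v = (torch.randn(1, 8, 16, 64) for _ in range(3))
q, k = rope_rotate(q), rope_rotate(k)
# ...attention(q, k, v)...
```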