Good training may require FOCUS

This is the GitHub repository for the paper "FOCUS: First Order Concentrated Updating Scheme".

Our results contribute both to the scientific understanding of training dynamics and to practical speedups in LLM pretraining. We propose a minimal model, a sharp valley plus gradient noise, to study training dynamics, which clarifies Adam's advantage in handling sharpness as well as its limitation when gradient noise is large. Finally, we propose FOCUS, which uses an attraction force to squeeze into sharp valleys without much reduction in effective step size, yielding an actual speedup in training GPT-2 (small).

FOCUS: First Order Concentrated Updating Scheme

The pseudocode is given below, and the implementation is in focus.py. The hyperparameter $\beta_2$ is the decay rate for the EMA of the parameters, and $\gamma$ controls the strength of the "attraction force".

[Figure: FOCUS pseudocode]
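For reference, here is a minimal PyTorch-style sketch of one FOCUS update as described above. The function name, argument names, and the exact (signed) form of the attraction term are our assumptions for illustration; focus.py is the authoritative implementation.

```python
import torch

def focus_step(param, grad, exp_avg, param_bar,
               lr=1e-3, betas=(0.9, 0.999), gamma=0.1):
    """One FOCUS-style update for a single tensor (illustrative sketch).

    exp_avg:   EMA of the gradients (first moment), decay rate betas[0].
    param_bar: EMA of the parameters themselves, decay rate betas[1].
    gamma:     strength of the attraction toward param_bar.
    """
    beta1, beta2 = betas
    # First-moment EMA of the gradient, as in Adam/Signum.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    # EMA of the parameters; beta2 is the decay rate named above.
    param_bar.mul_(beta2).add_(param, alpha=1 - beta2)
    # Sign update plus an attraction force pulling param toward its EMA
    # (the signed form of the attraction term is an assumption).
    direction = torch.sign(exp_avg) + gamma * torch.sign(param - param_bar)
    param.add_(direction, alpha=-lr)
```

Subtracting $\gamma \cdot \mathrm{sign}(\theta - \bar{\theta})$ pulls each parameter toward its running average, which is what lets the optimizer settle into a sharp valley without shrinking the effective sign-update step size.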

Toy landscape

Our picture of the LLM loss landscape is a narrowing valley. We scanned the best performance of different optimizers on this toy landscape; the best optimizer under each condition is shown as a color (Adam: orange; Signum: yellow; FOCUS: blue), yielding the phase diagrams in panels a and b of the figure below. The code can be found in './Toy'. See detailed explanations of the code in the Appendix.

[Figure: panels a and b, phase diagrams of the best optimizer on the toy landscape; panels c and d, Signum vs. Adam on MNIST]
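As a concrete (hypothetical) example of the setup, the snippet below runs a sign-based update on a narrowing-valley loss with injected gradient noise. The landscape and scanning code actually used in the paper are in './Toy'; this is only an illustration of the idea.

```python
import torch

def toy_loss(theta, k=5.0):
    # Hypothetical narrowing valley: descending along u makes the
    # transverse direction v ever sharper (illustration only; the
    # actual landscape is defined in ./Toy).
    u, v = theta
    return -u + 0.5 * k * torch.exp(u) * v ** 2

def noisy_grad(theta, sigma=1.0):
    # Exact gradient plus Gaussian noise, mimicking minibatch noise.
    theta = theta.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(toy_loss(theta), theta)
    return g + sigma * torch.randn_like(g)

theta = torch.tensor([0.0, 0.5])
for _ in range(1000):
    # signSGD-style update as a stand-in; swapping in Adam, Signum, or
    # FOCUS and sweeping (k, sigma) is how a phase diagram like panels
    # a and b can be produced.
    theta = theta - 1e-2 * torch.sign(noisy_grad(theta))
```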

MNIST

We found that small batch sizes (i.e., larger gradient noise) indeed lead to Signum outperforming Adam on MNIST classification (panels c and d of the figure above). The code for this is in './MNIST'. See detailed explanations of the code in the Appendix.
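Signum is not shipped with torch.optim, so a minimal sketch of it is given below (momentum SGD that steps by the sign of the momentum buffer); it is not necessarily identical to the code in './MNIST', but it can be compared against torch.optim.Adam at different batch sizes.

```python
import torch

class Signum(torch.optim.Optimizer):
    """Minimal Signum sketch: step by the sign of the momentum buffer."""

    def __init__(self, params, lr=1e-3, momentum=0.9):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # EMA of the gradient (momentum buffer).
                buf = self.state[p].setdefault("buf", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad, alpha=1 - group["momentum"])
                # Update by the sign of the momentum, not its magnitude.
                p.add_(torch.sign(buf), alpha=-group["lr"])
```

Shrinking the batch size of the MNIST DataLoader (say, from 256 down to 8) increases gradient noise, which is the regime where the sign-based update is expected to win.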

GPT-2

The GPT-2 training code is slightly modified from the code released with the Sophia paper. We simply added FOCUS to './GPT2/model.py' and changed the setup to use 8 V100 GPUs on 4 nodes (see the .sh files in './GPT2').
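Instantiating FOCUS in the training script might look like the sketch below; the class name and argument names are assumptions, so check focus.py for the actual interface.

```python
import torch
from focus import Focus  # class name assumed; see focus.py for the real API

model = torch.nn.Linear(768, 768)   # stand-in for the GPT-2 model
optimizer = Focus(
    model.parameters(),
    lr=6e-4,              # assumption: reuse the Adam baseline schedule
    betas=(0.9, 0.999),   # beta2 is the parameter-EMA decay rate
    gamma=0.1,            # attraction-force strength (hypothetical value)
)
```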

FOCUS is more stable than Signum and Adam on our machines. Compared to the Adam baseline in the Sophia paper, FOCUS is also faster.

[Figure: GPT-2 (small) training results for FOCUS and baselines]
