
v1.8.0

github-actions released this 30 Jan 01:12

What's new

Added 🎉

  • Added support for tensor parallelism. See the TransformerConfig class for usage.
  • Added more downstream tasks from the model ladder.
  • Added io.copy_dir() function.
  • Added new LR schedulers: LinearWithWarmup, InvSqrtWithWarmup, ConstantWithWarmup, SequentialScheduler.
  • Added option to pre-download checkpoint files from remote storage before trying to load a checkpoint.
  • Added a callback for sending Slack notifications.
  • Added support for the MPS device on Apple Silicon.
  • Added SkipStepAdamW optimizer.
  • Added support for loading model-only checkpoints with the trainer.
  • Added the option to throttle checkpoint uploads to one rank from each node at a time.
  • Added support for logging rich Table objects as text in source mixture datasets.
  • Added unshard_strategy parameter to unshard_checkpoint() function in olmo_core.distributed.checkpoint.
  • Added function load_keys() to olmo_core.distributed.checkpoint.
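To illustrate the idea behind the new schedulers, here is a minimal sketch of an inverse-square-root schedule with linear warmup, matching what the `InvSqrtWithWarmup` name describes. This is a standalone illustration of the schedule shape, not the `olmo_core` API (the actual class signatures may differ):

```python
def inv_sqrt_with_warmup(step: int, base_lr: float, warmup_steps: int) -> float:
    """Illustrative LR schedule: linear warmup, then 1/sqrt(step) decay."""
    if step < warmup_steps:
        # Ramp linearly from 0 to base_lr over the warmup period.
        return base_lr * step / warmup_steps
    # Decay proportional to 1/sqrt(step), normalized so the LR is
    # exactly base_lr at the end of warmup.
    return base_lr * (warmup_steps / step) ** 0.5
```

`SequentialScheduler` suggests that schedules like this can also be chained, e.g. a warmup phase followed by a separate decay phase, each active for a configured number of steps.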

Changed ⚠️

  • Changed storage of shared shard state in sharded checkpoints from smallest shard to lowest rank (normally 0).
  • Changed how the trainer handles loading a checkpoint when load_path is provided. Now load_path is only used if no checkpoint is found in the save_folder.

Fixed ✅

  • Added missing weights_only=False argument to fix loading train checkpoints with newer versions of PyTorch.
  • Fixed a bug where GCS uploads did not retry on transient failures.
  • Fixed a bug where source mixture datasets were truncating source files instead of randomly sampling.
  • Fixed a bug in source mixture datasets where sampling from small npy files raised an mmap exception due to 0 instances in the sampled index.

Commits

7899e7c (chore) prepare for release v1.8.0
907b9c5 Send Slack notification on releases (#151)
1ef7851 fix get_mock_batch() when training on MPS again
29a468d Fix mixture dataset class (#147)
98ccb67 remove ganymede cluster
205fe90 remove deleted cluster
7ec9114 always make mock batch on CPU
7122b1d save max steps to trainer state (#143)
9a78829 Log elapsed time per eval (#149)
075a36a Make training on the MPS device work (#131)
b4a195b Add more options to the unshard_checkpoint function to help scale (#145)
16885ab fix merge list with prefix
7b755c9 minor logging improvement
212108f Add option to throttle checkpoint uploads to one rank from each node at a time (#142)
7633461 pull fixes from 32B branch (#139)
48abe8c checkpoint hot fix (#140)
0c096e2 Handle model-only checkpoints with the trainer
9818232 move release scripts to subfolder (#137)
05ab673 update cluster list (#136)
7ccf726 add pr comments on release
0ff19d7 update citation
7519e0a Change the way load_path is handled (#132)
03a597a limit the number of exception lines posted to Slack
c634066 include link to Beaker job with Slack noties
3505660 Make context manager set original state correctly (#126)
9e0992b Add a callback for sending Slack notifications (#125)
6d60464 fix
ee27348 Sync eval changes in OLMo/ladder-1xC to here (#122)
0789479 Add option to pre-download checkpoint to load (#123)
1380f0e add copy_dir() io function
5cc704f Add learning rate schedulers (#119)
de5be27 don't check for beaker-py upgrades
b0103f0 Fix loading train state for newer versions of torch
5de774f updates
8474ee8 update docker image tags
d3f6f01 Update PyTorch and other deps in Docker images, change naming scheme of images (#120)
10c4978 Publish Docker images to GHCR (#118)
d6981b3 Add support for tensor parallelism and add OLMo2-26B model config / train script (#117)
aa4d188 Update table formatting