What's new
Added 🎉
- Added support for tensor parallelism. See the `TransformerConfig` class for usage.
- Added more downstream tasks from the model ladder.
- Added `io.copy_dir()` function.
- Added new LR schedulers: `LinearWithWarmup`, `InvSqrtWithWarmup`, `ConstantWithWarmup`, `SequentialScheduler` (see the sketch after this list).
- Added option to pre-download checkpoint files from remote storage before trying to load a checkpoint.
- Added a callback for sending Slack notifications.
- Made the MPS device work on Apple Silicon.
- Added `SkipStepAdamW` optimizer.
- The trainer can now load model-only checkpoints.
- Added the option to throttle checkpoint uploads to one rank from each node at a time.
- Added support for logging rich `Table` objects as text in source mixture datasets.
- Added `unshard_strategy` parameter to the `unshard_checkpoint()` function in `olmo_core.distributed.checkpoint`.
- Added function `load_keys()` to `olmo_core.distributed.checkpoint`.
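A minimal sketch of composing the new schedulers. The import path and the constructor fields used here (`warmup_steps`, `schedulers`, `schedulers_max`) are assumptions for illustration, not confirmed API; check the `olmo_core` docs for the real signatures.

```python
# Illustrative sketch only: the field names below (`warmup_steps`,
# `schedulers`, `schedulers_max`) are assumptions, not confirmed API.
from olmo_core.optim import (
    ConstantWithWarmup,
    LinearWithWarmup,
    SequentialScheduler,
)

# Warm up, then decay linearly for the first phase of training...
linear = LinearWithWarmup(warmup_steps=2000)
# ...then hold the learning rate constant for the remainder.
constant = ConstantWithWarmup(warmup_steps=0)

schedule = SequentialScheduler(
    schedulers=[linear, constant],
    schedulers_max=[90_000],  # hand off to `constant` after 90k steps
)
```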
Changed ⚠️
- Changed storage of shared shard state in sharded checkpoints from smallest shard to lowest rank (normally 0).
- Changed how the trainer handles loading a checkpoint when `load_path` is provided. Now `load_path` is only used if no checkpoint is found in the `save_folder` (see the schematic after this list).
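In other words, a checkpoint found in `save_folder` now takes precedence. A schematic of the new resolution order (not the trainer's actual code; the `step*` checkpoint naming is an assumption):

```python
from pathlib import Path
from typing import Optional

def resolve_checkpoint(save_folder: str, load_path: Optional[str]) -> Optional[str]:
    # A checkpoint already present in save_folder always wins.
    candidates = sorted(Path(save_folder).glob("step*"))
    if candidates:
        return str(candidates[-1])
    # load_path is now only a fallback when save_folder has no checkpoint.
    return load_path
```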
Fixed ✅
- Added missing `weights_only=False` argument to fix loading train checkpoints with newer versions of PyTorch (see the note after this list).
- Fixed bug where GCS uploads did not retry on transient failures.
- Fixed bug where source mixture datasets were truncating source files instead of randomly sampling.
- Fixed bug in source mixture datasets where sampling from small `.npy` files raised an mmap exception due to 0 instances in the sampled index.
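For context on the `weights_only` fix: newer PyTorch releases changed the default of `torch.load` to `weights_only=True`, which refuses to unpickle the arbitrary Python objects stored in train-state checkpoints. The fix passes the flag explicitly, along these lines (the path is illustrative):

```python
import torch

# Newer PyTorch defaults to weights_only=True, which rejects arbitrary
# pickled objects (optimizer state, RNG state, ...) in train checkpoints.
# Passing weights_only=False restores the old behavior; only use it with
# checkpoints from a trusted source.
state = torch.load("train/rank0.pt", weights_only=False)  # illustrative path
```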
Commits
7899e7c (chore) prepare for release v1.8.0
907b9c5 Send Slack notification on releases (#151)
1ef7851 fix `get_mock_batch()` when training on MPS again
29a468d Fix mixture dataset class (#147)
98ccb67 remove ganymede cluster
205fe90 remove deleted cluster
7ec9114 always make mock batch on CPU
7122b1d save max steps to trainer state (#143)
9a78829 Log elapsed time per eval (#149)
075a36a Make training on the MPS device work (#131)
b4a195b Add more options to the `unshard_checkpoint` function to help scale (#145)
16885ab fix merge list with prefix
7b755c9 minor logging improvement
212108f Add option to throttle checkpoint uploads to one rank from each node at a time (#142)
7633461 pull fixes from 32B branch (#139)
48abe8c checkpoint hot fix (#140)
0c096e2 Handle model-only checkpoints with the trainer
9818232 move release scripts to subfolder (#137)
05ab673 update cluster list (#136)
7ccf726 add pr comments on release
0ff19d7 update citation
7519e0a Change the way `load_path` is handled (#132)
03a597a limit the number of exception lines posted to Slack
c634066 include link to Beaker job with Slack noties
3505660 Make context manager set original state correctly (#126)
9e0992b Add a callback for sending Slack notifications (#125)
6d60464 fix
ee27348 Sync eval changes in OLMo/ladder-1xC to here (#122)
0789479 Add option to pre-download checkpoint to load (#123)
1380f0e add `copy_dir()` io function
5cc704f Add learning rate schedulers (#119)
de5be27 don't check for beaker-py upgrades
b0103f0 Fix loading train state for newer versions of torch
5de774f updates
8474ee8 update docker image tags
d3f6f01 Update PyTorch and other deps in Docker images, change naming scheme of images (#120)
10c4978 Publish Docker images to GHCR (#118)
d6981b3 Add support for tensor parallelism and add OLMo2-26B model config / train script (#117)
aa4d188 Update table formatting