Save checkpoints locally to /mnt/local_storage before reporting to train (ray-project#42157)

70B finetuning ran into out-of-disk (OOD) issues because it saved checkpoints to `/tmp`. We should save them to `/mnt/local_storage` instead.

Signed-off-by: Huaiwei Sun <[email protected]>
scottsun94 authored Jan 3, 2024
1 parent 4df79bb commit 3759098
Showing 1 changed file with 1 addition and 1 deletion.
```diff
@@ -487,7 +487,7 @@ def training_function(kwargs: dict):
                 "learning_rate": lr_scheduler.get_lr()[0],
             }

-            with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
+            with tempfile.TemporaryDirectory(dir=args.output_dir) as temp_checkpoint_dir:
                 accelerator.print(f"Saving the model locally at {temp_checkpoint_dir}")
                 accelerator.wait_for_everyone()
```
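The fix relies on the `dir=` parameter of `tempfile.TemporaryDirectory`, which places the temporary directory under a chosen mount instead of the system default (usually `/tmp`). A minimal sketch of the behavior, using the current working directory as a stand-in for `/mnt/local_storage` so it runs anywhere:

```python
import os
import tempfile

# By default, TemporaryDirectory() creates its directory under the
# system temp location (typically /tmp on Linux). Large checkpoints
# written there can exhaust a small root disk.
with tempfile.TemporaryDirectory() as default_dir:
    print("default location:", default_dir)

# Passing dir= redirects the temporary directory to another mount.
# The commit uses args.output_dir (set to /mnt/local_storage); here
# we use the current working directory purely for illustration.
target = os.getcwd()
with tempfile.TemporaryDirectory(dir=target) as temp_checkpoint_dir:
    # The directory is created directly under `target`...
    assert os.path.dirname(temp_checkpoint_dir) == target
    assert os.path.isdir(temp_checkpoint_dir)
# ...and cleaned up automatically when the context manager exits.
assert not os.path.exists(temp_checkpoint_dir)
```

The cleanup-on-exit behavior is unchanged by `dir=`; only the parent location of the checkpoint directory moves.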
