support load merged checkpoint #70105

Merged
merged 4 commits into from
Dec 17, 2024

@zhiqiu zhiqiu commented Dec 10, 2024

PR Category

Auto Parallel

PR Types

New features

Description

  • Support loading a merged (unsharded) checkpoint.
  • Fix checkpoint version management problems:
    • Allow the user to specify a unique_id; if the specified unique_id already exists, the existing checkpoint is overwritten.
    • Allow the user to delete the checkpoints in a given path.
    • Fix loading a checkpoint from a path that contains multiple versions.
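The version-management behaviour listed above can be illustrated with a small standalone sketch. It does not use Paddle: the file naming scheme (`<rank>_<unique_id>.distcp` and `<unique_id>.metadata`) and the helper names `max_unique_id`, `write_checkpoint`, and `clear_checkpoints` are assumptions made for illustration, not the code added by this PR.

```python
import os
import re
import tempfile

# Assumed (hypothetical) naming scheme: "<rank>_<unique_id>.distcp" for shard
# data and "<unique_id>.metadata" for the metadata of one checkpoint version.
_METADATA_RE = re.compile(r"^(\d+)\.metadata$")


def max_unique_id(path):
    """Return the largest unique_id found under path, or -1 if there is none."""
    ids = [int(m.group(1)) for name in os.listdir(path)
           if (m := _METADATA_RE.match(name))]
    return max(ids, default=-1)


def write_checkpoint(path, rank, payload, unique_id=None):
    """Write one rank's shard; reusing an existing unique_id overwrites it."""
    os.makedirs(path, exist_ok=True)
    if unique_id is None:
        unique_id = max_unique_id(path) + 1  # start a new version
    for name in (f"{rank}_{unique_id}.distcp", f"{unique_id}.metadata"):
        # Mode "w" truncates the file, so an existing version is replaced.
        with open(os.path.join(path, name), "w") as f:
            f.write(payload)
    return unique_id


def clear_checkpoints(path):
    """Delete every checkpoint file in path, leaving unrelated files alone."""
    for name in os.listdir(path):
        if name.endswith((".distcp", ".metadata")):
            os.remove(os.path.join(path, name))
```

In this sketch, omitting `unique_id` creates a new version, while passing an id that already exists replaces that version in place, which mirrors the overwrite rule described in the list.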

usage:

import paddle
import paddle.distributed as dist

ckpt_path = 'checkpoints/llama3.1_pretrain_ckpts/checkpoint-2/dist_ckpt'

# Print the largest unique_id found under ckpt_path.
print('maxid', dist.checkpoint.utils.get_max_id(ckpt_path))

# Load the checkpoint as a merged (unsharded) state_dict.
unsharded_state_dict = dist.checkpoint.load_state_dict.load_merged_state_dict(ckpt_path, offload=1)
print(f"unsharded_state_dict:{unsharded_state_dict}")
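For readers unfamiliar with what "merged" means here, the following is a minimal conceptual sketch of assembling one unsharded 2-D tensor from shards. It uses plain Python lists instead of Paddle tensors, and it assumes each shard is stored together with its global offset; the function name `merge_shards` and this shard layout are hypothetical and are not taken from the PR.

```python
def merge_shards(global_shape, shards):
    """Assemble a full 2-D tensor from (global_offset, block) pairs.

    global_shape: (rows, cols) of the unsharded tensor.
    shards: iterable of ((row_offset, col_offset), block), where block is a
            list of rows holding that shard's local values.
    """
    rows, cols = global_shape
    merged = [[None] * cols for _ in range(rows)]
    for (row0, col0), block in shards:
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                # Place each local value at its position in the global tensor.
                merged[row0 + i][col0 + j] = value
    if any(value is None for row in merged for value in row):
        raise ValueError("shards do not cover the full tensor")
    return merged


# Two ranks hold column-wise shards of a 2 x 3 tensor.
shards = [
    ((0, 0), [[1, 2], [3, 4]]),  # columns 0..1
    ((0, 2), [[5], [6]]),        # column 2
]
full = merge_shards((2, 3), shards)
```

The coverage check at the end is a design choice of this sketch: it fails loudly if a shard is missing rather than returning a partially filled tensor.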

Pcard-76459

paddle-bot bot commented Dec 10, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@zhiqiu zhiqiu force-pushed the dev/support_load_merged_ckpt branch 2 times, most recently from 3ba741d to 8f3b2f0 Compare December 16, 2024 04:09
@zhiqiu zhiqiu force-pushed the dev/support_load_merged_ckpt branch from 8f3b2f0 to 0ea273d Compare December 16, 2024 04:09
@zhiqiu zhiqiu force-pushed the dev/support_load_merged_ckpt branch from 0ea273d to 9504e5c Compare December 16, 2024 04:14
@jeff41404 jeff41404 left a comment

LGTM

async_save(bool): Async save the state_dict, default is False.

Note: If there is already checkpoint in
Suggested change
Note: If there is already checkpoint in
Note:
If there is already checkpoint in

@zhiqiu zhiqiu merged commit bcdfeed into PaddlePaddle:develop Dec 17, 2024
28 checks passed