[AutoParallel] optimize sharding stage1 tensor fusion save&load strategy #70309

AndSonder · 2024-12-18T09:14:53Z

PR Category

Auto Parallel

PR Types

Others

Description

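Add a switch, enable_stage1_tensor_fusion, that turns tensor fusion on or off. The switch is aligned with the one used in manual dynamic-graph parallelism.
Add an unbalanced save&load scheme. The previous scheme communicated the unevenly sliced tensor-fusion parameters back to a balanced state, but testing showed that this causes a large GPU memory increase during load. In the unbalanced scheme, parameters that are sliced under tensor fusion get a _rankn suffix appended to their names, so the sliced parameter has a different name on each card. This preserves each parameter's metadata and avoids communication between parameters of the same name. With this scheme in place, a save_unbalanced_param option is added; it is on by default, meaning the unbalanced save&load scheme is used.
Optimize the GPU memory usage of the balanced save&load scheme. It still causes considerable memory growth after the optimization, so it is not used by default.
Add unit test cases.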
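
The rank-suffix renaming described above can be pictured with a minimal sketch. This is not the code from this PR: the helper names `add_rank_suffix` and `restore_names`, the plain-dict state, and the example parameter names are all assumptions made for illustration. It only shows why distinct per-rank names let each card save and load its own uneven slice without gathering same-name parameters.

```python
# Illustrative sketch only; helper names and data layout are hypothetical,
# not Paddle's actual implementation.

def add_rank_suffix(local_state, sliced_names, rank):
    """Return a copy of local_state in which every sliced parameter
    is stored under the key `<name>_rank<rank>`."""
    renamed = {}
    for name, value in local_state.items():
        key = f"{name}_rank{rank}" if name in sliced_names else name
        renamed[key] = value
    return renamed


def restore_names(saved_state, sliced_names, rank):
    """Inverse of add_rank_suffix for one rank: map the suffixed keys
    back to the original parameter names."""
    suffixed = {f"{name}_rank{rank}": name for name in sliced_names}
    return {suffixed.get(key, key): value for key, value in saved_state.items()}


# Two ranks hold uneven slices of the same fused parameter.
sliced = {"linear.weight"}
rank0_local = {"linear.weight": [1, 2, 3, 4, 5], "linear.bias": [0.1]}
rank1_local = {"linear.weight": [6, 7, 8], "linear.bias": [0.1]}

rank0_saved = add_rank_suffix(rank0_local, sliced, rank=0)
rank1_saved = add_rank_suffix(rank1_local, sliced, rank=1)

# The sliced parameter no longer shares a name across ranks.
print(sorted(rank0_saved))
print(sorted(rank1_saved))
```

Because each rank writes its slice under its own key, loading can map the key straight back to the local parameter, which is the property the description relies on to skip the rebalancing communication.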

Related PR:

[AutoParallel] support sharding tensor-fusion save&load #69823

Pcard-76459

paddle-bot · 2024-12-18T11:44:02Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

… add_flag

add flag

49ca649

paddle-bot bot added the contributor External developers label Dec 18, 2024

AndSonder added 5 commits December 18, 2024 22:38

update

7b91c46

fix

30b25f6

add_flag

5f27544

update

2a47584

merge dev

e1e6b61

AndSonder changed the title ~~[AutoParallel] add FLAGS_enable_sharding_stage1_tensor_fusion flag~~ [AutoParallel] optimize sharding stage1 tensor fusion strategy Dec 24, 2024

AndSonder changed the title ~~[AutoParallel] optimize sharding stage1 tensor fusion strategy~~ [AutoParallel] optimize sharding stage1 tensor fusion save&load strategy Dec 24, 2024

AndSonder added 4 commits December 25, 2024 00:17

fix

fc06a37

fix

b153388

fix test

43adae7

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

5fba793

… add_flag

winter-wang approved these changes Dec 26, 2024

View reviewed changes

winter-wang merged commit 8d97458 into PaddlePaddle:develop Dec 26, 2024
28 of 29 checks passed

AndSonder mentioned this pull request Dec 27, 2024

[AutoParallel] add parameter enable_stage1_tensor_fusion_blanced_save_load and enable_stage1_tensor_fusion PaddlePaddle/PaddleNLP#9714

Merged
