Error: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group #31

Open
Amireux52 opened this issue Jul 18, 2024 · 1 comment

@Amireux52

When I run the training script in debug mode, the following error occurs:
2024-07-18 08:37:26,678 - mmdet - INFO - Checkpoints will be saved to /home/cxh/StreamMapNet/work_dirs/nusc_newsplit_480_60x30_24e by HardDiskBackend.
Backend TkAgg is interactive backend. Turning interactive mode on.
Traceback (most recent call last):
File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/pydevd.py", line 1500, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/cxh/StreamMapNet/tools/train.py", line 272, in <module>
main()
File "/home/cxh/StreamMapNet/tools/train.py", line 261, in main
custom_train_model(
File "/home/cxh/StreamMapNet/plugin/core/apis/train.py", line 30, in custom_train_model
custom_train_detector(
File "/home/cxh/StreamMapNet/plugin/core/apis/mmdet_train.py", line 203, in custom_train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 64, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home/cxh/StreamMapNet/plugin/models/mapers/base_mapper.py", line 125, in train_step
loss, log_vars, num_samples = self(**data_dict)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxh/StreamMapNet/plugin/models/mapers/base_mapper.py", line 93, in forward
return self.forward_train(*args, **kwargs)
File "/home/cxh/StreamMapNet/plugin/models/mapers/StreamMapNet.py", line 173, in forward_train
_bev_feats = self.backbone(img, img_metas=img_metas, points=points)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxh/StreamMapNet/plugin/models/backbones/bevformer_backbone.py", line 173, in forward
mlvl_feats = self.extract_img_feat(img=img, img_metas=img_metas)
File "/home/cxh/StreamMapNet/plugin/models/backbones/bevformer_backbone.py", line 144, in extract_img_feat
img_feats = self.img_neck(img_feats)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmdet/models/necks/fpn.py", line 157, in forward
laterals = [
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmdet/models/necks/fpn.py", line 158, in <listcomp>
lateral_conv(inputs[i + self.start_level])
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/cnn/bricks/conv_module.py", line 209, in forward
x = self.norm(x)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 731, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 748, in get_world_size
return _get_group_size(group)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 274, in _get_group_size
default_pg = _get_default_group()
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
python-BaseException
Traceback (most recent call last):
File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_frame.py", line 828, in trace_dispatch
if main_debugger.in_project_scope(frame.f_code.co_filename):
File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/pydevd.py", line 612, in in_project_scope
return pydevd_utils.in_project_roots(filename)
AttributeError: 'NoneType' object has no attribute 'in_project_roots'

Process finished with exit code 1
How can I solve this? Thanks, I look forward to your reply.

@yuantianyuan01
Owner

Hi, thanks for your interest in our project.

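The error appears to come from synchronized batchnorm (SyncBN), which requires distributed training: its forward pass queries the default process group, and that group does not exist when the script is launched directly. We suggest always using tools/dist_train.sh to start training, even for debugging (you can set the number of GPUs, i.e. the world size, to 1 so that breakpoints still work). In addition, our dataset sampler can only be initialized correctly under DDP mode.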
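
For debugging from an IDE instead of the shell script, one option is to reproduce the environment a distributed launcher would provide for a single process. The sketch below is only an illustration and not part of this repository: the helper names `single_process_dist_env` and `apply_env` and the port 29500 are hypothetical. The variable names themselves are the ones PyTorch's `env://` initialization reads. It assumes the training script is then started with whatever option makes it call `init_process_group` (in mmdet-style scripts this is usually a `--launcher pytorch` argument, which has not been verified for this project).

```python
import os


def single_process_dist_env(port=29500):
    """Return the variables PyTorch's env:// rendezvous reads, for one local process.

    The port is an arbitrary free port chosen for illustration.
    """
    return {
        "MASTER_ADDR": "127.0.0.1",  # rendezvous on the local machine only
        "MASTER_PORT": str(port),
        "RANK": "0",                 # the only process is rank 0
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",           # world size 1, as suggested above
    }


def apply_env(env, environ=os.environ):
    """Set each variable unless something (e.g. a real launcher) already set it."""
    for key, value in env.items():
        environ.setdefault(key, value)
    return environ
```

Calling `apply_env(single_process_dist_env())` at the very top of the debug entry point, or entering the same variables in the IDE run configuration, gives SyncBN a process group of size 1 once the script initializes distributed mode. Using tools/dist_train.sh with one GPU remains the recommended route.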

