Failed in multiple GPU training #56

Open
EvW1998 opened this issue Jan 4, 2024 · 2 comments

EvW1998 commented Jan 4, 2024

I can train on a single GPU, but when I launch multi-GPU training by running dist_train.sh, the program stops without reporting any error.

My dist_train.sh looks like this:

CUDA_VISIBLE_DEVICES=0,1 nohup python3 -m torch.distributed.launch --nproc_per_node=2 --master_port 29501 train.py --launcher pytorch > log.txt&

log.txt contains the following:

/usr/local/miniconda3/envs/pcdt/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


It looks like something is wrong with the distributed setup. Any ideas? Thanks.
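
The FutureWarning in the log above says that a script expecting a --local_rank argument should read os.environ['LOCAL_RANK'] instead once the launcher is switched to torch.distributed.run. One thing that may be worth checking is how train.py obtains its local rank. The following is only a minimal sketch of a launcher-agnostic lookup, not the repository's actual code: the helper name `resolve_local_rank` is hypothetical, and whether this relates to the silent stop reported here is an assumption that has not been verified.

```python
import argparse
import os


def resolve_local_rank(argv=None, env=None):
    """Return the local rank for this process (hypothetical helper).

    Order of preference:
      1. a --local_rank / --local-rank command-line argument, if one was passed
      2. the LOCAL_RANK environment variable, as the warning recommends
      3. 0, for a plain single-process run
    """
    env = os.environ if env is None else env

    # Only this one option is parsed; everything else (e.g. --launcher pytorch)
    # is left for the script's real argument parser.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)

    if args.local_rank is not None:
        return args.local_rank
    return int(env.get("LOCAL_RANK", 0))


if __name__ == "__main__":
    print(f"local rank: {resolve_local_rank()}")
```

In a training script, the returned value would typically be passed to torch.cuda.set_device before the process group is initialised, so that each process binds to its own GPU.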

@vehxianfish

Hi, @EvW1998. I ran into the same problem. Did you solve it?

@zhaojinbiao

I ran into the same problem too. Did you solve it?
