Divergent loss with SimCLR #1633
Comments
This looks interesting; I haven't encountered this before. What type of data are you using? And do you use sync batchnorm?
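For reference, sync batchnorm is typically enabled by converting the model's BatchNorm layers before wrapping it for distributed training; a small illustrative sketch with a torchvision ResNet (not code from this issue):

```python
import torch
import torchvision

model = torchvision.models.resnet50()
# Replace every BatchNorm layer with SyncBatchNorm so batch statistics are
# synchronized across GPUs rather than computed per GPU.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```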
I am using microscopy data (large images, typically 2000x2000 pixels) from which I randomly crop 224x224-pixel images. Using grid-like sampling I can generate >750k crops that can be used for training. I did set the sync batchnorm flag.

Something else that I realised is that the value of the loss function becomes constant at 7.624 (mean, min, and max); I tracked these values as well. This value somewhat corresponds to the loss value that I can obtain from two random vectors of size 1024x128 in the `NTXentLoss` criterion.
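For reference, a minimal sketch of that sanity check, assuming lightly's `NTXentLoss` with the temperature, batch size, and embedding dimension mentioned in this thread (the exact value will vary with the random seed):

```python
import math

import torch
from lightly.loss import NTXentLoss

# Same settings as reported: temperature 0.1, batch of 1024 embeddings with 128 dims.
criterion = NTXentLoss(temperature=0.1)
z0 = torch.randn(1024, 128)
z1 = torch.randn(1024, 128)

print(criterion(z0, z1).item())   # loss for two batches of random embeddings
print(math.log(2 * 1024 - 1))     # ln(2N - 1) ≈ 7.624: cross-entropy over 2N - 1 uniform logits,
                                  # i.e. the value reached when the embeddings fully collapse
```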
I could imagine that you are facing some numerical instabilities. We have used SimCLR on all kinds of data, including medical and microscopy images, and haven't had issues.
Thank you for taking the time to answer.
Again, thank you for the feedback. I will update if I find a fix/solution.
I have tracked some of the weights during training. One thing that I notice is that the weights of the
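As a side note for anyone reproducing that kind of tracking, a minimal sketch of logging per-parameter L2 norms during training (`log_weight_norms` is a hypothetical helper, not from this thread):

```python
import torch

def log_weight_norms(model: torch.nn.Module, step: int) -> None:
    # Print the L2 norm of every parameter tensor so frozen, drifting, or
    # exploding layers (e.g. in the projection head) stand out over time.
    for name, param in model.named_parameters():
        print(f"step={step} {name} norm={param.detach().norm().item():.4f}")

# Example: call it every few hundred steps inside the training loop.
```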
Sometimes, when training with the SimCLR method, I get a divergent loss (see attached screenshot). I wonder if anyone has ever experienced this kind of issue when training with SimCLR. This has happened to me on different occasions with ResNet-18/50 models.
I don't think that this is an issue with the code, but if anyone has ever seen this kind of problem I would be grateful for your input.
Here's some information about the training hyper-parameters:

- `batch_size` of 256 per GPU (total `batch_size`: 1024)
- `criterion`: `NTXentLoss` with a temperature of 0.1 and `gather_distributed=True`
- `optimizer`: LARS with a base learning rate of 0.3 and default parameters from here
- `scheduler`: `CosineWarmupScheduler` with 10k warmup steps
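For context, a rough sketch of how a setup like the one listed above can be wired up with lightly. The module paths (`lightly.loss.NTXentLoss`, `lightly.models.modules.SimCLRProjectionHead`, `lightly.utils.lars.LARS`, `lightly.utils.scheduler.CosineWarmupScheduler`), the learning-rate scaling, and the step counts are assumptions based on the hyper-parameters listed here, not the exact code used in this issue:

```python
import torch
import torchvision
from lightly.loss import NTXentLoss
from lightly.models.modules import SimCLRProjectionHead
from lightly.utils.lars import LARS
from lightly.utils.scheduler import CosineWarmupScheduler

# ResNet-50 backbone without the classification head, plus the SimCLR projection head.
resnet = torchvision.models.resnet50()
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])
projection_head = SimCLRProjectionHead(input_dim=2048, hidden_dim=2048, output_dim=128)

# NT-Xent loss with the settings from the issue; gather_distributed shares negatives across GPUs.
criterion = NTXentLoss(temperature=0.1, gather_distributed=True)

# LARS with the usual linear scaling rule: lr = base_lr * total_batch_size / 256.
params = list(backbone.parameters()) + list(projection_head.parameters())
optimizer = LARS(params, lr=0.3 * 1024 / 256, momentum=0.9, weight_decay=1e-6)

# Cosine schedule with 10k warmup steps, stepped once per training iteration.
total_steps = 100_000  # placeholder; depends on dataset size and number of epochs
scheduler = CosineWarmupScheduler(optimizer, warmup_epochs=10_000, max_epochs=total_steps)
```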