
Optimize convolution batching rule performance #365

Open
Tracked by #394
zou3519 opened this issue Dec 23, 2021 · 7 comments
Labels
actionable It is clear what should be done for this issue
Comments

@zou3519
Contributor

zou3519 commented Dec 23, 2021

On CUDA, when the convolution batching rule uses group convolutions, it sometimes ends up being slower than we expect on older hardware. This is probably because PyTorch's group convolution dispatches to the cudnn group convolution, which is poorly optimized on older hardware.

We should try to optimize the performance of the group convolution path. I remember that unfold+matmul can be faster at times.
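For reference, a 2D convolution can be expressed as unfold followed by a matrix multiply. A minimal sketch in plain PyTorch (illustrative only, not the batching-rule code itself):

```python
import torch
import torch.nn.functional as F

def conv2d_via_unfold(x, weight, stride=1, padding=0):
    # x: (N, C_in, H, W), weight: (C_out, C_in, kH, kW)
    N, C_in, H, W = x.shape
    C_out, _, kH, kW = weight.shape
    # Extract sliding patches: (N, C_in*kH*kW, L) where L = H_out * W_out
    cols = F.unfold(x, (kH, kW), stride=stride, padding=padding)
    # Multiply by the flattened weight: (C_out, C_in*kH*kW) @ (N, C_in*kH*kW, L)
    out = weight.view(C_out, -1) @ cols  # (N, C_out, L)
    H_out = (H + 2 * padding - kH) // stride + 1
    W_out = (W + 2 * padding - kW) // stride + 1
    return out.view(N, C_out, H_out, W_out)

x = torch.randn(2, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
assert torch.allclose(conv2d_via_unfold(x, w), F.conv2d(x, w), atol=1e-4)
```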

zou3519 added the actionable label Dec 23, 2021
@zou3519
Contributor Author

zou3519 commented Dec 23, 2021

Or maybe we want something like a global configuration switch that controls whether group convolution calls unfold+mm or not.
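A minimal sketch of what such a switch could look like (the names here are hypothetical, not an existing functorch API):

```python
# Hypothetical module-level switch; functorch does not currently ship this API.
_use_unfold_for_grouped_conv = False

def set_grouped_conv_via_unfold(enabled: bool) -> None:
    """Toggle whether the conv batching rule lowers group conv to unfold+mm."""
    global _use_unfold_for_grouped_conv
    _use_unfold_for_grouped_conv = enabled

def grouped_conv_via_unfold_enabled() -> bool:
    return _use_unfold_for_grouped_conv
```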

@vfdev-5
Contributor

vfdev-5 commented Dec 27, 2021

IMO it would be better and more user-friendly to have an option for which route to take (unfold+mm or cudnn group conv) instead of optimizing for certain "old" hardware. Just optimizing for older hardware is, I think, not a good idea...

@zou3519
Contributor Author

zou3519 commented Jan 4, 2022

An option to choose the route sounds reasonable

@samdow
Contributor

samdow commented Jan 21, 2022

Some code pointers for the implementation: Opacus' version for per-sample gradients

  • Since torch.nn.unfold only works with 4D inputs, the first version of the flag will only change behavior for conv2d (we've started some research into 5D inputs, but that will require more substantial changes)
  • Opacus uses a custom unfold2d, but we've gotten per-sample gradients to work using just torch.nn.unfold, which will probably be more efficient too (see the sketch below)
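For reference, a sketch of an unfold-based per-sample weight gradient for conv2d (the idea behind the Opacus approach), assuming 4D inputs; the function name and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def per_sample_conv2d_weight_grad(x, grad_out, kernel_size, stride=1, padding=0):
    # x: (N, C_in, H, W), grad_out: (N, C_out, H_out, W_out)
    N, C_in, _, _ = x.shape
    C_out = grad_out.shape[1]
    kH, kW = kernel_size
    # Patches of the input: (N, C_in*kH*kW, L) with L = H_out * W_out
    cols = F.unfold(x, kernel_size, stride=stride, padding=padding)
    g = grad_out.reshape(N, C_out, -1)  # (N, C_out, L)
    # Contract over the spatial dimension L to get one weight gradient per sample
    grads = torch.einsum("nol,nkl->nok", g, cols)  # (N, C_out, C_in*kH*kW)
    return grads.view(N, C_out, C_in, kH, kW)
```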

@samdow
Contributor

samdow commented Jan 28, 2022

Flagging that within the past ~month there has also been a substantial perf regression using group convolutions on A100s (the newest hardware). I can check what the comparison looks like on V100s + P100s to get data across the board. I still agree that we should have the flag; we just may want the default to be unfold.

@zou3519 zou3519 added this to the 0.1.0 milestone Feb 3, 2022
@zou3519
Contributor Author

zou3519 commented Mar 1, 2022

It's unclear if unfold + matmul is actually faster as a replacement for group convolution.

I did an experiment where I replaced all group convolutions with unfold + mm. For our ensembling example on a CNN, it is still not as performant as running a for-loop:

[image: benchmark results]
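For anyone who wants to reproduce the comparison, a minimal timing sketch of a grouped conv vs. a Python for-loop over individual convolutions (assumes a CUDA device is available; shapes are illustrative and results will vary by GPU and cuDNN version):

```python
import torch
import torch.nn.functional as F
from torch.utils.benchmark import Timer

num_models, C, H, W, C_out, k = 8, 64, 32, 32, 64, 3
x = torch.randn(16, num_models * C, H, W, device="cuda")
w = torch.randn(num_models * C_out, C, k, k, device="cuda")

def grouped():
    # One grouped convolution over all "models" at once
    return F.conv2d(x, w, groups=num_models, padding=1)

def for_loop():
    # One ordinary convolution per model, concatenated back together
    xs = x.chunk(num_models, dim=1)
    ws = w.chunk(num_models, dim=0)
    return torch.cat([F.conv2d(xi, wi, padding=1) for xi, wi in zip(xs, ws)], dim=1)

print(Timer("grouped()", globals=globals()).blocked_autorange())
print(Timer("for_loop()", globals=globals()).blocked_autorange())
```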

@zou3519
Contributor Author

zou3519 commented Mar 1, 2022

Suggestion from Horace: add a flag to disable the batching rule for convolution
