
Optimize convolution batching rule performance #365

Open
Tracked by #394
zou3519 opened this issue Dec 23, 2021 · 7 comments
Labels
actionable It is clear what should be done for this issue
Comments

@zou3519
Contributor

zou3519 commented Dec 23, 2021

On CUDA, when the convolution batching rule uses group convolutions, it sometimes ends up being slower than we expect on older hardware. This is probably because PyTorch's group convolution dispatches to the cudnn group convolution, which is poorly optimized on older hardware.

We should try to optimize the performance of the group convolution path. I remember that unfold+matmul can be faster at times.
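For reference, a 2D convolution can be expressed as unfold followed by a matrix multiply. A minimal sketch in plain PyTorch (illustrative only, not the batching-rule code itself):

```python
import torch
import torch.nn.functional as F

def conv2d_via_unfold(x, weight, stride=1, padding=0):
    # x: (N, C_in, H, W), weight: (C_out, C_in, kH, kW)
    N, C_in, H, W = x.shape
    C_out, _, kH, kW = weight.shape
    # Extract sliding patches: (N, C_in*kH*kW, L) where L = H_out * W_out
    cols = F.unfold(x, (kH, kW), stride=stride, padding=padding)
    # Multiply by the flattened weight: (C_out, C_in*kH*kW) @ (N, C_in*kH*kW, L)
    out = weight.view(C_out, -1) @ cols  # (N, C_out, L)
    H_out = (H + 2 * padding - kH) // stride + 1
    W_out = (W + 2 * padding - kW) // stride + 1
    return out.view(N, C_out, H_out, W_out)

x = torch.randn(2, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
assert torch.allclose(conv2d_via_unfold(x, w), F.conv2d(x, w), atol=1e-4)
```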

zou3519 added the actionable label Dec 23, 2021
@zou3519
Contributor Author

zou3519 commented Dec 23, 2021

Or maybe we want something like a global configuration switch that controls whether group convolution calls unfold+mm or not.
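A minimal sketch of what such a switch could look like (the names here are hypothetical, not an existing functorch API):

```python
# Hypothetical module-level switch; functorch does not currently ship this API.
_use_unfold_for_grouped_conv = False

def set_grouped_conv_via_unfold(enabled: bool) -> None:
    """Toggle whether the conv batching rule lowers group conv to unfold+mm."""
    global _use_unfold_for_grouped_conv
    _use_unfold_for_grouped_conv = enabled

def grouped_conv_via_unfold_enabled() -> bool:
    return _use_unfold_for_grouped_conv
```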

@vfdev-5
Contributor

vfdev-5 commented Dec 27, 2021

IMO it would be better and more user-friendly to have an option for which route to take (unfold+mm or cudnn group conv) instead of optimizing for certain "old" hardware. Just optimizing for older hardware is, I think, not a good idea...

@zou3519
Contributor Author

zou3519 commented Jan 4, 2022

An option to choose the route sounds reasonable

@samdow
Contributor

samdow commented Jan 21, 2022

Some code pointers for the implementation: Opacus' version for per-sample gradients

  • Since torch.nn.unfold only works with 4D inputs, the first version of the flag will only change behavior for conv2d (we've started some research into 5D inputs, but that will require more substantial changes)
  • Opacus uses a custom unfold2d, but we've gotten per-sample gradients to work using just torch.nn.unfold, which will probably be more efficient too (see the sketch below)
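For reference, a sketch of an unfold-based per-sample weight gradient for conv2d (the idea behind the Opacus approach), assuming 4D inputs; the function name and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def per_sample_conv2d_weight_grad(x, grad_out, kernel_size, stride=1, padding=0):
    # x: (N, C_in, H, W), grad_out: (N, C_out, H_out, W_out)
    N, C_in, _, _ = x.shape
    C_out = grad_out.shape[1]
    kH, kW = kernel_size
    # Patches of the input: (N, C_in*kH*kW, L) with L = H_out * W_out
    cols = F.unfold(x, kernel_size, stride=stride, padding=padding)
    g = grad_out.reshape(N, C_out, -1)  # (N, C_out, L)
    # Contract over the spatial dimension L to get one weight gradient per sample
    grads = torch.einsum("nol,nkl->nok", g, cols)  # (N, C_out, C_in*kH*kW)
    return grads.view(N, C_out, C_in, kH, kW)
```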

@samdow
Contributor

samdow commented Jan 28, 2022

Flagging that within the past ~month there has also been a substantial perf regression using group convolutions on A100s (the newest hardware). I can check what the comparison looks like on V100s + P100s to get data across the board. I still agree that we should have the flag; we just may want the default to be unfold.

@zou3519 zou3519 added this to the 0.1.0 milestone Feb 3, 2022
@zou3519
Contributor Author

zou3519 commented Mar 1, 2022

It's unclear if unfold + matmul is actually faster as a replacement for group convolution.

I did an experiment where I replaced all group convolutions with unfold + mm. For our ensembling example on a CNN, it is still not as performant as running a for-loop:

[image: benchmark results]
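For anyone who wants to reproduce the comparison, a minimal timing sketch of a grouped conv vs. a Python for-loop over individual convolutions (assumes a CUDA device is available; shapes are illustrative and results will vary by GPU and cuDNN version):

```python
import torch
import torch.nn.functional as F
from torch.utils.benchmark import Timer

num_models, C, H, W, C_out, k = 8, 64, 32, 32, 64, 3
x = torch.randn(16, num_models * C, H, W, device="cuda")
w = torch.randn(num_models * C_out, C, k, k, device="cuda")

def grouped():
    # One grouped convolution over all "models" at once
    return F.conv2d(x, w, groups=num_models, padding=1)

def for_loop():
    # One ordinary convolution per model, concatenated back together
    xs = x.chunk(num_models, dim=1)
    ws = w.chunk(num_models, dim=0)
    return torch.cat([F.conv2d(xi, wi, padding=1) for xi, wi in zip(xs, ws)], dim=1)

print(Timer("grouped()", globals=globals()).blocked_autorange())
print(Timer("for_loop()", globals=globals()).blocked_autorange())
```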

@zou3519
Contributor Author

zou3519 commented Mar 1, 2022

Suggestion from Horace: add a flag to disable the batching rule for convolution
