After a brief email exchange, Alex suggested that the easiest way to benchmark is to write a small C/C++ wrapper around cudaconv3 (where all the kernels are). I took this route, except that I wrote a Torch/FFI wrapper around the kernels instead of a C/C++ one. The repository can be found at
For details on installing Torch, see the torch7 folder. Assuming Torch is already installed, the wrapper can be installed with:
luarocks install
The benchmark is included as benchmark.lua in the torch7 folder.
The benchmark can be run with the command:
th benchmark.lua
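For readers who want a feel for what a kernel benchmark of this kind typically measures, here is a minimal, generic timing loop in C++. It is only a sketch and not the code in benchmark.lua: the helper name `mean_time_ms` and its parameters are hypothetical, and the function being timed is left to the caller.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not part of the repository): calls `fn` a few times
// to warm up, then returns the mean wall-clock time per call in milliseconds.
// `iters` must be positive.
double mean_time_ms(const std::function<void()>& fn, int warmup, int iters) {
    // Warm-up calls are excluded from the measurement.
    for (int i = 0; i < warmup; ++i) {
        fn();
    }

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        fn();
    }
    // Assumption: `fn` returns only after its work is finished. GPU kernel
    // launches are asynchronous, so a device synchronization would be needed
    // at this point before reading the clock.
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iters;
}
```

Averaging over several iterations after a warm-up reduces the effect of one-off costs such as initialization on the reported time.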