This is the instruction manual for understanding and using the MNIST training run in CUDA.
DISCLAIMER: ensure you have a GPU with compute capability 5.0 or greater (at least the Maxwell architecture). See the compatibility guide: https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html
```bash
git clone https://github.com/Infatoshi/mnist-cuda
cd mnist-cuda
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
We train an MLP on the MNIST dataset. We first implement the batched training run in PyTorch, then translate it to CUDA C/C++ using iteratively optimized GPU kernels. I purposely left out batchnorm, residual blocks, lower precision, and other optimizations to keep the code simple and easy to understand. They would also take way longer to implement and explain.
- Unified vs Explicit Memory in CUDA
- Maximizing Unified Memory Performance
- Prefetching is handled automatically by unified memory via streams (this is why it shows lower latency in the link above); a minimal prefetching sketch follows this list.
- CUDA streams - Lei Mao
- NVIDIA Docs
- Streams allow for overlapping data transfer (prefetching) with computation.
- While one stream is executing a kernel, another stream can be transferring data for the next computation.
- This technique is often called "double buffering" or "multi-buffering" when extended to more buffers; see the second sketch below.
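
To make the prefetching point concrete, here is a minimal standalone sketch (an illustrative example, not code from this repo) of unified memory with explicit prefetching: `cudaMallocManaged` gives one pointer visible to both host and device, and `cudaMemPrefetchAsync` migrates the pages on a stream ahead of the kernel launch instead of relying on on-demand page faults. Note that prefetching needs a GPU with `concurrentManagedAccess` support (Pascal or newer); on older devices the call can fail.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));  // one pointer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    int device;
    cudaGetDevice(&device);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Migrate pages to the GPU on the stream *before* the kernel runs,
    // so the kernel doesn't stall on on-demand page faults.
    cudaMemPrefetchAsync(x, n * sizeof(float), device, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(x, n, 2.0f);

    // Migrate back to the host before the CPU reads the result.
    cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);

    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```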
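
And here is a minimal sketch of the overlap ("double buffering") pattern itself: two streams alternate over chunks of data, so while one stream runs a kernel the other can be transferring its next chunk. This version uses explicit device memory plus pinned host memory (`cudaMallocHost`), which is required for copies to actually run asynchronously. Again, an illustrative example rather than the repo's training loop.

```cuda
#include <cuda_runtime.h>

__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int chunk = 1 << 20;
    const int num_chunks = 8;
    float *h_in, *h_out, *d_in[2], *d_out[2];

    // Pinned host memory: cudaMemcpyAsync on pageable memory silently
    // falls back to synchronous behavior.
    cudaMallocHost(&h_in, (size_t)chunk * num_chunks * sizeof(float));
    cudaMallocHost(&h_out, (size_t)chunk * num_chunks * sizeof(float));
    for (int i = 0; i < chunk * num_chunks; ++i) h_in[i] = 1.0f;

    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_in[s], chunk * sizeof(float));
        cudaMalloc(&d_out[s], chunk * sizeof(float));
    }

    // Alternate chunks between the two streams: while stream 0 executes
    // its kernel, stream 1 can be transferring the next chunk, and vice versa.
    for (int c = 0; c < num_chunks; ++c) {
        int s = c % 2;
        const float* src = h_in + (size_t)c * chunk;
        float* dst = h_out + (size_t)c * chunk;
        cudaMemcpyAsync(d_in[s], src, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_in[s], d_out[s], chunk);
        cudaMemcpyAsync(dst, d_out[s], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(d_in[s]);
        cudaFree(d_out[s]);
    }
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```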
We will convert the following functions to CUDA kernels:
- matmul_a_bt and matmul_at_b
- relu_forward and relu_backward
- bias_forward and bias_backward
- softmax
- compute_grad_output
- compute_output_gradients
- compute_hidden_gradients
- update_gradients
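
To give a flavor of what these conversions look like, here is a sketch of the relu_forward / relu_backward pair written as elementwise CUDA kernels, one thread per element. This is illustrative; the actual signatures and launch configurations in the repo may differ.

```cuda
// Forward: out[i] = max(0, in[i])
__global__ void relu_forward(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(in[i], 0.0f);
}

// Backward: pass the upstream gradient through only where the
// forward input was positive.
__global__ void relu_backward(const float* in, const float* grad_out,
                              float* grad_in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad_in[i] = (in[i] > 0.0f) ? grad_out[i] : 0.0f;
}
```

A typical launch uses a 1D grid sized to cover all elements, e.g. `relu_forward<<<(n + 255) / 256, 256>>>(d_in, d_out, n);`.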