This example contains an implementation of single-precision general matrix multiplication (SGEMM). The implementation is based on the one in MAGMA.
The demo contains a script that computes the matrix product of A (m x k) and B (k x n). The demo can be run with the following command.

```
python sgemm.py [--gpu GPU_ID] [--m m] [--n n] [--k k]
```
In this example, we work on an SGEMM kernel that requires the complete `cuLaunchKernel` interface (e.g., grid size and shared memory size), which is not provided by `cupy.ElementwiseKernel`.
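For contrast, here is a minimal `cupy.ElementwiseKernel` (the squared-difference example from the CuPy documentation, not part of this demo). CuPy chooses the launch configuration itself; there is no parameter for grid size or shared memory.

```python
import cupy as cp

# Minimal ElementwiseKernel (squared difference, from the CuPy docs).
# CuPy picks the grid and block sizes itself; there is no way to
# request shared memory or a particular launch configuration.
squared_diff = cp.ElementwiseKernel(
    'float32 x, float32 y',   # input parameters
    'float32 z',              # output parameter
    'z = (x - y) * (x - y)',  # element-wise operation
    'squared_diff')           # kernel name

x = cp.arange(5, dtype=cp.float32)
z = squared_diff(x, cp.float32(2))  # broadcasting a scalar works too
```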
CuPy arrays work regardless of the underlying memory layout thanks to the `ndarray` abstraction. As this example shows, however, a raw kernel bypasses that abstraction, so the underlying memory layouts of the arrays must match the ones the kernel expects.
The SGEMM kernel expects the input and output arrays to be in Fortran-contiguous (column-major) memory layout, and this layout is enforced by `cupy.asfortranarray`.
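A minimal sketch (not part of `sgemm.py`) of converting a default C-contiguous array into the Fortran-contiguous layout such a kernel expects:

```python
import cupy as cp

# CuPy arrays are C-contiguous (row-major) by default.
a = cp.random.rand(128, 64, dtype=cp.float32)
print(a.flags.f_contiguous)    # False

# cupy.asfortranarray copies the data into column-major layout,
# matching what the SGEMM kernel expects.
a_f = cp.asfortranarray(a)
print(a_f.flags.f_contiguous)  # True
```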
For compilation, the `cupy.RawKernel` class is used to compile the CUDA code written in `sgemm.cu`. The constructor takes the code as a string and the name of the kernel as arguments.
The instance is callable; when it is called, the CUDA code is compiled and the kernel is invoked. The compiled code is cached, so compilation is skipped after the first call.
Also, because the CUDA code is simply a string, it can be modified at the Python level. In this example, C macros that determine the distribution of data to threads are specified at runtime. Note that `extern "C"` needs to be placed in front of the kernel to be called.
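A hypothetical sketch of this pattern follows; the `SCALE` macro and `scale_kernel` are illustrative and not taken from `sgemm.cu`:

```python
import cupy as cp

# A C macro is injected into the source string at runtime.
scale = 2.0  # value decided at the Python level
code = '#define SCALE {}f\n'.format(scale) + r'''
extern "C" __global__ void scale_kernel(float* x, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        x[i] *= SCALE;
    }
}
'''
# extern "C" above keeps the kernel name unmangled so that
# RawKernel can find it by name.
kernel = cp.RawKernel(code, 'scale_kernel')
```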
A `cupy.RawKernel` object allows you to call the kernel with CUDA's `cuLaunchKernel` interface. In other words, you have control over grid size, block size, shared memory size, and stream.
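For example, a self-contained launch might look like the following sketch; `double_it` is an illustrative kernel, not the SGEMM kernel:

```python
import cupy as cp

code = r'''
extern "C" __global__ void double_it(float* x, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        x[i] *= 2.0f;
    }
}
'''
kernel = cp.RawKernel(code, 'double_it')

x = cp.arange(1024, dtype=cp.float32)
block = 256                           # threads per block
grid = (x.size + block - 1) // block  # enough blocks to cover x
kernel((grid,), (block,), (x, cp.int32(x.size)), shared_mem=0)
# The kernel runs on the current stream; to use another one, wrap the
# call in `with cp.cuda.Stream(): ...`.
```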
At this level of interface, it becomes straightforward to replace host-side `.cu` code that launches CUDA kernels with Python code.