Most of the examples show how to use Kernel Tuner to tune a CUDA, OpenCL, or C kernel, each demonstrating a particular use case of Kernel Tuner.
The exceptions are test_vector_add.py and test_vector_add_parameterized.py, which show how to write tests for GPU kernels with Kernel Tuner.
Below we list the example applications and the features they illustrate.
- [CUDA] [OpenCL]
- use a 2-dimensional problem domain with 2-dimensional thread blocks in a simple and clean example
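A minimal sketch of such a 2D setup (the kernel name, source file, and arguments are placeholders; the actual `tune_kernel` call needs a GPU and is shown commented out). By default, Kernel Tuner divides each problem-size dimension by `block_size_x` and `block_size_y` to obtain the grid dimensions:

```python
import numpy as np

# Tunable 2D thread block dimensions; Kernel Tuner inserts these as
# preprocessor defines into the kernel code.
tune_params = {
    "block_size_x": [16, 32],
    "block_size_y": [8, 16],
}

# A 2D problem size: with the default grid divisors, Kernel Tuner divides
# these by block_size_x and block_size_y to compute the grid dimensions.
problem_size = (4096, 4096)

# The grid as Kernel Tuner would derive it for one configuration,
# e.g. block_size_x=32, block_size_y=16:
grid = tuple(int(np.ceil(p / b)) for p, b in zip(problem_size, (32, 16)))
print(grid)  # (128, 256)

# The actual tuning call (requires a GPU backend such as PyCUDA or CuPy;
# "my_kernel", "my_kernel.cu", and args are hypothetical placeholders):
# results, env = kernel_tuner.tune_kernel("my_kernel", "my_kernel.cu",
#                                         problem_size, args, tune_params)
```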
- [CUDA] [OpenCL]
- pass a filename instead of a string with code
- use 2-dimensional thread blocks and tiling in both dimensions
- tell Kernel Tuner to compute the grid dimensions for 2D thread blocks with tiling
- use the restrictions option to limit the search to only valid configurations
- use a user-defined performance metric like GFLOP/s
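The restrictions and metrics options can be sketched as follows (the restriction shown matches the one used for tiled matrix multiplication; the matrix size `n` and the commented kernel name are placeholders). Restrictions are strings evaluated per configuration, and each metric is a callable that receives the configuration dict, including the measured `time` in milliseconds:

```python
from collections import OrderedDict

n = 512  # placeholder matrix dimension

tune_params = OrderedDict()
tune_params["block_size_x"] = [16, 32, 64]
tune_params["block_size_y"] = [1, 2, 4, 8]
tune_params["tile_size_x"] = [1, 2, 4]
tune_params["tile_size_y"] = [1, 2, 4]

# Only configurations for which all restriction strings evaluate to True
# are benchmarked:
restrict = ["block_size_x == block_size_y * tile_size_y"]

# A user-defined metric: total FLOPs of an n^3 matmul divided by the
# measured kernel time (reported in ms, hence the conversion to seconds).
metrics = OrderedDict()
metrics["GFLOP/s"] = lambda p: (2 * n**3 / 1e9) / (p["time"] / 1e3)

print(metrics["GFLOP/s"]({"time": 1.0}))  # 268.435456

# The tuning call would pass both options (args is a placeholder):
# results, env = kernel_tuner.tune_kernel("matmul_kernel", "matmul.cu",
#     (n, n), args, tune_params,
#     grid_div_x=["block_size_x * tile_size_x"],
#     grid_div_y=["block_size_y * tile_size_y"],
#     restrictions=restrict, metrics=metrics)
```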
There are several different examples centered around the convolution kernel [CUDA] [OpenCL]:
- [CUDA] [OpenCL]
- use tunable parameters for tuning for multiple input sizes
- pass constant memory arguments to the kernel
- write output to a json file
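The constant memory and JSON output features can be sketched as follows (the symbol name `d_filter` and the filter size are assumptions for illustration). `cmem_args` maps constant memory symbol names in the kernel to numpy arrays, and the list of result dicts returned by `tune_kernel` serializes directly to JSON:

```python
import json
import numpy as np

# A convolution filter passed via constant memory: the key must match the
# __constant__ symbol name in the kernel (assumed here to be "d_filter").
cmem_args = {"d_filter": np.random.randn(17 * 17).astype(np.float32)}

# The tuning call would pass cmem_args alongside the regular arguments:
# results, env = kernel_tuner.tune_kernel("convolution_kernel",
#     "convolution.cu", problem_size, args, tune_params,
#     cmem_args=cmem_args)

# tune_kernel returns one dict per benchmarked configuration; placeholder
# values are used here so the snippet runs without a GPU.
results = [{"block_size_x": 32, "block_size_y": 8, "time": 0.42}]
with open("convolution_results.json", "w") as fh:
    json.dump(results, fh)
```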
- [CUDA] [OpenCL]
- use the convolution kernel for separable filters
- write output to a csv file using Pandas
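Because `tune_kernel` returns a list of dicts, writing the results to csv with Pandas is a one-liner (placeholder result values are used here so the snippet runs without a GPU):

```python
import pandas as pd

# Placeholder results as tune_kernel would return them:
results = [
    {"block_size_x": 16, "time": 0.81},
    {"block_size_x": 32, "time": 0.55},
]

# Convert to a DataFrame and write a csv file:
pd.DataFrame(results).to_csv("sepconv_results.csv", index=False)
```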
- [CUDA] [OpenCL]
- use run_kernel to compute a reference answer
- verify the output of every benchmarked kernel
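Verification can be sketched as follows (a simple vector add is used here for brevity; kernel names are placeholders). The `answer` option is a list with one entry per kernel argument, where `None` marks arguments that should not be checked; `run_kernel` can produce that reference by executing a single, trusted configuration:

```python
import numpy as np

n = 1024
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)
args = [c, a, b, np.int32(n)]

# The reference answer, computed on the CPU here. One entry per kernel
# argument; None means "do not verify this argument".
answer = [a + b, None, None, None]

# Alternatively, run_kernel executes one configuration and returns the
# resulting argument list, which can serve as the reference:
# reference = kernel_tuner.run_kernel("vector_add", kernel_string, n,
#     args, {"block_size_x": 128})
# answer = [reference[0], None, None, None]

# Every benchmarked configuration is then checked against `answer`:
# results, env = kernel_tuner.tune_kernel("vector_add", kernel_string, n,
#     args, tune_params, answer=answer)
```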
- [CUDA]
- allocate page-locked host memory from Python
- overlap transfers to and from the GPU with computation
- tune parameters in the host code in combination with those in the kernel
- use the lang="C" option and set compiler options
- pass a list of filenames instead of strings with kernel code
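The host-code options can be sketched as follows (the filenames and the compiler flag are placeholders). The kernel source may be a list of files, and `lang="C"` with `compiler_options` compiles and benchmarks host code that launches the kernel itself:

```python
# Placeholder filenames: one file with host code, one with the kernel.
kernel_sources = ["convolution_streams.cu", "convolution_kernel.cu"]

# The tuning call would compile these as C host code with custom
# compiler options (flag shown is a placeholder):
# results, env = kernel_tuner.tune_kernel("convolution_streams",
#     kernel_sources, problem_size, args, tune_params,
#     lang="C", compiler_options=["-O3"])
print(kernel_sources)
```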
- [CUDA] [OpenCL]
- use vector types and shuffle instructions (shuffle is only available in CUDA)
- tune the number of thread blocks the kernel is executed with
- tune the partial loop unrolling factor of a for-loop
- tune a pipeline that consists of two kernels
- tune with a custom output verification function
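A custom verification function can be sketched as follows (the function body is an assumption for illustration: a reduction kernel that returns partial sums needs the comparison done on the final sum, since floating-point addition is reordered on the GPU). The function receives the reference answer and the GPU result and returns True when the output is acceptable:

```python
import numpy as np

def verify_partial_sums(cpu_result, gpu_result, atol=None):
    # The GPU returns partial sums; compare the final sum against the
    # reference with a tolerance, since summation order differs.
    if atol is None:
        atol = 1e-6
    return np.isclose(np.sum(cpu_result), np.sum(gpu_result), atol=atol)

print(verify_partial_sums(np.array([6.0]), np.array([1.0, 2.0, 3.0])))

# Passed to tune_kernel via the verify option:
# results, env = kernel_tuner.tune_kernel("sum_floats", kernel_string,
#     size, args, tune_params, answer=answer, verify=verify_partial_sums)
```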
- [CUDA]
- use scipy to compute a reference answer and verify all benchmarked kernels
- express that the number of thread blocks depends on the values of tunable parameters
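One way to express this dependency (kernel and file names are placeholders): problem-size entries may be strings containing expressions of tunable parameters, so the number of thread blocks can itself be tuned. With an empty grid divisor list, the grid dimension equals the problem size directly:

```python
tune_params = {
    "block_size_x": [64, 128, 256],
    "num_blocks": [256, 512, 1024],  # the grid size is itself tuned
}

# The first problem-size dimension is an expression of a tunable parameter:
problem_size = ("num_blocks", 1)

# Evaluating the grid for one configuration, as Kernel Tuner would:
params = {"block_size_x": 128, "num_blocks": 512}
grid_x = eval(problem_size[0], {}, params)
print(grid_x)  # 512

# The tuning call would disable the default divisor so the grid equals
# problem_size (names are placeholders):
# results, env = kernel_tuner.tune_kernel("pnpoly_kernel", "pnpoly.cu",
#     problem_size, args, tune_params, grid_div_x=[])
```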
- [CUDA]
- overlap transfers with device mapped host memory
- tune different implementations of an algorithm
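A common way to tune across implementations (the parameter name here is a hypothetical example): make the implementation choice itself a tunable parameter, which Kernel Tuner inserts as a preprocessor define that the kernel code can branch on:

```python
tune_params = {
    "block_size_x": [128, 256],
    # Hypothetical switch selecting one of two kernel implementations;
    # inside the kernel: #if use_texture_memory == 1 ... #else ... #endif
    "use_texture_memory": [0, 1],
}
print(sorted(tune_params))
```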
- [CUDA]
- perform an in-thread-block 2D reduction using the CUB library
- use C++ in CUDA kernel code
- tune multiple kernels in a pipeline