All notable changes to this project will be documented in this file. This project adheres to Semantic Versioning.
- new optimization strategies: dual annealing, greedy ILS, ordered greedy MLS, greedy MLS
- support for constant memory in cupy backend
- removed alternative Bayesian Optimization strategies that could not be used directly
- removed the C++ wrapper module that was too specific and hardly used
- support for PyTorch Tensors as input data type for kernels
- support for smem_args in run_kernel
- support for (lambda) function and string for dynamic shared memory size
- a new Bayesian Optimization strategy
- optionally store the kernel_string with store_results
- improved reporting of skipped configurations
- support for (lambda) function instead of list of strings for restrictions (see the first sketch after this list)
- support for (lambda) function instead of list for specifying grid divisors
- support for (lambda) function instead of tuple for specifying problem_size
- function to store the top tuning results
- function to create header file with device targets from stored results
- support for using tuning results in PythonKernel
- option to control measurements using observers
- support for NVML tunable parameters
- option to simulate auto-tuning searches from existing cache files (see the cache sketch after this list)
- Cupy backend to support C++ templated CUDA kernels
- support for templated CUDA kernels using PyCUDA backend
- documentation on tunable parameter vocabulary
- support for loop unrolling using parameters that start with loop_unroll_factor
- always insert "#define kernel_tuner 1" to allow preprocessor #ifdef kernel_tuner
- support for user-defined metrics (see the metrics sketch after this list)
- support for choosing the optimization starting point x0 for most strategies
- more compact output is printed to the terminal
- sequential runner now runs the first kernel in the parameter space to warm up the device
- updated tutorials to demonstrate use of user-defined metrics
- kernelbuilder functionality for including kernels in Python applications
- smem_args option for dynamically allocated shared memory in CUDA kernels (see the shared-memory sketch after this list)
- bugfix for Nvidia devices without internal current sensor
- fix for output checking, custom verify functions are called just once
- benchmarking now returns multiple results, not only time
- more sophisticated implementation of genetic algorithm strategy
- changed how the "method" option is passed: now use strategy_options
- Bayesian Optimization strategy, use strategy="bayes_opt"
- support for kernels that use texture memory in CUDA
- support for measuring energy consumption of CUDA kernels
- option to set strategy_options to pass strategy-specific options (see the strategy sketch after this list)
- option to cache and restart from tuned kernel configurations using a cachefile (see the cache sketch after this list)
- removed Python 2 support; it may still work, but we no longer test for Python 2
- removed the Noodles parallel runner
- no longer replacing kernel names with instance strings during tuning
- bugfix in tempfile creation that led to a "too many open files" error
- A minimal Fortran example and basic Fortran support
- Particle Swarm Optimization strategy, use strategy="pso"
- Simulated Annealing strategy, use strategy="simulated_annealing"
- Firefly Algorithm strategy, use strategy="firefly_algorithm"
- Genetic Algorithm strategy, use strategy="genetic_algorithm"
- bugfix for C backend for byte array arguments
- argument type mismatches now throw a warning instead of an exception
- wrapper functionality to wrap C++ functions
- citation file and zenodo doi generation for releases
- bugfix for when the iterations option is smaller than 3
- the install procedure now uses extras, e.g. [cuda,opencl]
- option quiet makes tune_kernel completely quiet
- extensive updates to documentation
- type checking for kernel arguments and answers lists
- checks for reserved keywords in tunable parameters
- checks for whether thread block dimensions are specified
- printing units for measured time with CUDA and OpenCL
- option to print all measured execution times
- bugfix for installation when scipy is not present
- bugfix for GPU cleanup when using Noodles runner
- reworked the way strings are handled internally
- option to set compiler name, when using C backend
- actively freeing GPU memory after tuning
- bugfix for 3D grids when using OpenCL
- support for dynamic parallelism when using PyCUDA
- option to use differential evolution optimization
- global optimization strategies basinhopping, minimize
- option to pass a fraction to the sample runner
- fixed a bug in memset for OpenCL backend
- parallel tuning on single node using Noodles runner
- option to pass new defaults for block dimensions
- option to pass a Python function as code generator
- option to pass custom function for output verification
- device and kernel name are printed by runner
- tune_kernel also returns a dict with environment info
- using different timer in C vector add example
- changed how scalar arguments are handled internally
- separate install and contribution guides
- allow non-tuple problem_size for 1D grids
- changed default for grid_div_y from None to block_size_y
- converted the tutorial to a Jupyter Notebook
- CUDA backend prints device in use, similar to OpenCL backend
- migrating from nosetests to pytest
- rewrote many of the examples to save results to json files
- full support for 3D grids, including option for grid_div_z
- separable convolution example
- changed the output format to list of dictionaries
- option to set compiler options
- verbose now also prints debug output when correctness check fails
- restructured the utility functions into util and core
- restructured the code to prepare for different strategies
- shortened the output printed by tune_kernel
- allowing numpy integers for specifying problem size
- a public roadmap
- requirements.txt
- example showing GPU code unit testing with the Kernel Tuner
- support for passing a (list of) filenames instead of kernel string
- runner that takes a random sample of 10 percent
- support for OpenCL platform selection
- support for using tuning parameter names in the problem size
- A function to type check the arguments to the kernel
- Example (convolution) that tunes the number of streams
- Device interface to C functions, for tuning host code
- Correctness checks for kernels during tuning
- Function for running a single kernel instance
- CHANGELOG file
- Compute Cartesian product and process restrictions before main loop
- Python 3.5 compatible code, thanks to Berend
- Support for constant memory arguments to CUDA kernels
- Use of mocking in unittests
- Reporting coverage to codacy
- OpenCL support
- Documentation pages with Convolution and Matrix Multiply examples
- Inspecting device properties at runtime
- Basic Kernel Tuning functionality
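
The sketches below illustrate a few of the features listed above. They are minimal, hypothetical examples written for this file, not excerpts from the project's documentation or test suite.

First, passing (lambda) functions instead of a tuple for problem_size and instead of a list of strings for restrictions. Both callables are assumed to receive the dictionary of tunable parameters:

```python
import numpy as np
from kernel_tuner import tune_kernel

kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""

size = 1000000
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
args = [np.zeros_like(a), a, b, np.int32(size)]

tune_params = {"block_size_x": [32, 64, 128, 256, 512]}

# problem_size as a callable instead of a tuple, and restrictions as a
# callable instead of a list of strings
results, env = tune_kernel("vector_add", kernel_string,
                           lambda p: (size,), args, tune_params,
                           restrictions=lambda p: p["block_size_x"] >= 64)
```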
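The smem_args option for dynamically allocated shared memory, with the size given as a (lambda) function of the tunable parameters. The "size" key is an assumption, the toy kernel above does not actually use shared memory, and the setup (kernel_string, size, args, tune_params) is reused from the first sketch:

```python
# shared memory size in bytes as a function of a tunable parameter
# (4 bytes per float; the "size" key is an assumption)
smem_args = {"size": lambda p: 4 * p["block_size_x"]}

results, env = tune_kernel("vector_add", kernel_string, size, args,
                           tune_params, smem_args=smem_args)
```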
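A possible use of user-defined metrics, mapping a metric name to a function of the benchmark record. This assumes the measured time is reported in milliseconds under the "time" key; the setup is again from the first sketch:

```python
from collections import OrderedDict

# derive effective bandwidth from the measured time; vector_add moves
# three arrays of 4-byte floats per invocation
metrics = OrderedDict()
metrics["GB/s"] = lambda p: (3 * size * 4 / 1e9) / (p["time"] / 1e3)

results, env = tune_kernel("vector_add", kernel_string, size, args,
                           tune_params, metrics=metrics)
```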
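Choosing an optimization strategy and passing strategy-specific options through strategy_options. The strategy name matches the entry above; the option names are illustrative assumptions:

```python
# search with a genetic algorithm instead of brute force; the option
# names ("popsize", "maxiter") are assumptions for illustration
results, env = tune_kernel("vector_add", kernel_string, size, args,
                           tune_params, strategy="genetic_algorithm",
                           strategy_options={"popsize": 20, "maxiter": 50})
```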
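Caching tuned configurations and simulating a search from an existing cache file. The cache option is listed above; the keyword name simulation_mode for the replay switch is an assumption:

```python
# record every benchmarked configuration in a cache file
results, env = tune_kernel("vector_add", kernel_string, size, args,
                           tune_params, cache="vector_add_cache.json")

# replay the search from the cache without executing on the GPU
# (the simulation_mode keyword is an assumption)
results, env = tune_kernel("vector_add", kernel_string, size, args,
                           tune_params, cache="vector_add_cache.json",
                           simulation_mode=True)
```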