Tags: NVIDIA/cudnn-frontend
# cudnn frontend v1.8 release notes (#118)

## New API

### Paged Attention API

The SDPA forward operation now supports paged attention on cudnn 9.5.0 and later by setting the appropriate page-table descriptors. `SDPA_attributes` now accepts `set_paged_attention_k_table` and `set_paged_attention_v_table` to pass these descriptors. Please refer to the samples for usage: [cpp samples](samples/cpp/sdpa/fp16_fwd_with_paged_caches.cpp), [python samples](samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb). See [docs](docs/operations/Attention.md) for more API details.

### cuda Graph API

A cudnn graph now allows the user to directly build a native cuda_graph for a given sub_graph (requires cudnn 9.5.0). There are two APIs:

- `populate_cuda_graph`: adds the cudnn nodes to the empty cuda_graph provided as input.
- `update_cuda_graph`: updates the populated cuda graph with the necessary data pointers.

See [docs](docs/cuda_graphs.md) and [backend documentation](https://docs.nvidia.com/deeplearning/cudnn/latest/api/cudnn-graph-library.html#cudnnbackendpopulatecudagraph) for more details.

### Enhancements

- Kernel cache for dynamic shapes is now supported in python. Added a [sample](test/python/test_kernel_cache.py) to showcase usage.
- `graph.deselect_engines(str: )` now has a python equivalent through pybind11.
- `graph.tensor(...)` can now accept `int64_t` scalars directly. (Previously limited to int32_t, float and fp16 data types.)
- fp8 sdpa attention now allows dropout and padding mask. Requires cudnn 9.5.0 and above.
- More enhancements to pointwise output stride inferencing (for broadcast operations). For non-unary operands, the broadcasted tensor can now be at either IN_0 or IN_1.
- SDPA backward operation now allows d up to 256 for Hopper. Requires cudnn 9.5.0 and above.

### Bug fixes

- Fixed an issue while querying `cudnnGetLastErrorString()` from the backend. The error_t object will now have a more meaningful message.
- Fixed build issues seen with the clang-19 compiler.
- Fixed an issue where it was assumed that a graph with bias in sdpa_bprop will always have a dbias.
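
To make the role of the page tables in the Paged Attention API more concrete, here is a minimal pure-Python sketch of the lookup that a paged K/V cache implies. It does not call cudnn at all: the helper `gather_paged`, the parameter `block_size`, and the list-of-blocks container layout are hypothetical and chosen only for illustration, so the linked samples remain the reference for the real tensor layouts.

```python
def gather_paged(container, page_table, block_size, seq_len):
    """Reassemble a logical sequence from a paged cache (illustrative only).

    container  : list of physical blocks, each holding `block_size` entries
    page_table : page_table[i] is the physical block that stores logical block i
    seq_len    : number of valid logical positions to read
    """
    out = []
    for pos in range(seq_len):
        # Split the logical position into (logical block, offset in block).
        logical_block, offset = divmod(pos, block_size)
        physical_block = page_table[logical_block]
        out.append(container[physical_block][offset])
    return out


# Three physical blocks of two entries each, stored out of order.
k_container = [["k2", "k3"], ["k4", "k5"], ["k0", "k1"]]
k_page_table = [2, 0, 1]  # logical block 0 lives in physical block 2, and so on

k_logical = gather_paged(k_container, k_page_table, block_size=2, seq_len=5)
print(k_logical)
```

The same indirection would apply to the V cache with its own page table, which is presumably why the API exposes separate K and V table setters.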
# cudnn FE 1.7.0 Release notes: (#111)

## New API

- Kernel Cache support for dynamic graphs: Added new APIs to enable kernel cache support for graphs with dynamic shapes. Please refer to [documentation](docs/dynamic_kernel_cache.md) for API details.

  Added examples `Convolution fprop dynamic shape`, `CSBR Graph dynamic shape`, `Matmul dynamic shape` and `Bias + Matmul dynamic shape` to showcase use of dynamic shapes and kernel cache.

- Two new APIs are introduced to describe the plan in the form of engine number and knobs.

  ```
  error_t get_plan_name(std::string &name) const;
  error_t get_plan_name_at_index(int64_t plan_index, std::string &name) const;
  ```

  Note: This name can be used later if you want to deselect_plan_by_name, for example if you run into any potential errors.

- Added an API to query tensor attributes from its UID in a graph.
  `query_tensor_with_uid(int64_t const uid, Tensor_attributes &tensor) const;`

## Improvements

- sdpa fp16 bprop node can now compute dbias when padding mask is enabled.
- sdpa fp8 (forward and bprop) nodes now support optional bias, dropout and padding mask.
- Matmul fp8 node can now accept M,N,K overrides.
- Added new python notebooks for implementing BatchNorm and BatchNorm bprop using cuDNN.
- Updated [benchmark numbers](benchmark) with cudnn 9.4.0 for fp16 and fp8 datatypes.
- Fixed compilation issues when `NV_CUDNN_DISABLE_EXCEPTION` is enabled.

## Bug fixes

- Fixed a crash when the output dimension of dgrad node is not specified. This now returns an error message instead.
- Fixed incorrect SDPA stats stride inferencing.
- Fixed a bug in sdpa test when sliding window attention is enabled and query sequence length (s_q) is greater than key length (s_kv). This case is now not supported.
cudnn FE 1.6.1 release (#99)

- Bug fix
  - Fixed an issue where custom dropout mask was not correctly applied.
  - Added `-fvisibility=hidden` for the pip wheels generated to avoid symbol conflicts with other modules that use cudnn frontend.
  - Fixed an issue in sdpa kernels which could lead to numerical mismatches.
  - Fixed an issue in sdpa fp8 fprop kernels (in inference mode).
- Samples
  - Added a new sample to showcase how a custom dropout mask can be applied to a sdpa operation.
  - Added a sample to showcase convolutions on large (`c * d * h * w > 2 ** 31`) tensors.
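
As a quick illustration of the size condition quoted for the large-tensor sample, the following sketch evaluates `c * d * h * w > 2 ** 31` for a given shape. The helper name `is_large_tensor` and the example shapes are hypothetical and are not part of the cudnn frontend API.

```python
import math

INT32_LIMIT = 2 ** 31  # threshold quoted in the release note


def is_large_tensor(c, d, h, w):
    """Return True when the per-sample element count exceeds 2**31."""
    # Python integers do not overflow, so the product is exact.
    return math.prod((c, d, h, w)) > INT32_LIMIT


large = is_large_tensor(64, 64, 1024, 1024)     # 2**32 elements
boundary = is_large_tensor(64, 32, 1024, 1024)  # exactly 2**31 elements
print(large, boundary)
```

Note that the comparison is strict, so a shape with exactly `2 ** 31` elements does not count as large under this condition.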
v1.6.0 release

New API

- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.

Bug Fixes

- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.

Enhancements

- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.

Samples

- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
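
The "preserve the stride order of the input" enhancement can be pictured with a small pure-Python sketch. This is not the frontend's actual inference code; `infer_strides_like` is a hypothetical helper that assumes the output is densely packed and simply reuses the input's innermost-to-outermost dimension ordering.

```python
def infer_strides_like(in_strides, out_dims):
    """Compute packed output strides that keep the input's stride order."""
    # Rank axes from innermost (smallest stride) to outermost.
    order = sorted(range(len(in_strides)), key=lambda axis: in_strides[axis])
    out_strides = [0] * len(out_dims)
    running = 1
    for axis in order:
        out_strides[axis] = running
        running *= out_dims[axis]
    return out_strides


# Input with dims (N, C, H, W) = (8, 32, 16, 16) stored channels-last.
in_strides = [32 * 16 * 16, 1, 16 * 32, 32]

# Output has 64 channels; the channels-last ordering is carried over.
out_strides = infer_strides_like(in_strides, [8, 64, 16, 16])
print(out_strides)
```

With a fully packed NCHW input the same helper would return NCHW-ordered strides, so the output layout follows whatever layout the input already uses.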
Release notes for cudnn-frontend 1.5.0: (#81)

[New feature] With cudnn backend 9.2.0 and above, `Graph::check_support` can determine support for runtime engines without invoking the nvrtc compiler. This allows users to check the support surface of cudnn without paying for nvrtc compilation.

[New feature] Python pip wheel now contains the necessary c++ development headers.

[New feature] Sliding window attention is now supported as an attribute to the sdpa forward and bprop node. Usage: `sdpa_attributes.set_sliding_window_length(window_length)`

[New feature] Bottom right aligned causal masking is now supported as an attribute to the sdpa forward and bprop node. Usage: `sdpa_attributes.use_causal_mask_bottom_right(true)`

[New feature] SDPA bprop attributes can choose a deterministic algorithm using the `use_deterministic_algorithm` API.

[New feature] Allow users to filter candidate execution plans of a graph by their shared memory usage in cudnn 9.2.0 and later.

[Bug fix] Fixed a runtime error that occurred when the chosen execution plan candidate was incorrectly set in the backend. This would happen when `check_support` did not correctly filter by the workspace size.

[Bug fix] Selecting/deselecting by behavior and numerical notes has now been fixed and works as intended.

[Debugging] A new tool for easy reproduction of a failure using the json representation of the graph can be found [here](tools/json_reproducer).

[Samples] Restructured the cpp samples into categories for easier navigation.

[Samples] Added a sample to showcase how different plans can be built in parallel in separate threads.

[Compilation enhancement] Added a new macro `CUDNN_FRONTEND_SKIP_NLOHMANN_JSON` as a compilation flag to remove nlohmann::json as a compilation dependency. Users lose access to certain API functions like `print`, `key`, `serialize`, `deserialize` that depend on the library.

[Enhancement] Serialization of resample operation is now supported.

[Enhancement] Bug template has been added for new github issues.
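
The two masking attributes introduced in 1.5.0 can be visualized with a small pure-Python sketch that builds the boolean mask directly. This does not use cudnn; `attention_mask` is a hypothetical helper, and both the window boundary convention (a query sees at most `window` keys ending at its diagonal) and the way the two options combine are assumptions made for illustration, so check docs/operations/Attention.md for the exact semantics.

```python
def attention_mask(s_q, s_kv, window=None, bottom_right=False):
    """Return mask[i][j] == True where query i may attend to key j."""
    # Bottom-right alignment shifts the diagonal so that the last query
    # lines up with the last key; top-left alignment uses no shift.
    shift = s_kv - s_q if bottom_right else 0
    mask = []
    for i in range(s_q):
        diag = i + shift  # index of the last key visible to query i
        row = []
        for j in range(s_kv):
            visible = j <= diag
            if window is not None:
                # Assumed convention: keep only keys with diag - j < window.
                visible = visible and (diag - j) < window
            row.append(visible)
        mask.append(row)
    return mask


top_left = attention_mask(2, 4)
bottom_right = attention_mask(2, 4, bottom_right=True)
sliding = attention_mask(4, 4, window=2)

for name, m in [("top_left", top_left), ("bottom_right", bottom_right), ("sliding", sliding)]:
    print(name)
    for row in m:
        print("".join("1" if v else "0" for v in row))
```

When `s_q == s_kv` the two alignments coincide, so the bottom-right option only matters when the query and key lengths differ.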
[New] Added a benchmark folder which contains a sample docker file to compare the cudnn implementation of sdpa with that of the pytorch implementation. (#73)

[Enhancement] Once an engine is de-selected by name, it will not be built as part of check support.

[Enhancement] The cudnn backend search order for wheels is as follows: (a) It will dlopen `libcudnn.so.MAJOR_VERSION` in the site packages. (b) It will try to dlopen the unversioned libcudnn.so. This way the pypi cudnn package nvidia-cudnn-cu* gets priority over the default search path.

[Enhancement] Allow embedding dimension up to 256 (previously limited to 128) in sdpa fprop operation.

[Bug fix] Update the scale and bias shapes in batch norm sample.
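
The versioned-then-unversioned search order described above can be sketched with Python's standard `ctypes` module. This is only an approximation of the idea: the helpers `cudnn_candidates` and `load_first_available` are hypothetical, and the sketch relies on the dynamic loader's own search path instead of explicitly looking inside site packages as the wheel does.

```python
import ctypes


def cudnn_candidates(major_version):
    """Library names to try, most specific first."""
    return [f"libcudnn.so.{major_version}", "libcudnn.so"]


def load_first_available(names):
    """Return (name, handle) for the first library that loads, else (None, None)."""
    for name in names:
        try:
            return name, ctypes.CDLL(name)
        except OSError:
            # Not found or not loadable; fall through to the next candidate.
            continue
    return None, None


loaded_name, handle = load_first_available(cudnn_candidates(9))
print(loaded_name or "cudnn not found on this machine")
```

Trying the versioned name first means a matching major version wins even when an unrelated unversioned `libcudnn.so` is also visible to the loader.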
cudnn frontend v1.3 release notes. (#72)

[New API] Added new operations `sdpa_fp8_forward` and `sdpa_fp8_backward` to perform scaled dot product attention of fp8 tensors. See more details in `docs/operations/Attention.md` and the cpp sample in `samples/cpp/mha.cpp`. Pybinds for the fp8 nodes are also added.

[New API] Added a new operation for resample forward. Added a new sample `samples/cpp/resample.cpp` to show its usage.

[New API] Added a new API `deselect_engines(std::vector<std::string> const &engine_names)` which blocks certain engine configs from running.

[New API] Added new APIs `select_numeric_notes` and `select_behavior_notes` to allow the user to select engine configs which have the selected numeric and behavior notes respectively.

[Python API] Added a custom exception `cudnnGraphNotSupportedException` to the python API to distinguish graphs that are actually not supported from programming errors.

[Python API] Added a new `backend_version_string` which returns the backend version in canonical form (e.g. 9.1.0) instead of a version number.

[Bug Fix] Updated the workspace computation for sdpa fprop node. Previously, workspace was calculated for alibi slopes irrespective of whether alibi mask was turned on or not.

[Bug Fix] Fixed deserialization of pass by values of half precision.
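
To show what "canonical form instead of a version number" means in practice, here is a small sketch that decodes an integer version into a dotted string. It is not the frontend's implementation: `canonical_version` is a hypothetical helper, and it assumes the cuDNN 9 style encoding `MAJOR * 10000 + MINOR * 100 + PATCH`, so older backends that encode the version differently are not handled.

```python
def canonical_version(version_number):
    """Decode MAJOR * 10000 + MINOR * 100 + PATCH into 'MAJOR.MINOR.PATCH'."""
    # Assumption: cuDNN 9 style integer encoding.
    major, rest = divmod(version_number, 10000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


v91 = canonical_version(90100)
v95 = canonical_version(90500)
print(v91, v95)
```

A string in this form is easier to compare against the minimum versions quoted throughout these notes than the raw integer.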
cudnn frontend v1.2.1 release notes. (#69)

[Bug Fix] cudnn-frontend pip wheels will now dlopen the fully versioned library (`libcudnn.so.8` or `libcudnn.so.9`) first before trying to load `libcudnn.so`. This means the pip wheels in the RUN_PATH will be prioritized over system paths (default behavior of dlopen). This can be overridden by setting the `LD_LIBRARY_PATH`. Source installation will now automatically look at cudnn in site packages before the system path.

[Documentation] Fixed the google-colab links in the jupyter notebooks.

[Documentation] Added a jupyter notebook sample, `00_introduction.ipynb`, to go over the basics of the cudnn FE graph API.