Tags: NVIDIA/cudnn-frontend

v1.8.0

# cudnn frontend v1.8 release notes (#118)

## New API

### Paged Attention API
SDPA forward operation now supports paged attention on cudnn 9.5.0 and
later by setting the appropriate page-table descriptors.
`SDPA_attributes` now accepts `set_paged_attention_k_table` and
`set_paged_attention_v_table` to pass these descriptors. Please refer to
the samples for usage: [cpp
samples](samples/cpp/sdpa/fp16_fwd_with_paged_caches.cpp), [python
samples](samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb).
See [docs](docs/operations/Attention.md) for more API details.
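As a rough illustration, here is a hedged C++ sketch of wiring the page tables into an SDPA forward call. The shapes, strides, names, and the pre-existing `graph`, `Q`, `K_container`, `V_container`, `b`, and `blocks_per_seq` are illustrative assumptions, not the required layout; see the linked samples for a working setup.

```cpp
namespace fe = cudnn_frontend;

// Page tables hold, per batch, the block indices into the paged K/V containers.
// Dims, strides, and data types below are illustrative assumptions.
auto page_table_k = graph.tensor(fe::graph::Tensor_attributes()
                                     .set_name("page_table_k")
                                     .set_dim({b, 1, blocks_per_seq, 1})
                                     .set_stride({blocks_per_seq, blocks_per_seq, 1, 1})
                                     .set_data_type(fe::DataType_t::INT32));
auto page_table_v = graph.tensor(fe::graph::Tensor_attributes()
                                     .set_name("page_table_v")
                                     .set_dim({b, 1, blocks_per_seq, 1})
                                     .set_stride({blocks_per_seq, blocks_per_seq, 1, 1})
                                     .set_data_type(fe::DataType_t::INT32));

auto sdpa_attributes = fe::graph::SDPA_attributes()
                           .set_name("paged_sdpa")
                           .set_is_inference(true)
                           .set_paged_attention_k_table(page_table_k)   // new in 1.8
                           .set_paged_attention_v_table(page_table_v);  // new in 1.8

// Q is a regular dense tensor; K/V point at the paged containers.
auto [O, Stats] = graph.sdpa(Q, K_container, V_container, sdpa_attributes);
O->set_output(true);
```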

### CUDA Graph API
The cudnn graph API now allows users to directly build a native CUDA graph for a
given sub-graph (requires cudnn 9.5.0). There are two APIs:
 - `populate_cuda_graph` : adds the cudnn nodes to the empty CUDA graph
 provided as input.
 - `update_cuda_graph` : updates the populated CUDA graph with the necessary
 data pointers.

See [docs](docs/cuda_graphs.md) and [backend
documentation](https://docs.nvidia.com/deeplearning/cudnn/latest/api/cudnn-graph-library.html#cudnnbackendpopulatecudagraph)
for more details.
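A hedged sketch of the intended flow, assuming `graph` has already been built and that `handle`, `variant_pack` (the same UID-to-pointer map used for `execute`), `new_variant_pack`, `workspace_ptr`, and `stream` exist; the exact argument lists of the two calls are assumptions here, so consult the linked docs for the authoritative signatures.

```cpp
// Build an empty CUDA graph and let cudnn populate it with its nodes.
cudaGraph_t cudnn_cuda_graph;
cudaGraphCreate(&cudnn_cuda_graph, 0);
auto populate_status = graph.populate_cuda_graph(handle, variant_pack, workspace_ptr, cudnn_cuda_graph);
assert(populate_status.is_good());

cudaGraphExec_t cuda_graph_exec;
cudaGraphInstantiate(&cuda_graph_exec, cudnn_cuda_graph, 0);
cudaGraphLaunch(cuda_graph_exec, stream);

// On later iterations, refresh only the data pointers instead of re-populating.
auto update_status = graph.update_cuda_graph(handle, new_variant_pack, workspace_ptr, cudnn_cuda_graph);
assert(update_status.is_good());

// Depending on how the update propagates, the instantiated executable graph may
// need a cudaGraphExecUpdate before the next launch (assumption).
cudaGraphExecUpdateResultInfo update_info;
cudaGraphExecUpdate(cuda_graph_exec, cudnn_cuda_graph, &update_info);
cudaGraphLaunch(cuda_graph_exec, stream);
```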

### Enhancements

- Kernel cache for dynamic shapes is now supported in Python. Added a
[sample](test/python/test_kernel_cache.py) to showcase usage.

- `graph.deselect_engines(...)` now has a Python equivalent through
pybind11.

- `graph.tensor(...)` can now accept `int64_t` scalars directly.
(Previously limited to `int32_t`, `float`, and fp16 data types.)

- fp8 sdpa attention now allows dropout and padding masks. Requires cudnn
9.5.0 and above.

- More enhancements to pointwise output stride inferencing (for the
broadcast operation). For non-unary operands, the broadcast tensor can
now be at either IN_0 or IN_1; see the sketch after this list.

- SDPA backward operation now allows head dimension (d) up to 256 on Hopper.
Requires cudnn 9.5.0 and above.
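To illustrate the broadcast enhancement, here is a small hedged sketch where the broadcast operand sits at IN_0; the dims (`n`, `c`, `h`, `w`), strides, and the pre-existing `graph` are illustrative assumptions.

```cpp
namespace fe = cudnn_frontend;

// Broadcast operand passed first (IN_0): a per-channel bias added to a full tensor.
auto bias = graph.tensor(fe::graph::Tensor_attributes()
                             .set_name("bias")
                             .set_dim({1, c, 1, 1})
                             .set_stride({c, 1, 1, 1}));
auto x    = graph.tensor(fe::graph::Tensor_attributes()
                             .set_name("x")
                             .set_dim({n, c, h, w})
                             .set_stride({c * h * w, h * w, w, 1}));

// The output stride order is now inferred correctly whether the broadcast
// tensor is passed as IN_0 (as here) or as IN_1.
auto y = graph.pointwise(bias, x,
                         fe::graph::Pointwise_attributes()
                             .set_mode(fe::PointwiseMode_t::ADD));
y->set_output(true);
```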

### Bug fixes

- Fixed an issue when querying `cudnnGetLastErrorString()` from the
backend. The `error_t` object will now carry a more meaningful message.

- Fixed build issues seen with clang-19 compiler.

- Fixed an issue where a graph with bias in sdpa_bprop was assumed to
always have a dbias.

v1.7.0

# cudnn FE 1.7.0 release notes (#111)

## New API

- Kernel cache support for dynamic graphs
Added new APIs to enable kernel cache support for graphs with dynamic shapes. Please refer to [documentation](docs/dynamic_kernel_cache.md) for API details; a short sketch also follows at the end of this list.

Added examples `Convolution fprop dynamic shape`, `CSBR Graph dynamic shape`, `Matmul dynamic shape` and `Bias + Matmul dynamic shape` to showcase the use of dynamic shapes and the kernel cache.

- Two new APIs are introduced to describe a plan in terms of its engine number and knobs.
```
error_t
get_plan_name(std::string &name) const;

error_t
get_plan_name_at_index(int64_t plan_index, std::string &name) const;
```
Note:
This name can later be passed to `deselect_plan_by_name` if you run into any issues with that plan. See the sketch at the end of this list.

- Added an API to query a tensor's attributes from its UID in a graph.
`query_tensor_with_uid(int64_t const uid, Tensor_attributes &tensor) const;`
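A hypothetical C++ sketch tying these additions together, assuming a built `graph` and a known tensor UID `x_uid`; the kernel-cache type (`fe::KernelCache`) and setters (`set_dynamic_shape_enabled`, `set_kernel_cache`) are assumptions here, so refer to the linked documentation for the exact API.

```cpp
namespace fe = cudnn_frontend;

// Kernel cache shared by graphs that differ only in their (dynamic) shapes.
// Type and setter names are assumptions; see docs/dynamic_kernel_cache.md.
auto kernel_cache = std::make_shared<fe::KernelCache>();
graph.set_dynamic_shape_enabled(true).set_kernel_cache(kernel_cache);

// Describe a candidate plan (engine number + knobs) by index.
std::string plan_name;
if (graph.get_plan_name_at_index(0, plan_name).is_good()) {
    std::cout << "candidate 0: " << plan_name << "\n";
}
// If that plan later misbehaves, it can be skipped on a subsequent build:
// graph.deselect_plan_by_name(plan_name);

// Recover a tensor's attributes from its UID.
fe::graph::Tensor_attributes tensor_attributes;
auto query_status = graph.query_tensor_with_uid(x_uid, tensor_attributes);
assert(query_status.is_good());
```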

## Improvements

- sdpa fp16 bprop node can now compute dbias when padding mask is enabled.

- sdpa fp8 (forward and bprop) nodes now support optional bias, dropout and padding mask.

- Matmul fp8 node can now accept M,N,K overrides.

- Added new python notebooks for implementing BatchNorm and BatchNorm bprop using cuDNN.

- Updated [benchmark numbers](benchmark) with cudnn 9.4.0 for fp16 and fp8 datatypes.

- Fixed compilation issues when `NV_CUDNN_DISABLE_EXCEPTION` is enabled.

## Bug fixes

- Fixed a crash when the output dimension of the dgrad node is not specified. An error message is now returned instead.

- Fixed incorrect SDPA stats stride inferencing.

- Fixed a bug in the sdpa test when sliding window attention is enabled and the query sequence length (s_q) is greater than the key/value sequence length (s_kv). This case is now reported as not supported.

v1.6.1

cudnn FE 1.6.1 release (#99)

- Bug fixes

  - Fixed an issue where a custom dropout mask was not correctly applied.
  - Added `-fvisibility=hidden` for the generated pip wheels to avoid
  symbol conflicts with other modules that use cudnn frontend.
  - Fixed an issue in sdpa kernels which could lead to numerical
  mismatches.
  - Fixed an issue in sdpa fp8 fprop kernels (in inference mode).

- Samples

  - Added a new sample to showcase how a custom dropout mask can be
  applied to an sdpa operation.
  - Added a sample to showcase convolutions on large (`c * d * h * w > 2 **
  31`) tensors.

v1.6.0

v1.6.0 release

## New API

- Graph Slice Operation: Introduced the `graph.slice` operation for slicing input tensors. Refer to `docs/operations/Slice.md` for detailed documentation and `samples/cpp/misc/slice.cpp` for a C++ sample. Pybinds for this operation have also been added. A short sketch follows below.
- SM Carveout Feature: Added the `set_sm_count(int32_t type)` graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
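A brief, hedged illustration of both features; `set_sm_count` is quoted from the note above, while the `Slice_attributes` setter name and its per-dimension `{start, end}` pairs are assumptions made for illustration only (see `docs/operations/Slice.md` for the actual signature), as are `graph`, `input`, `b`, `h`, and `d`.

```cpp
namespace fe = cudnn_frontend;

// SM carveout: restrict execution to a subset of SMs. Engines that cannot
// honor this return NOT_SUPPORTED.
graph.set_sm_count(64);

// Slice an existing tensor `input`; the setter and the per-dimension
// {start, end} ranges below are assumptions for illustration only.
auto sliced = graph.slice(input,
                          fe::graph::Slice_attributes()
                              .set_name("slice_0")
                              .set_slices({{0, b}, {0, h}, {16, 48}, {0, d}}));
sliced->set_output(true);
```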
## Bug Fixes

- Convolution Mode Attribute: Added the missing `set_convolution_mode` attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.

## Enhancements

- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.

## Samples

- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.

v1.5.2

Release notes for cudnn-frontend 1.5.2: (#86)

[Enhancement] Allows a stride value of 0, indicating repetition of the tensor in those dimensions.

v1.5.1

Release notes for cudnn-frontend 1.5.1: (#84)

[Bug fix] Fixed an issue where cudnn-frontend 1.5.0, when built with
cudnn version 9.1.1 or below, ran into issues when run with 9.2.0 and
above.

v1.5.0

Release notes for cudnn-frontend 1.5.0: (#81)

[New feature] With cudnn backend 9.2.0 and above, `Graph::check_support`
can determine support for runtime-compiled engines without invoking the
nvrtc compiler. This allows users to check the support surface of cudnn
without incurring nvrtc compilation.

[New feature] Python pip wheel now contains the necessary c++
development headers.

[New feature] Sliding window attention is now supported as an attribute
to the sdpa forward and bprop node. Usage:
`sdpa_attributes.set_sliding_window_length(window_length)`

[New feature] Bottom right aligned causal masking is now supported as an
attribute to the sdpa forward and bprop node. Usage:
`sdpa_attributes.use_causal_mask_bottom_right(true)`
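For context, a minimal sketch of the two new attributes on SDPA attribute objects; whether they can be combined on a single node depends on the cuDNN version, so they are shown separately, and `window_length` is assumed to be defined.

```cpp
namespace fe = cudnn_frontend;

// Sliding window attention: each query attends only to the last
// `window_length` keys (builds on the causal mask).
auto swa_attributes = fe::graph::SDPA_attributes()
                          .use_causal_mask(true)
                          .set_sliding_window_length(window_length);

// Bottom-right aligned causal mask, useful when s_q != s_kv.
auto brcm_attributes = fe::graph::SDPA_attributes()
                           .use_causal_mask_bottom_right(true);
```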

[New feature] SDPA bprop attributes can choose deterministic algorithm
using the `use_deterministic_algorithm` API.

[New feature] Allow users to filter a graph's candidate execution plans
by their shared memory usage in cudnn 9.2.0 and later.

[Bug fix] Fixed a runtime error that occurred when the chosen execution
plan candidate was incorrectly set in the backend. This would happen when
`check_support` did not correctly filter by the workspace size.

[Bug fix] Selecting/deselecting engines by behavior and numerical notes
has been fixed and now works as intended.

[Debugging] A new tool for easy reproduction of a failure using the json
representation of the graph can be found [here](tools/json_reproducer).

[Samples] Restructured the cpp samples into categories for easier
navigation.

[Samples] Added a sample to showcase how different plans can be built in
parallel in separate threads.

[Compilation enhancement] Added a new macro
`CUDNN_FRONTEND_SKIP_NLOHMANN_JSON` as a compilation flag to drop
nlohmann::json as a compilation dependency. Users lose access to certain
API functions like `print`, `key`, `serialize`, `deserialize` that
depend on the library.

[Enhancement] Serialization of resample operation is now supported.

[Enhancement] A bug template has been added for new GitHub issues.

v1.4.0

[New] Added a benchmark folder which contains a sample docker file to
compare the cudnn implementation of sdpa with the pytorch implementation. (#73)

[Enhancement] Once an engine is de-selected by name, it will not be
built as part of `check_support`.

[Enhancement] The cudnn backend search order for the pip wheels is as follows:
(a) dlopen `libcudnn.so.MAJOR_VERSION` from the site packages; (b) try to
dlopen the unversioned `libcudnn.so`. This way the PyPI cudnn package
`nvidia-cudnn-cu*` gets priority over the default search path.

[Enhancement] Allow embedding dimension up to 256 (previously limited to
128) in the sdpa fprop operation.

[Bug fix] Updated the scale and bias shapes in the batch norm sample.

v1.3.0

cudnn frontend v1.3 release notes. (#72)

[New API] Added new operations `sdpa_fp8_forward` and `sdpa_fp8_backward` to perform scaled dot product attention on fp8 tensors. See more details in `docs/operations/Attention.md` and the cpp sample in `samples/cpp/mha.cpp`. Pybinds for the fp8 nodes are also added.

[New API] Added a new operation for resample forward. Added a new sample `samples/cpp/resample.cpp` to show its usage.

[New API] Added a new API `deselect_engines(std::vector<std::string> const &engine_names)` which blocks certain engine configs from running.

[New API] Added new APIs `select_numeric_notes` and `select_behavior_notes` to allow users to select engine configs which have the chosen numeric and behavior notes, respectively.
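A hedged sketch of these selection APIs on an existing `graph`; the engine-name string and the note enum spellings below are illustrative assumptions.

```cpp
namespace fe = cudnn_frontend;

// Block a specific engine config from being considered at build time.
graph.deselect_engines({"eng0_k2=3_k13=0"});  // hypothetical engine name string

// Keep only engine configs that carry the chosen numeric / behavior notes
// (enum values are assumptions).
graph.select_numeric_notes({fe::NumericalNote_t::TENSOR_CORE});
graph.select_behavior_notes({fe::BehaviorNote_t::RUNTIME_COMPILATION});
```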

[Python API] Added a custom exception `cudnnGraphNotSupportedException` to the python API to distinguish graphs that are actually not supported from programming errors.

[Python API] Added a new `backend_version_string` which returns the backend version in canonical form (e.g. 9.1.0) instead of a version number.

[Bug Fix] Updated the workspace computation for the sdpa fprop node. Previously, workspace was calculated for alibi slopes irrespective of whether the alibi mask was turned on.

[Bug Fix] Fixed deserialization of half-precision pass-by-value tensors.

v1.2.1

cudnn frontend v1.2.1 release notes. (#69)

[Bug Fix] cudnn-frontend pip wheels will now dlopen the fully versioned
library (`libcudnn.so.8` or `libcudnn.so.9`) before trying to load
`libcudnn.so`. This means the pip wheels in the RUN_PATH will be
prioritized over system paths (the default behavior of dlopen). This can be
overridden by setting `LD_LIBRARY_PATH`. Source installation will
now automatically look for cudnn in site packages before the system path.

[Documentation] Fixed the google-colab links in the jupyter notebooks.

[Documentation] Added a jupyter notebook sample, `00_introduction.ipynb`, that goes over the basics of the cudnn FE graph API.