CUTLASS 3.1

@hwu36 hwu36 released this 24 May 20:10
· 199 commits to main since this release
6f47420
  • New CUTLASS Python interface that aims to make it easy to instantiate, emit, compile, and run CUTLASS kernels from Python. More details here and new examples.
  • New efficient epilogues using TMA for Hopper.
  • Support for fused epilogues, such as Bias, ReLU, and GELU, using the new efficient epilogues.
  • New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
  • New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
  • An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
  • Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization.
  • Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
  • Performance optimizations for the warp-specialized persistent ping-pong kernel.
  • Changes to the GEMM API 3.x, involving the host-facing arguments and the underlying Params structs.
  • FMHA Backward Pass from Meta xFormers.
  • StreamK GEMM with Broadcast enables epilogue broadcast with StreamK GEMM.
  • Batched B2B GEMM can now run multiple Back-to-Back GEMMs with the same problem size in parallel.
  • Batched Strided GEMV supports both row-major and column-major input matrices.
  • Permute + GEMM fusion can now fuse a Permute with the GEMM that follows it. Previously, only fusing a GEMM with a Permute in the epilogue was supported.
  • Row Broadcast can be fused in the epilogue.
  • The GitHub branch has been renamed from master to main in this release.
  • Optimal performance using CUDA 12.1
  • Updates and bugfixes from the community (thanks!)
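
As a rough illustration of what the fused epilogues listed above (Bias, ReLU, GELU, Row Broadcast) compute, here is a minimal pure-Python reference sketch. It does not use CUTLASS or its Python interface. The function name `gemm_fused_epilogue`, its parameters, and the choice of a length-N bias vector added to every output row are assumptions made for this illustration; the actual CUTLASS epilogues may lay out and apply these operands differently.

```python
import math


def relu(x):
    """ReLU activation: clamp negative values to zero."""
    return max(0.0, x)


def gelu(x):
    """GELU activation, using the exact erf-based definition."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gemm_fused_epilogue(A, B, C, bias, alpha=1.0, beta=0.0, activation=relu):
    """Hypothetical reference: D = activation(alpha * (A @ B) + beta * C + bias).

    A is M x K, B is K x N, C is M x N (lists of lists of floats).
    bias has length N and is broadcast across all rows (an assumption
    about how a row broadcast would look; not taken from CUTLASS itself).
    """
    M, K, N = len(A), len(B), len(B[0])
    D = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            # Mainloop: accumulate the dot product for output element (i, j).
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            # Epilogue: scale, add the source matrix and bias, then activate.
            # A fused kernel does this before writing D, avoiding a second pass.
            D[i][j] = activation(alpha * acc + beta * C[i][j] + bias[j])
    return D


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[1.0, 0.0], [0.0, 1.0]]
    C = [[0.0, 0.0], [0.0, 0.0]]
    bias = [-2.0, 1.0]
    print(gemm_fused_epilogue(A, B, C, bias, activation=relu))
    print(gemm_fused_epilogue(A, B, C, bias, activation=gelu))
```

The point of fusing is that the bias add and activation happen while the accumulator is still on chip, so an unfused reference like this one is mainly useful for checking a fused kernel's output numerically.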