Auto-generation of CUTLASS Extension Kernel Templates (pytorch#2932)
Summary:
X-link: facebookresearch/FBGEMM#33

Pull Request resolved: pytorch#2932

This diff allows cutlass_extension to use configuration-based auto-instance generation. The diff aims to achieve the following:

(a) Many kernels need to be instantiated with varying template arguments, and it is impractical to instantiate them all by hand.
(b) Use and extend the OSS NVIDIA scripts for FBGEMM (Meta AI) use cases.
(c) Conform to CUTLASS's device-side API so that we can perturb all the template parameters that CUTLASS allows.
(d) Bullets (b) and (c) bring our internal usage closer to NVIDIA/CUTLASS, allowing us to upstream our kernels quickly to the NVIDIA/CUTLASS repo.
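To illustrate the idea of configuration-based auto-instance generation described above, here is a minimal, hypothetical Python sketch in the spirit of NVIDIA/CUTLASS's generator scripts. The parameter names, shapes, and emitted C++ type names are illustrative assumptions, not the actual FBGEMM or CUTLASS configuration space.

```python
import itertools

# Hypothetical configuration space; the real generator enumerates many more
# CUTLASS template parameters (schedules, layouts, epilogues, ...).
TILE_SHAPES = [(128, 128, 64), (64, 128, 64)]
CLUSTER_SHAPES = [(1, 1, 1), (2, 1, 1)]
DTYPES = [("e4m3", "e4m3", "bf16")]  # (A dtype, B dtype, output dtype)

def emit_instances():
    """Enumerate the template-parameter space and emit one C++
    instantiation string per configuration."""
    instances = []
    for (tm, tn, tk), (cm, cn, ck), (a, b, c) in itertools.product(
        TILE_SHAPES, CLUSTER_SHAPES, DTYPES
    ):
        name = f"gemm_{a}_{b}_{c}_{tm}x{tn}x{tk}_{cm}x{cn}x{ck}"
        decl = (
            f"using {name} = Gemm<"
            f"Tile<{tm},{tn},{tk}>, Cluster<{cm},{cn},{ck}>, "
            f"{a}_t, {b}_t, {c}_t>;"
        )
        instances.append((name, decl))
    return instances

for _, decl in emit_instances():
    print(decl)
```

Generating instances as a cartesian product of per-axis configuration lists is what makes hand-writing each kernel unnecessary: adding one tile shape automatically produces instantiations for every cluster shape and dtype combination.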

Reviewed By: ipiszy

Differential Revision: D60171966

fbshipit-source-id: 8dfd80223a7c40c79446a50b93c87bf339e7596a
manishucsd authored and facebook-github-bot committed Aug 26, 2024
1 parent d693267 commit de845bf
Showing 22 changed files with 1 addition and 2,656 deletions.
2 changes: 1 addition & 1 deletion fbgemm_gpu/experimental/gen_ai/bench/quantize_ops.py
@@ -514,7 +514,7 @@ def quantize(self, x, w):
         return xq, wq, x_scale, w_scale

     def compute(self, xq, wq, x_scale, w_scale):
-        return torch.ops.fbgemm.f8f8bf16_v2(xq, wq, x_scale * w_scale)
+        return torch.ops.cutlass_extensions.f8f8bf16(xq, wq, x_scale * w_scale)

     def quantize_and_compute(self, x, w):
         xq, wq, x_scale, w_scale = self.quantize(x, w)
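Note that both the old and new ops receive the single folded scale `x_scale * w_scale`. A minimal pure-Python sketch of why folding the two dequantization scales into one product works (assumed semantics for illustration, not the fbgemm/cutlass_extensions kernel itself):

```python
# Assumed FP8-style semantics: quantization stores integer codes plus a
# per-tensor scale, and the matmul epilogue dequantizes with the single
# folded scale x_scale * w_scale, which is why the benchmark passes
# their product as one argument.

def quantize(vals, scale):
    # Toy "quantize": divide by the scale and round to an integer code.
    return [round(v / scale) for v in vals]

def dot_dequant(xq, wq, combined_scale):
    # Integer dot product, then a single multiply by the folded scale.
    return sum(a * b for a, b in zip(xq, wq)) * combined_scale

x, w = [0.5, 1.0], [2.0, 4.0]
x_scale, w_scale = 0.25, 0.5
xq = quantize(x, x_scale)  # integer codes for x
wq = quantize(w, w_scale)  # integer codes for w
out = dot_dequant(xq, wq, x_scale * w_scale)
```

Because each product term `xq[i] * wq[i]` carries a factor of `1/(x_scale * w_scale)`, one multiply by the combined scale in the epilogue recovers the original dot product exactly (here `out == 5.0`, matching `0.5*2.0 + 1.0*4.0`).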
