Matthew Nicely edited this page May 15, 2022 · 3 revisions
CUTLASS primitives are efficient: when used to construct device-wide GEMM kernels,
they exhibit performance comparable to cuBLAS for scalar GEMM
computations. The above figure shows CUTLASS performance relative to cuBLAS
for large matrix dimensions on an NVIDIA A100, an NVIDIA A2, an NVIDIA Titan V,
and an NVIDIA GeForce RTX 2080 Ti, with CUTLASS compiled using the CUDA 11.5 Toolkit.
Tensor Core operations are implemented using CUDA's `mma` instruction.
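
One way to read "performance relative to cuBLAS" is as a ratio of achieved GEMM throughputs. The sketch below assumes the conventional count of 2·M·N·K floating-point operations per GEMM (one multiply and one add per inner-product term, ignoring the epilogue). The function names are hypothetical and not part of CUTLASS or cuBLAS, and the page does not state how its figure was computed, so treat this as an illustration rather than the benchmark's actual method.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations for a GEMM with A (M x K) and B (K x N),
// assuming the conventional 2*M*N*K count (epilogue work ignored).
inline double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) * static_cast<double>(k);
}

// Achieved throughput in GFLOP/s for a kernel that ran in `runtime_ms` milliseconds.
inline double gemm_gflops(std::int64_t m, std::int64_t n, std::int64_t k, double runtime_ms) {
    assert(runtime_ms > 0.0);  // a non-positive runtime is not a valid measurement
    const double seconds = runtime_ms * 1.0e-3;
    return gemm_flops(m, n, k) / seconds / 1.0e9;
}

// Hypothetical helper: throughput of one kernel relative to a baseline.
// A value of 1.0 means parity; below 1.0 means slower than the baseline.
inline double relative_performance(double kernel_gflops, double baseline_gflops) {
    assert(baseline_gflops > 0.0);
    return kernel_gflops / baseline_gflops;
}
```

For the same problem size, `relative_performance(gemm_gflops(m, n, k, t_cutlass), gemm_gflops(m, n, k, t_cublas))` reduces to `t_cublas / t_cutlass`, so measured runtimes for the two libraries are the only inputs needed.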