Prerequisites
Feature Description
Generalize the kernel in mmq.cuh:
- Optionally use FP16 activations instead of q8_1.
- Implement a function for loading FP16 data from src0.
- Implement a function for FP16, FP16 -> FP32 matrix multiplication.
- Move and rename some parts of the code to different headers for better organization; this should be done in a follow-up PR.
The same implementation can be done for BF16 and FP32. For FP8 and NVFP4 we should first standardize how to treat them in ggml and add CPU support.
This feature should be tackled by experienced CUDA programmers only.
Motivation
I wrote a matrix multiplication CUDA kernel for quantized data that currently resides in mmq.cuh. I wrote it because it allows for fused matrix multiplication using custom data types, since a library like cuBLAS only supports standard data types like FP16. I don't think it would be worthwhile to try to compete with cuBLAS for such data types when it comes to conventional GEMM. However, at some point I added optimizations for MoE models which I think will result in better performance than cuBLAS.
Possible Implementation
No response