Feature Request: generalize MMQ CUDA kernel for floating-point data #18864

@JohannesGaessler

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Generalize the kernel in mmq.cuh:

  • Optionally use FP16 activations instead of q8_1.
  • Implement function for loading FP16 data from src0.
  • Implement function for FP16, FP16 -> FP32 matrix multiplication.
  • Move and rename some parts of the code to different headers to organize it better; this should be done in a follow-up PR.

The same implementation can be done for BF16 and FP32; for FP8 and NVFP4 we should first standardize how to treat them in ggml and add CPU support.

This feature should only be tackled by experienced CUDA programmers.

Motivation

I wrote a matrix multiplication CUDA kernel for quantized data that currently resides in mmq.cuh. I wrote it because it allows for a fused matrix multiplication using custom data types, since a library like cuBLAS only supports standard data types such as FP16. I don't think it would be worthwhile to try and compete with cuBLAS on such data types when it comes to conventional GEMM. However, at some point I added optimizations for MoE models which I think will result in better performance than cuBLAS.

Possible Implementation

No response
