Feature Request: generalize MMQ CUDA kernel for floating-point data #18864

@JohannesGaessler

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Generalize the kernel in mmq.cuh:

  • Optionally use FP16 activations instead of q8_1.
  • Implement function for loading FP16 data from src0.
  • Implement function for FP16, FP16 -> FP32 matrix multiplication.
  • Move and rename some parts of the code to different headers to organize it better; this should be done in a follow-up PR.

The same implementation can be done for BF16 and FP32; for FP8 and NVFP4 we should first standardize how to treat them in ggml and add CPU support.

This feature should only be tackled by experienced CUDA programmers.

Motivation

I wrote a matrix multiplication CUDA kernel for quantized data that currently resides in mmq.cuh. I wrote it because it allows for a fused matrix multiplication using custom data types, since a library like cuBLAS only supports standard data types such as FP16. I don't think it would be worthwhile to try and compete with cuBLAS on such data types when it comes to conventional GEMM. However, at some point I added optimizations for MoE models which I think will result in better performance than cuBLAS.

Possible Implementation

No response
