Metal implementation for the trellis quants #475
Merged
Conversation
added 8 commits
May 30, 2025 07:52
Performance is actually quite decent: 52 t/s on my M2-Max for Llama-3.1-8B.
Performance is not as good as iq2_kt: 40 t/s on my M2-Max for Llama-3.1-8B. Flipping signs is a costly affair.
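A hypothetical NumPy sketch (not the actual ik_llama.cpp kernel) of where the sign-flipping cost comes from, assuming signs are packed one bit per element in a bitmask: every element needs an extra bit-extract and negate on top of the magnitude lookup, which adds up in a GEMV inner loop.

```python
import numpy as np

# Hypothetical sketch (not the actual ik_llama.cpp kernel): the extra
# per-element work sign flipping adds to a dequantize/GEMV inner loop,
# assuming signs are stored one bit per element in a packed bitmask.
def apply_signs(mag, sign_bits):
    """Negate mag[i] wherever bit i of sign_bits is set."""
    signs = 1.0 - 2.0 * ((sign_bits >> np.arange(mag.size)) & 1)
    return mag * signs

mags = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
print(apply_signs(mags, 0b0101))  # elements 0 and 2 negated
```

The names here (`apply_signs`, the bitmask layout) are illustrative only; the point is the extra shift/mask/negate per element that a pure-magnitude format like iq2_kt avoids.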
Nexesenex pushed a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Jun 2, 2025:
* iq2_kt: Metal dequantize
* iq2_kt: Metal GEMV. Performance is actually quite decent: 52 t/s on my M2-Max for Llama-3.1-8B.
* iq3_kt: Metal dequantize
* iq3_kt: Metal GEMV. Performance is not as good as iq2_kt: 40 t/s on my M2-Max for Llama-3.1-8B. Flipping signs is a costly affair.
* iq4_kt: Metal dequantize - getting NaNs
* iq4_kt: Metal GEMV - also not working
* iq4_kt: Metal still not working
* Disable iq4_kt on Metal for now

Trellis quants: faster CPU prompt processing (ikawrakow#482)

* Experimenting with dequant + f32 GEMM. For iq4_kt this results in a massive PP improvement from PP512 = ~42 t/s to PP512 = 128 t/s.
* Experimenting with dequant + f32 GEMM:
  iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
  iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s
* Experimenting with dequant + f16 GEMM on NEON:
  iq2_kt: PP512 = 79 t/s from 42 t/s
  iq3_kt: PP512 = 81 t/s from 35 t/s
  Also, found the reason why the f16 implementation for iq4_kt was not working: it overflows. It works after multiplying with the row scale before doing the multiply-adds.
* Experimenting with dequant + f16 GEMM on NEON:
  iq4_kt: PP512 = 86 t/s from 29 t/s
* Minor

Minor (~2%) iq2_ks TG performance improvement on CUDA (ikawrakow#468)

Direct conversion from fp16 to Q6_0
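The f16 overflow mentioned above can be reproduced in a few lines. This is a NumPy sketch, not the NEON code: fp16 tops out at 65504, so accumulating raw products of large dequantized weights overflows to inf, while folding the (small) row scale into one operand before the multiply-adds keeps the running sum in range. The values and the `row_scale` name are illustrative assumptions.

```python
import numpy as np

# Sketch of the iq4_kt f16 GEMM failure mode: fp16 max is 65504,
# so a dot product of large raw values overflows the accumulator.
w = np.full(256, 32.0, dtype=np.float16)  # dequantized weights (illustrative)
x = np.full(256, 32.0, dtype=np.float16)  # activations (illustrative)
row_scale = np.float16(1.0 / 64)          # hypothetical per-row scale

# Naive: accumulate raw products in fp16 -> overflows to inf
acc_naive = np.float16(0.0)
for wi, xi in zip(w, x):
    acc_naive = np.float16(acc_naive + wi * xi)

# Fixed: apply the row scale before the multiply-adds
acc_fixed = np.float16(0.0)
for wi, xi in zip(w, x):
    acc_fixed = np.float16(acc_fixed + (wi * row_scale) * xi)

print(acc_naive, acc_fixed)  # inf vs. a finite sum
```

The same reasoning carries over to the NEON f16 path: pre-scaling one operand shrinks every partial product, so the fp16 accumulator never leaves its representable range.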
IQ2_KT and IQ3_KT work. IQ2_KT has pretty decent performance. IQ4_KT is not working, so a draft PR for now. IQ4_KT is disabled for now as there is a bug that I haven't been able to find.