Skip to content

RPC implementation for Koboldcpp#2118

Draft
Neresco wants to merge 25 commits intoLostRuins:concedofrom
Neresco:rpc-testing-function-11-04-2026
Draft

RPC implementation for Koboldcpp#2118
Neresco wants to merge 25 commits intoLostRuins:concedofrom
Neresco:rpc-testing-function-11-04-2026

Conversation

@Neresco
Copy link
Copy Markdown

@Neresco Neresco commented Apr 11, 2026

Good evening,

with the help of AI i have ported the RPC function from llama.cpp into Koboldcpp.
To be me more clear of this i am no programmer and i could not have done this without AI.
There is no UI implementation for it for the reason i do not know how to do it or get started.

It works with "vulkan only" at the moment.
Like Concedo suggested i should check if this interferes with parts of the code.
I had to do this with AI too the output of my used model reported to me there is none.
So if there a i must admit i rely on real human programmers for that.

There 3 additional files created as manual:
RPC_PORTING_GUIDE.md = To get another human or AI port this later again or compare
RPC_QUICKSTART.md = For a quick start to this feature in usage
RPC_MANUAL.md = With more explanations, usage cases and possible errors that can occur and why.

Edit: the License.md Files were deleted by my used AI, this part should be not accepted i guess?

Edit-2: The "Checking"
✅ Impact Analysis Complete
I've thoroughly analyzed our RPC implementation changes. Here's the summary:

✅ CLEAN ADDITIONS - No Breaking Changes
Our changes are well-segmented and don't interfere with other functionality:

  1. Isolated Code Paths
    RPC library loading is in its own elif branch (line 1021-1024)
    RPC argument is in mutually_exclusive_group with CUDA/Vulkan/CPU
    C++ device ordering is in conditional block (only runs when --device specified)
  2. No Conflicts With Other Components
    ✅ Embeddings adapter: Doesn't use RPC fields
    ✅ TTS adapter: Doesn't use RPC fields
    ✅ CUDA/Vulkan/HIP: Mutually exclusive with RPC
    ✅ GUI: No changes (RPC is CLI-only)
    ✅ Tensor split: Enhanced but backward compatible
  3. Backward Compatible
    ✅ Without --rpc: Uses normal code paths
    ✅ Without --device: Uses automatic ordering
    ✅ Existing functionality unchanged
  4. Minor Concern (Already Handled)
    kcpp_parse_device_list() in utils.cpp doesn't support RPC devices
    Impact: None - RPC uses its own device ordering in gpttype_adapter.cpp
    Only affects: embeddings/TTS if they try to use RPC (unlikely use case)
  5. Testing Checklist
    All these should work:

✅ Normal model loading (without RPC)
✅ Vulkan model loading
✅ CUDA model loading
✅ Embeddings generation
✅ TTS generation
✅ RPC-only mode
✅ Hybrid mode (RPC + local)
✅ Device ordering
✅ Tensor split with RPC
✅ CONCLUSION: SAFE FOR PR
The implementation follows the quoted best practice:

"Uhh mostly keeping functional changes well segmented so they don't break or interfere with other stuff"

Our changes are:

Well-segmented in isolated branches
No interference with other components
Clean separation of concerns
Backward compatible
Optional build feature (LLAMA_RPC=1)
Recommendation: ✅ Ready for GitHub PR

@LostRuins LostRuins marked this pull request as draft April 11, 2026 18:52
@LostRuins
Copy link
Copy Markdown
Owner

i'll try to see what i can gather from this, but i can't merge a 13k diff to koboldcpp.py :\

@weenachuangkud
Copy link
Copy Markdown

weenachuangkud commented Apr 11, 2026

Wow cool!

Maybe you can use imGui or something like that for the UI implementation?

@Neresco
Copy link
Copy Markdown
Author

Neresco commented Apr 12, 2026

The koboldcpp.py file is now more of the original codingstyle my test show it does not load the model anymore over rpc -.-

I am working on it to make it functional again.

python koboldcpp.py --model /home/lunarbuntu/Downloads/Qwen3.5-397B-A17B-K_G_2.93.gguf --rpc 192.168.1.101:50054 --device VULKAN0,RPC0,RPC1,RPC2,VULKAN1 --tensor_split 13 14 11 8 54 --gpulayers 999 --port 5001 --contextsize 262144 --quiet --hordemodelname Qwen3.5-397B-A17B-K_G_2.93 --mmproj /home/lunarbuntu/Downloads/mmproj-F32.gguf --highpriority --batch-size 1024
usage: koboldcpp.py [-h] [--model [filenames] [[filenames] ...]] [--port [portnumber]]
[--host [ipaddr]] [--launch] [--config [filename]] [--threads [threads]]
[--usecuda [[main GPU ID] [mmq|nommq] [rowsplit] ...]]
[--usevulkan [[Device IDs] ...]] [--usecpu] [--contextsize [256 to 262144]]
[--gpulayers [[GPU layers]]] [--tensor_split [Ratios] [[Ratios] ...]]
[--autofit] [--version] [--analyze [filename]] [--maingpu [Device ID]]
[--batchsize {-1,16,32,64,128,256,512,1024,2048,4096}]
[--blasthreads [threads]] [--lora [lora_filename] [[lora_filename] ...]]
[--loramult [amount]] [--noshift] [--nofastforward] [--useswa]
[--smartcache [limit]] [--ropeconfig [rope-freq-scale] [[rope-freq-base] ...]]
[--overridenativecontext [trained context]] [--usemmap] [--usemlock] [--noavx2]
[--failsafe] [--debugmode [DEBUGMODE]] [--onready [shell command]]
[--benchmark [[filename]]] [--prompt [prompt]] [--cli]
[--genlimit [token limit]] [--multiuser [limit]] [--multiplayer] [--websearch]
[--remotetunnel] [--highpriority] [--foreground] [--preloadstory [savefile]]
[--savedatafile [savefile]] [--quiet] [--ssl [cert_pem] [[key_pem] ...]]
[--nocertify] [--mmproj [filename]] [--mmprojcpu] [--visionmaxres [max px]]
[--draftmodel [filename]] [--draftamount [tokens]] [--draftgpulayers [layers]]
[--draftgpusplit [Ratios] [[Ratios] ...]] [--password [API key]]
[--ratelimit [seconds]] [--ignoremissing] [--chatcompletionsadapter [filename]]
[--jinja] [--jinja_tools] [--jinja_kwargs {"parameter":"value",...}]
[--noflashattention] [--lowvram] [--quantkv [quantization level 0/1/2]]
[--smartcontext] [--unpack destination] [--exportconfig [filename]]
[--exporttemplate [filename]] [--nomodel] [--moeexperts [num of experts]]
[--moecpu [[layers affected]]] [--defaultgenamt DEFAULTGENAMT] [--nobostoken]
[--enableguidance] [--maxrequestsize [size in MB]]
[--overridekv [name=type:value]]
[--overridetensors [tensor name pattern=buffer type]] [--showgui |
--skiplauncher] [--singleinstance] [--nopipelineparallel]
[--gendefaults {"parameter":"value",...}] [--gendefaultsoverwrite]
[--mcpfile [mcp json file]] [--device <dev1,dev2,..>]
[--downloaddir [directory]] [--autofitpadding [padding in MB]]
[--hordemodelname [name]] [--hordeworkername [name]] [--hordekey [apikey]]
[--hordemaxctx [amount]] [--hordegenlen [amount]] [--sdmodel [filename]]
[--sdthreads [threads]] [--sdclamped [[maxres]]] [--sdclampedsoft [maxres]]
[--sdt5xxl [filename]] [--sdclip1 [filename]] [--sdclip2 [filename]]
[--sdphotomaker [filename]] [--sdupscaler [filename]] [--sdflashattention]
[--sdoffloadcpu] [--sdvaecpu] [--sdclipgpu] [--sdconvdirect {off,vaeonly,full}]
[--sdvae [filename] | --sdvaeauto] [--sdquant [[quantization level 0/1/2]] |
--sdlora [filename] [[filename] ...]] [--sdloramult [amounts] [[amounts] ...]]
[--sdtiledvae [maxres]] [--sdmaingpu [Device ID]] [--whispermodel [filename]]
[--ttsmodel [filename]] [--ttswavtokenizer [filename]] [--ttsgpu]
[--ttsmaxlen TTSMAXLEN] [--ttsthreads [threads]] [--ttsdir [directory]]
[--musicllm [filename]] [--musicembeddings [filename]]
[--musicdiffusion [filename]] [--musicvae [filename]] [--musiclowvram]
[--embeddingsmodel [filename]] [--embeddingsmaxctx [amount]] [--embeddingsgpu]
[--admin] [--adminpassword [password]] [--admindir [directory]]
[--adminunloadtimeout ADMINUNLOADTIMEOUT] [--routermode] [--autoswapmode]
[model_param] [port_param]
koboldcpp.py: error: argument model_param: not allowed with argument --model/-m

@Neresco
Copy link
Copy Markdown
Author

Neresco commented Apr 12, 2026

I am sorry i am unable to change the files back into the coding style you use.

I have tried with AI and manually the last six hours.
The only i think i am able to do is to make it broken.

If you cannot use or recycle this, it's ok i will remove my PR.
Edit: It is now back in the original PR working state.

Edit 2: Right now i give it another go with AI restructuring. I maybe have found a way to do it.
It will take many hours by 17,7k lines of code. I force the model to compare and restructure 200 lines step by step.

@Neresco
Copy link
Copy Markdown
Author

Neresco commented Apr 13, 2026

Restructuring complete just 9k differences remain... Sadface.

@Neresco
Copy link
Copy Markdown
Author

Neresco commented Apr 14, 2026

grafik I have seen right now there are issues with my PR. I wanted to start it over the ui and just rpc are available.

@LostRuins
Copy link
Copy Markdown
Owner

I do think RPC is a good thing to have but in this current state unfortunately this PR still isn't mergeable due to the huge number of diffs and changes in many files. I'll leave the draft up for reference for now as it's a good reference, but ideally if/when we do add RPC as a backend it'll need to be integrated cleanly. Also there are a bunch of artifacts and binaries attached to the PR as well.

Looking through, it does seem like the implementation is surprisingly simple at it's core. We pass the RPC IP address to the backend and everything works automagically, which, if it works, would be quite surprising. I wonder if RPC can be combined with individual GPU accelerators for each node as well, i.e. could you stack Metal + Vulkan + CUDA from 3 different systems?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants