Hi, I haven’t seen anybody do what the title says. Idk, maybe everyone does this already :) But if not, I want to show off a little. Here are my specs:
12th Gen Intel® Core™ i5-12450H (12)
GPU 1: NVIDIA GeForce RTX 3050 Mobile [Discrete]
GPU 2: Intel UHD Graphics @ 1.20 GHz [Integrated]
16GB RAM DDR4
Running on CachyOS (Arch Linux), because on Windows, as my own tests showed, speed is lower (Gemma 4 E4B: 40t/s on Linux vs 30t/s on Windows).
I used UD-IQ4_NL quant version (13.4GB), as it seems like the best compromise between quality and size.
Using the ik_llama.cpp fork for its MoE and hybrid CPU + GPU optimizations.
These are the flags I use:
"$LLAMA_SERVER" \
  -m "$MODEL_PATH" \
  -ngl 99 \
  -c 8000 \
  -fa on \
  -ctk iq4_nl \
  -ctv iq4_nl \
  --parallel 1 \
  -nkvo \
  -t 8 \
  -tb 8 \
  -b 256 \
  -ub 256 \
  -rtr \
  -amb 512 \
  --no-mmap \
  --jinja \
  -mla 2 \
  --cpu-moe \
  --mlock \
  --reasoning off
So the batch size is very small; even 512 causes OOM. Prefill can take a while once the context gets bigger.
Not all of the flags actually do something; I just tried everything I found that might help.
The ones doing the most are --cpu-moe (offloading the experts to RAM), the small batch size, and -nkvo (keeping the KV cache in RAM).
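Since -nkvo keeps the KV cache in system RAM instead of VRAM, it's worth a back-of-envelope estimate of how big that cache gets. A quick sketch; the layer/head numbers below are made-up placeholders, not the real model's config:

```python
# Rough KV-cache size estimate for a transformer at a given context length.
# n_layers / n_kv_heads / head_dim are HYPOTHETICAL example values.
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bits_per_elem):
    # K and V each hold ctx * n_kv_heads * head_dim elements per layer,
    # hence the factor of 2; bits are converted to bytes at the end.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bits_per_elem / 8

# Example: 8k context, 48 layers, 8 KV heads of dim 128
fp16 = kv_cache_bytes(8000, 48, 8, 128, 16)
q4   = kv_cache_bytes(8000, 48, 8, 128, 4.5)  # IQ4_NL is roughly 4.5 bits/elem
print(f"fp16 cache: {fp16 / 2**30:.2f} GiB, iq4_nl cache: {q4 / 2**30:.2f} GiB")
```

With those (assumed) dims, quantizing the cache to iq4_nl shrinks it from ~1.5 GiB to well under 0.5 GiB, which is why -ctk/-ctv iq4_nl plus -nkvo makes the whole setup fit.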
Result (you can see the token speed) is in the screenshot.
15t/s - MoE architecture saves the day!
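The reason MoE saves the day: at batch size 1, decode speed is roughly bounded by how many weight bytes have to stream from RAM per token, and an MoE only reads its active experts. A rough sketch; the active-parameter count and RAM bandwidth below are illustrative assumptions, not measured values:

```python
# Back-of-envelope decode-speed bound for RAM-bandwidth-limited inference.
# All numbers are ILLUSTRATIVE assumptions for a 26B-class MoE.
total_params  = 26e9     # total parameters
active_params = 3e9      # parameters touched per token (assumed)
bytes_per_w   = 4.5 / 8  # ~IQ4_NL: about 4.5 bits per weight
ram_bw        = 50e9     # bytes/s, rough dual-channel DDR4 figure

dense_bound = ram_bw / (total_params * bytes_per_w)   # read ALL weights
moe_bound   = ram_bw / (active_params * bytes_per_w)  # read active experts only
print(f"dense upper bound: {dense_bound:.1f} t/s")
print(f"MoE upper bound:   {moe_bound:.1f} t/s")
```

Under these assumptions a dense 26B would cap out at a few t/s from RAM, while the MoE's ceiling is an order of magnitude higher, which is consistent with the observed 15t/s once GPU offload and overheads are factored in.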
As a result:
- Chat quality is great. Facts are solid, and instruction following is great too
- Sadly, the model is bad at agentic tasks
A great model on a merely mid-range device with limited VRAM, and it proves (at least to myself) that 26B models don’t need 16GB of VRAM to run PROPERLY.
The main problem now is the usable context window and prefill speed. At 8k the speed is 10t/s. Waiting for the author of ik_llama.cpp to implement turboquant to help solve the problem. Luckily, he is already working on that.
PS. Tried running qwen3.6 35B. Again, the size is the main problem. Used the Apex-i-mini version (14GB). It runs successfully at 20t/s, but quality is really bad. Will try to max out what I can with the UD_IQ4_NL quantisation.
UPD: UD_IQ4_NL is too big; trying APEX-COMPACT.
These MoE models are great speed-wise. Halve your 15T/s and you can run it entirely without a graphics card on an old computer. At least mine, which is several generations older, manages 6-7 tokens a second entirely on CPU. I guess that’s a bit slow for some agent to burn 1M tokens on a very basic programming project… But it’s enough to chat and ask questions, I guess?
Yeah, 6-7 is slow (for me personally, even for chat), but 15 feels great. Strangely, it can even run faster as generation progresses. KV cache hits, I guess.
I tried to create my own optimised version of a coding agent, and it even performs relatively well, but for programming it is surely slow. It would be OK if it got all the code right on the first try, but it doesn’t. That’s not a model problem; even cloud agents make mistakes, but thanks to their high speed they can fix them fast. For chat, though, it’s great.
It took me until now to finally dabble in these coding agents, and I didn’t realize at all how many tokens they burn through. I let one write a basic HTML & JavaScript browser game with some free OpenRouter model. I’ve done this before, just told a model to one-shot it in a single file. This time I tried OpenCode: let it ask me a few questions, come up with a plan, and build an entire project structure… And it hit one million tokens way faster than I expected. If my math is correct, that’d take my computer 2 days and nights straight at 6T/s 👀
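The math does check out; a quick sanity check of the arithmetic:

```python
# Sanity-check the "two days for a million tokens" estimate.
tokens = 1_000_000
tps = 6  # tokens per second on an older CPU-only machine
seconds = tokens / tps
print(f"{seconds / 3600:.1f} hours, i.e. about {seconds / 86400:.1f} days")
# → 46.3 hours, i.e. about 1.9 days
```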
Guess it’s really a bit (too) slow.
If you have any advice on how to run it better, I’ll appreciate that!
Your experience matches mine: it’s great to chat with (it was able to identify some paintings for me), but not great at agentic tasks.
I was hoping for a 120B MoE with ~3-4B active params, but for 26B it’s great.

