Hi, I haven't seen anybody do what the title says. Idk, maybe everyone does this already nowadays :) But if not, I want to show off a little. Here are my specs:
12th Gen Intel® Core™ i5-12450H (12 threads)
GPU 1: NVIDIA GeForce RTX 3050 Mobile [Discrete]
GPU 2: Intel UHD Graphics @ 1.20 GHz [Integrated]
16 GB DDR4 RAM
Running on CachyOS (Arch Linux), because on Windows, as my own tests showed, speed is lower (Gemma 4 E4B: 40 t/s on Linux vs 30 t/s on Windows).
I used the UD-IQ4_NL quant (13.4 GB), as it seems like the best compromise between quality and size.
I'm using the ik_llama.cpp fork because of its MoE and CPU+GPU hybrid inference optimizations.
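If you haven't built it yet, it's the same cmake routine as mainline llama.cpp. A minimal sketch, assuming the CUDA toolkit and cmake are installed:

git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# the server binary lands in build/bin/llama-server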
These are the flags I use:
"$LLAMA_SERVER" \
  -m "$MODEL_PATH" \
  -ngl 99 \
  -c 8000 \
  -fa on \
  -ctk iq4_nl \
  -ctv iq4_nl \
  --parallel 1 \
  -nkvo \
  -t 8 \
  -tb 8 \
  -b 256 \
  -ub 256 \
  -rtr \
  -amb 512 \
  --no-mmap \
  --jinja \
  -mla 2 \
  --cpu-moe \
  --mlock \
  --reasoning off
So the batch size is very small; even 512 causes an OOM. Prefill can take a while once the context grows.
Not all of the flags actually do something, I just tried everything I found that might help.
The ones doing the most: --cpu-moe (offloading the experts to RAM), the small batch size, and -nkvo (keeping the KV cache in RAM).
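For reference, the two variables at the top of the command are just paths. A minimal wrapper sketch (the paths here are hypothetical, adjust to your setup):

#!/usr/bin/env bash
# Hypothetical paths -- point these at your own build and model
LLAMA_SERVER="$HOME/ik_llama.cpp/build/bin/llama-server"
MODEL_PATH="$HOME/models/your-model-UD-IQ4_NL.gguf"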
Result (you can see the token speed) is on the screenshot.
15 t/s: the MoE architecture saves the day!
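If you want to sanity-check it without a UI, the server speaks the usual OpenAI-style API. A quick smoke-test sketch (host and port assumed to be the defaults, 127.0.0.1:8080):

# Minimal request against the local server
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hi in one sentence."}], "max_tokens": 64}'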
The results:
- Chat quality is great. Facts are solid, instruction following is great too
- Sadly, the model is bad at agentic tasks
A great model on just a medium-class device with limited VRAM, and proof (at least to myself) that 26B models don't need 16 GB of VRAM to run PROPERLY.
The main problems now are the usable context window and prefill speed. At 8k context the speed is 10 t/s. I'm waiting for the author of ik_llama.cpp to implement turboquant, which should help solve this. Luckily, he is already working on it.
PS. Tried running Qwen3.6 35B. Again, size is the main problem. Used the Apex-i-mini version (14 GB). It runs successfully at 20 t/s, but the quality is really bad. I will try to max out what I can with the UD_IQ4_NL quantisation.
UPD: UD_IQ4_NL is too big, trying APEX-COMPACT.
UPD 2: With a bit of tweaking here and there I balanced memory consumption between VRAM and RAM, and the APEX-COMPACT version of Qwen3.6 35B… attention… BLASTED out 30 tokens per second! That's just wow. The problem now is that there is only 100 MB of RAM left and I can't even open the browser…
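For anyone curious what "tweaking" means: the usual knob is tensor overrides, i.e. keeping the experts of a few layers on the GPU instead of blanket-offloading all of them with --cpu-moe. A sketch only; the regexes and the layer count are made-up examples, not my exact values:

# Keep the experts of layers 0-3 on the GPU, push the rest to RAM.
# -ot rules are regex=buffer pairs; earlier rules should take
# precedence over later ones (an assumption worth verifying)
"$LLAMA_SERVER" -m "$MODEL_PATH" -ngl 99 \
  -ot "blk\.[0-3]\.ffn_.*_exps=CUDA0" \
  -ot "exps=CPU"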
So for now, I connect to the local server from my phone. And yeah: 30 t/s. That's crazy. But no room for context, really… Need to figure something out…
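Connecting from the phone just means binding the server to the LAN instead of localhost; --host and --port are standard llama-server flags:

# Same flags as above, plus binding to all interfaces
"$LLAMA_SERVER" -m "$MODEL_PATH" --host 0.0.0.0 --port 8080
# then open http://<laptop-lan-ip>:8080 from the phone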
Last update, and closing the thread: with Qwen3.6 35B I turned off the prompt cache. I haven't noticed any difference in speed, but RAM is kinda free now (at least 500-700 MB). Maybe with it turned on the speed would hold up better, but who cares, since I don't have the RAM to run big contexts anyway. Final results: great quality answers, speed is 30 t/s, dropping to 20 at 4k context. That's kinda nuts. Now my laptop can be used as an inference server. No work on the laptop itself, though. Waiting for more new quantisation techniques (smaller model sizes, smaller KV cache) and it will get even better.
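About the prompt cache: in mainline llama.cpp the server also lets you toggle it per request via the cache_prompt field on the native /completion endpoint. Assuming the fork kept that field, it looks like this (a sketch):

# Per-request prompt-cache toggle (cache_prompt is a mainline
# llama.cpp server field; assuming ik_llama.cpp kept it)
curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "n_predict": 32, "cache_prompt": false}'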
I hope this was useful to somebody. Can't wait to have Claude Code in my pocket :)


Maybe I'll make more of these later. I killed a lot of time getting this to work, but my family and main job are still calling :)