System & Software Stack
- Hardware: ASUS Zenbook 15 UM3504DA | AMD Ryzen 7 7735U (8C/16T) | Radeon 680M iGPU (512 MB BIOS-limited VRAM) | 32 GB LPDDR5 RAM
- OS: CachyOS (Arch Linux) | Wayland + Niri compositor
- Runtime: llama.cpp custom Vulkan build | llama-server with preset routing
- Deployment Scope: Single-user local inference | 2–3 year static configuration window
Build Configuration
The binary is compiled with hardware-aware optimizations and server/tooling support. Each flag addresses a specific constraint or capability of the target platform.
cmake .. \
-DGGML_NATIVE=ON \
-DGGML_OPENMP=ON \
-DGGML_VULKAN=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_BUILD_TOOLS=ON
| Flag | Purpose | Measured Impact |
|---|---|---|
| GGML_NATIVE=ON | Enables CPU-specific ISA extensions (AVX2; AVX-512 where available) | +10–15% prompt throughput on Zen 3+ cores |
| GGML_OPENMP=ON | Parallelizes prompt processing across available cores | Required for batched CPU inference |
| GGML_VULKAN=ON | GPU acceleration backend | Mandatory for the Rembrandt iGPU. ROCm unsupported. CUDA inapplicable. |
| CMAKE_INTERPROCEDURAL_OPTIMIZATION=ON | Link-time optimization | Reduces binary size, improves instruction cache locality |
| LLAMA_BUILD_SERVER=ON | Compiles the HTTP server with an OpenAI-compatible API | Enables remote UI and agent routing |
| LLAMA_BUILD_TOOLS=ON | Enables structured function calling | Required for agentic task execution |
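For reference, the remaining build step and a quick backend check might look like the following; the vulkaninfo call (from the vulkan-tools package) and the binary paths are illustrative, not part of the locked configuration:

# from the build directory, after the cmake configuration above
cmake --build . -j"$(nproc)"

# confirm the Radeon 680M is visible to the Vulkan loader before first launch
vulkaninfo --summary | grep -i radeon

# server and tool binaries land in build/bin
ls bin/llama-server bin/llama-bench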
Server Launch & Routing Architecture
The server is invoked with strict resource controls to prevent memory thrashing on constrained hardware:
llama-server --port 8080 --host 0.0.0.0 \
--models-preset /mnt/data/ai/models.ini \
--models-max 1 \
--tools all
- --models-max 1: Enforces single-model residency; prevents concurrent RAM/GTT allocation spikes.
- --models-preset: Loads the declarative INI configuration for deterministic parameter application.
- --tools all: Activates full OpenAI-compatible tool/schema support for agent workflows.
- Port 8080 is bound to all interfaces for integration with local UIs (OpenWebUI, Helium) and routing scripts; a quick verification example follows this list.
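Because llama-server speaks the OpenAI-compatible API, the routing setup can be sanity-checked with plain curl; /health and /v1/models are standard llama-server endpoints, and localhost assumes a same-machine session:

# liveness check
curl -s http://localhost:8080/health

# list the models registered by the preset file
curl -s http://localhost:8080/v1/models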
Configuration Architecture (models.ini)
The preset system uses a global-defaults + per-model-override structure. This eliminates runtime flag management, ensures baseline stability across all workloads, and allows precise parameter alignment per model architecture.
version = 1
[*]
; Global defaults - CPU-optimized baseline
seed = -1
top-p = 0.95
top-k = 20
min-p = 0.05
presence-penalty = 0.0
repeat-penalty = 1.1
jinja = true
batch-size = 256
ubatch-size = 256
threads = 8
threads-batch = 8
cpu-range = 0-7
cpu-strict = 1
kv-offload = false
defrag-thold = 0.1
poll = 25
poll-batch = 50
cpu-moe = true
gpu-layers = 0
ctx-size = 16384
Global defaults prioritize CPU affinity, strict thread binding, MoE routing on CPU, and conservative KV cache management. Per-model sections override only the parameters required for their specific workload profile.
Per-Model Profiles & Parameter Rationale
Quick Reasoning: gemma-4-e4b
[gemma-4-e4b]
model = /mnt/data/models/daily/google_gemma-4-E4B-it-Q4_K_M.gguf
temperature = 0.7
reasoning-budget = 256
gpu-layers = 32
ctx-size = 32768
- Purpose: Low-latency code completion, rapid drafting, lightweight Q&A.
- Rationale: 4B MoE fits entirely within GPU offload limits. Extended context (32K) enables long-file navigation.
- reasoning-budget = 256 constrains chain-of-thought to prevent token waste.
- temperature = 0.7 maintains creative variance for ideation tasks (a sample request follows this list).
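A minimal request against this profile is an ordinary chat completion; the model field is assumed here to match the preset section name:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e4b",
    "messages": [{"role": "user", "content": "Draft a short docstring for a CSV parser."}],
    "max_tokens": 256
  }'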
General Purpose: gemma-4-26b (Daily Driver)
[gemma-4-26b]
model = /mnt/data/models/daily/google_gemma-4-26B-A4B-it-IQ4_NL.gguf
temperature = 0.65
repeat-penalty = 1.05
reasoning-budget = 512
batch-size = 512
ubatch-size = 512
defrag-thold = 0.05
gpu-layers = 18
threads = 6
- Purpose: Primary conversational, analytical, and long-form generation workload.
- Rationale: Heavily optimized for sustained throughput and thermal stability. Detailed parameters documented in the following section.
Agentic Router: qwen3.5-9b
[qwen3.5-9b]
model = /mnt/data/models/daily/Qwen_Qwen3.5-9B-Q4_K_M.gguf
temperature = 0.65
top-k = 25
repeat-penalty = 1.05
- Purpose: Function calling, tool selection, structured API routing.
- Rationale:
- top-k = 25 narrows the sampling distribution to improve tool-call determinism.
- Reduced repeat-penalty prevents schema repetition loops.
- Global CPU defaults apply to minimize latency during routing (a tool-call request sketch follows this list).
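A representative routing request uses the standard OpenAI tools schema, which llama-server honors when jinja templating is enabled; the get_weather function below is purely illustrative:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-9b",
    "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'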
Complex Reasoning: qwen3.6-35b
[qwen3.6-35b]
model = /mnt/data/models/daily/Qwen_Qwen3.6-35B-A3B-Q3_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
ctx-size = 8192
- Purpose: Deep analysis, multi-step reasoning, constrained exploration.
- Rationale: 35B MoE requires memory safety limits.
- ctx-size = 8192 prevents GTT saturation.
- presence-penalty = 0.8 forces lexical diversity during long-form generation.
- reasoning-budget = 256 maintains structured output without unbounded context accumulation.
Experimental: lfm2-24b
[lfm2-24b]
model = /mnt/data/models/experimental/LFM2-24B-A2B-Q4_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
- Purpose: Architecture evaluation, quantization testing, parameter isolation.
- Rationale: Mirrors 35B safety guardrails. Kept separate from daily workflows to prevent context contamination or parameter bleed during testing.
Primary Model Optimization: Gemma-4-26B
The 26B MoE profile represents the core optimization target. Parameter selection resulted from systematic empirical testing across offload depth, batch sizing, cache management, and thermal behavior.
| Parameter | Value | Rationale |
|---|---|---|
| gpu-layers | 18 | Measured efficiency sweet spot. Beyond 18 layers, GTT usage exceeds 9.8 GB with diminishing returns (+0.15 t/s per layer). |
| batch-size / ubatch-size | 512 | Increased from global 256. Matches prompt throughput requirements without exceeding KV cache limits. |
| defrag-thold | 0.05 | Aggressive KV cache defragmentation prevents memory fragmentation during long sessions. |
| threads | 6 (override) | Reduced from global 8. Maintains baseline CPU activity to trigger firmware fan curves during GPU-heavy inference. |
| reasoning-budget | 512 | Enforces structured chain-of-thought. Improves cache locality and prevents context bloat. |
| temperature / repeat-penalty | 0.65 / 1.05 | Balances coherence with lexical variation. Lower repeat penalty prevents over-penalization in technical prose. |
Measured Performance:
- CPU-only (0 layers): 9.9 t/s generation | 0 GB GTT
- 18-layer offload: 16.9 t/s generation | 9.8 GB GTT | Stable >50 hrs
- 24-layer offload: 18.6 t/s generation | 12.6 GB GTT | Marginal stability
- Real-world 2,090-token response: 116s → 68s (40% reduction)
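The offload sweep behind these numbers can be reproduced with llama-bench from the same build; the loop below is a sketch, and flag spellings may differ slightly across llama.cpp revisions:

# sweep GPU offload depth; -p/-n set prompt and generation token counts, -t matches the thermal thread override
for ngl in 0 6 12 18 24; do
  ./bin/llama-bench \
    -m /mnt/data/models/daily/google_gemma-4-26B-A4B-it-IQ4_NL.gguf \
    -ngl "$ngl" -t 6 -p 512 -n 128
done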
Hardware Constraints & Empirical Findings
- VRAM Limitation: BIOS locks dedicated VRAM to 512 MB, so GPU offloading immediately utilizes GTT (system RAM mapped as VRAM). amdgpu_top confirms usable VRAM caps at ~450 MB (a monitoring sketch follows this list).
- Offloading Diminishing Returns: Layers 0–6 yield +0.73 t/s per layer. Layers 6–18 yield +0.15–0.33 t/s per layer. Layers 18–24 yield +0.15–0.28 t/s per layer at >0.5 GB GTT cost per layer. Stability degrades past 20 layers.
- Thermal Firmware Constraint: ASUS fan curves respond exclusively to CPU load, so GPU-only inference bypasses thermal regulation. threads = 6 ensures consistent CPU activity to maintain airflow.
- Context Scaling: Generation throughput drops ~27% at 40% context fill due to O(n) attention scanning. 16K context is the practical ceiling for 32 GB RAM with 20B+ models.
- Reasoning Budget as Cache Optimizer: Enforcing explicit reasoning tokens structures KV cache layout, reduces attention fragmentation, and prevents unbounded context accumulation during long sessions.
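GTT and VRAM residency can also be watched outside amdgpu_top via the amdgpu sysfs counters; card0 is an assumption and may differ per machine:

# report GTT and VRAM usage in MiB (the kernel exposes these counters in bytes)
for f in mem_info_gtt_used mem_info_vram_used; do
  printf '%s: %s MiB\n' "$f" "$(( $(cat /sys/class/drm/card0/device/$f) / 1024 / 1024 ))"
done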
Deployment Parameters
This configuration is locked for sustained single-user deployment. No dynamic context routing, no concurrent model loading, no over-engineered orchestration.
Locked Baseline:
- Vulkan + OpenMP + Native ISA compilation
- Global CPU defaults with per-model parameter overrides
- gemma-4-26b at 18 layers (16.9 t/s, 9.8 GB GTT)
- gemma-4-e4b at 32 layers (full GPU offload)
- ctx-size = 16384 default, with model-specific reductions where required
- threads = 6 on the 26B profile for thermal regulation
- Single-model residency enforced via --models-max 1
The stack delivers deterministic throughput, stable memory residency, and predictable thermal behavior within hardware constraints. Configuration changes are restricted to model quantization updates or hardware replacement.


No, not really. Smaller models tend to be worse with more complex topics, so you would have to check their output more closely if you are programming with them. You could use a single small model in a coding assistant, and that can work for some tasks; the coding agent would behave the same, but the results might be worse. The context itself does not have to be narrowed for a smaller model: it can usually process the same context size even faster than a larger model, and because the model is smaller there is room for more context. However, it may need smaller steps and more checking in between to make sure it does not introduce errors that compound over time. Smaller models also tend to be less reliable at calling tools (they more often make mistakes in parameter formats and the like). You can still have a lot of fun with smaller models, and depending on the task they may even be sufficient.