System & Software Stack
- Hardware: ASUS Zenbook 15 UM3504DA | AMD Ryzen 7 7735U (8C/16T) | Radeon 680M iGPU (512 MB BIOS-limited VRAM) | 32 GB LPDDR5 RAM
- OS: CachyOS (Arch Linux) | Wayland + Niri compositor
- Runtime: `llama.cpp` custom Vulkan build | `llama-server` with preset routing
- Deployment Scope: Single-user local inference | 2–3 year static configuration window
Build Configuration
The binary is compiled with hardware-aware optimizations and server/tooling support. Each flag addresses a specific constraint or capability of the target platform.
```bash
cmake .. \
  -DGGML_NATIVE=ON \
  -DGGML_OPENMP=ON \
  -DGGML_VULKAN=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_TOOLS=ON
```
| Flag | Purpose | Measured Impact |
|---|---|---|
| `GGML_NATIVE=ON` | Enables CPU-specific ISA extensions (AVX2/AVX512) | +10–15% prompt throughput on Zen 3+ cores |
| `GGML_OPENMP=ON` | Parallelizes prompt processing across available cores | Required for batched CPU inference |
| `GGML_VULKAN=ON` | GPU acceleration backend | Mandatory for the Rembrandt iGPU; ROCm unsupported, CUDA inapplicable |
| `CMAKE_INTERPROCEDURAL_OPTIMIZATION=ON` | Link-time optimization | Reduces binary size, improves instruction cache locality |
| `LLAMA_BUILD_SERVER=ON` | Compiles the HTTP server with an OpenAI-compatible API | Enables remote UI and agent routing |
| `LLAMA_BUILD_TOOLS=ON` | Enables structured function calling | Required for agentic task execution |
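The configure step above assumes an out-of-tree build directory. A surrounding sequence would look roughly like the sketch below; the repository URL, directory layout, and job count are assumptions rather than the exact commands used here, and a Vulkan build additionally needs the Vulkan headers/loader and `glslc` installed.

```bash
# Fetch the sources and build out-of-tree with the configuration above
# (repository URL, directory layout, and job count are illustrative, not the exact setup)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && mkdir -p build && cd build

# ...run the cmake configuration shown above from this build directory...

cmake --build . --config Release -j"$(nproc)"

# The resulting binaries, including llama-server, land under bin/
./bin/llama-server --version
```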
Server Launch & Routing Architecture
The server is invoked with strict resource controls to prevent memory thrashing on constrained hardware:
```bash
llama-server --port 8080 --host 0.0.0.0 \
    --models-preset /mnt/data/ai/models.ini \
    --models-max 1 \
    --tools all
```
- `--models-max 1`: Enforces single-model residency. Prevents concurrent RAM/GTT allocation spikes.
- `--models-preset`: Loads declarative INI configuration for deterministic parameter application.
- `--tools all`: Activates full OpenAI-compatible tool/schema support for agent workflows.
- Port 8080 bound to all interfaces for integration with local UIs (OpenWebUI, Helium) and routing scripts.
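A quick end-to-end check of the routing is to hit the OpenAI-compatible endpoint directly. The request below is a minimal sketch: it assumes the preset section name (here `gemma-4-e4b`) is what the server expects in the `model` field, and uses the standard chat-completions schema.

```bash
# Minimal smoke test against the OpenAI-compatible API
# (assumes the models.ini section name doubles as the model identifier)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-4-e4b",
        "messages": [{"role": "user", "content": "Summarize the Vulkan backend in one sentence."}],
        "max_tokens": 128
      }'
```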
Configuration Architecture (models.ini)
The preset system uses a global-defaults + per-model-override structure. This eliminates runtime flag management, ensures baseline stability across all workloads, and allows precise parameter alignment per model architecture.
```ini
version = 1

[*]
; Global defaults - CPU-optimized baseline
seed = -1
top-p = 0.95
top-k = 20
min-p = 0.05
presence-penalty = 0.0
repeat-penalty = 1.1
jinja = true
batch-size = 256
ubatch-size = 256
threads = 8
threads-batch = 8
cpu-range = 0-7
cpu-strict = 1
kv-offload = false
defrag-thold = 0.1
poll = 25
poll-batch = 50
cpu-moe = true
gpu-layers = 0
ctx-size = 16384
```
Global defaults prioritize CPU affinity, strict thread binding, MoE routing on CPU, and conservative KV cache management. Per-model sections override only the parameters required for their specific workload profile.
Per-Model Profiles & Parameter Rationale
Quick Reasoning: gemma-4-e4b
```ini
[gemma-4-e4b]
model = /mnt/data/models/daily/google_gemma-4-E4B-it-Q4_K_M.gguf
temperature = 0.7
reasoning-budget = 256
gpu-layers = 32
ctx-size = 32768
```
- Purpose: Low-latency code completion, rapid drafting, lightweight Q&A.
- Rationale: The 4B MoE fits entirely within GPU offload limits, and the extended 32K context enables long-file navigation. `reasoning-budget = 256` constrains chain-of-thought to prevent token waste; `temperature = 0.7` maintains creative variance for ideation tasks.
General Purpose: gemma-4-26b (Daily Driver)
```ini
[gemma-4-26b]
model = /mnt/data/models/daily/google_gemma-4-26B-A4B-it-IQ4_NL.gguf
temperature = 0.65
repeat-penalty = 1.05
reasoning-budget = 512
batch-size = 512
ubatch-size = 512
defrag-thold = 0.05
threads = 6
gpu-layers = 18
```
- Purpose: Primary conversational, analytical, and long-form generation workload.
- Rationale: Heavily optimized for sustained throughput and thermal stability. Detailed parameters documented in the following section.
Agentic Router: qwen3.5-9b
```ini
[qwen3.5-9b]
model = /mnt/data/models/daily/Qwen_Qwen3.5-9B-Q4_K_M.gguf
temperature = 0.65
top-k = 25
repeat-penalty = 1.05
```
- Purpose: Function calling, tool selection, structured API routing.
- Rationale: `top-k = 25` narrows the sampling distribution to improve tool-call determinism, the reduced `repeat-penalty` prevents schema repetition loops, and the global CPU defaults apply to minimize latency during routing. A representative request is sketched below.
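For illustration, a tool-call request against this profile might look like the sketch below; the `get_weather` function is a hypothetical placeholder, not part of this setup, and the expected result is a structured `tool_calls` entry rather than free-form text.

```bash
# Hypothetical tool-call request routed to the agentic profile
# (get_weather is a placeholder function, defined only for this example)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.5-9b",
        "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": { "city": { "type": "string" } },
              "required": ["city"]
            }
          }
        }]
      }'
```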
Complex Reasoning: qwen3.6-35b
```ini
[qwen3.6-35b]
model = /mnt/data/models/daily/Qwen_Qwen3.6-35B-A3B-Q3_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
ctx-size = 8192
```
- Purpose: Deep analysis, multi-step reasoning, constrained exploration.
- Rationale: The 35B MoE requires memory safety limits. `ctx-size = 8192` prevents GTT saturation, `presence-penalty = 0.8` forces lexical diversity during long-form generation, and `reasoning-budget = 256` maintains structured output without unbounded context accumulation.
Experimental: lfm2-24b
```ini
[lfm2-24b]
model = /mnt/data/models/experimental/LFM2-24B-A2B-Q4_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
```
- Purpose: Architecture evaluation, quantization testing, parameter isolation.
- Rationale: Mirrors 35B safety guardrails. Kept separate from daily workflows to prevent context contamination or parameter bleed during testing.
Primary Model Optimization: Gemma-4-26B
The 26B MoE profile represents the core optimization target. Parameter selection resulted from systematic empirical testing across offload depth, batch sizing, cache management, and thermal behavior.
| Parameter | Value | Rationale |
|---|---|---|
| `gpu-layers` | 18 | Measured efficiency sweet spot. Beyond 18 layers, GTT usage exceeds 9.8 GB with diminishing returns (+0.15 t/s per layer). |
| `batch-size` / `ubatch-size` | 512 | Increased from the global 256. Matches prompt throughput requirements without exceeding KV cache limits. |
| `defrag-thold` | 0.05 | Aggressive KV cache defragmentation prevents memory fragmentation during long sessions. |
| `threads` | 6 (override) | Reduced from the global 8. Maintains baseline CPU activity to trigger firmware fan curves during GPU-heavy inference. |
| `reasoning-budget` | 512 | Enforces structured chain-of-thought. Improves cache locality and prevents context bloat. |
| `temperature` / `repeat-penalty` | 0.65 / 1.05 | Balances coherence with lexical variation. The lower repeat penalty prevents over-penalization in technical prose. |
Measured Performance:
- CPU-only (0 layers): 9.9 t/s generation | 0 GB GTT
- 18-layer offload: 16.9 t/s generation | 9.8 GB GTT | Stable >50 hrs
- 24-layer offload: 18.6 t/s generation | 12.6 GB GTT | Marginal stability
- Real-world 2,090-token response: 116s → 68s (40% reduction)
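Numbers like these can be reproduced with `llama-bench` by sweeping the offload depth. The invocation below is a sketch using standard `llama-bench` flags with values from this profile; exact flag support may vary with the llama.cpp version.

```bash
# Offload-depth sweep for the 26B model (values mirror the measurements above)
./build/bin/llama-bench \
  -m /mnt/data/models/daily/google_gemma-4-26B-A4B-it-IQ4_NL.gguf \
  -ngl 0,6,18,24 \
  -t 6 -b 512 -ub 512 \
  -p 512 -n 128
```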
Hardware Constraints & Empirical Findings
- VRAM Limitation: BIOS locks dedicated VRAM to 512 MB, so GPU offloading immediately spills into GTT (system RAM mapped as VRAM). `amdgpu_top` confirms usable VRAM caps at ~450 MB (see the monitoring snippet after this list).
- Offloading Diminishing Returns: Layers 0–6 yield +0.73 t/s per layer. Layers 6–18 yield +0.15–0.33 t/s per layer. Layers 18–24 yield +0.15–0.28 t/s per layer at a cost of >0.5 GB GTT per layer. Stability degrades past 20 layers.
- Thermal Firmware Constraint: ASUS fan curves respond exclusively to CPU load, so GPU-only inference bypasses thermal regulation. `threads = 6` ensures consistent CPU activity to maintain airflow.
- Context Scaling: Generation throughput drops ~27% at 40% context fill due to O(n) attention scanning per generated token. 16K context is the practical ceiling for 32 GB RAM with 20B+ models.
- Reasoning Budget as Cache Optimizer: Enforcing explicit reasoning tokens structures KV cache layout, reduces attention fragmentation, and prevents unbounded context accumulation during long sessions.
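The GTT and VRAM figures above come from the amdgpu driver. One low-overhead way to watch them during an offload sweep is to poll the amdgpu sysfs counters, as sketched below; the paths assume the iGPU is exposed as `card0`, which may differ per system, and `amdgpu_top` reports the same data interactively.

```bash
# Poll GTT and VRAM usage once per second while inference is running
# (assumes the iGPU is card0; adjust the path if it enumerates differently)
watch -n 1 '
  printf "GTT used:  %s MiB\n" $(( $(cat /sys/class/drm/card0/device/mem_info_gtt_used) / 1048576 ))
  printf "VRAM used: %s MiB\n" $(( $(cat /sys/class/drm/card0/device/mem_info_vram_used) / 1048576 ))
'
```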
Deployment Parameters
This configuration is locked for sustained single-user deployment. No dynamic context routing, no concurrent model loading, no over-engineered orchestration.
Locked Baseline:
- Vulkan + OpenMP + Native ISA compilation
- Global CPU defaults with per-model parameter overrides
- `gemma-4-26b` at 18 layers (16.9 t/s, 9.8 GB GTT)
- `gemma-4-e4b` at 32 layers (full GPU offload)
- `ctx-size = 16384` default, with model-specific reductions where required
- `threads = 6` on the 26B profile for thermal regulation
- Single-model residency enforced via `--models-max 1`
The stack delivers deterministic throughput, stable memory residency, and predictable thermal behavior within hardware constraints. Configuration changes are restricted to model quantization updates or hardware replacement.


I have multiple models configured, but only one model is loaded at any given time:
```bash
llama-server --port 8080 --host 0.0.0.0 \
    --models-preset /mnt/data/ai/models.ini \
    --models-max 1 \
    --tools all
```

Look at `--models-max`: since my hardware is a potato, I don't want to load more than one model at a time, but if your hardware allows it, you can load more than one.
From the llama-server web interface, I can switch models based on what I need to get done (no need for llama-swap).
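The same switching can be scripted against the API instead of the web interface; assuming the standard endpoint, listing the models the server knows about looks like this:

```bash
# List configured models (only one is resident at a time on this setup)
curl -s http://localhost:8080/v1/models
```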
hope this helps!