System & Software Stack
- Hardware: ASUS Zenbook 15 UM3504DA | AMD Ryzen 7 7735U (8C/16T) | Radeon 680M iGPU (512 MB BIOS-limited VRAM) | 32 GB LPDDR5 RAM
- OS: CachyOS (Arch Linux) | Wayland + Niri compositor
- Runtime: `llama.cpp` custom Vulkan build | `llama-server` with preset routing
- Deployment Scope: Single-user local inference | 2–3 year static configuration window
Build Configuration
The binary is compiled with hardware-aware optimizations and server/tooling support. Each flag addresses a specific constraint or capability of the target platform.
```bash
cmake .. \
  -DGGML_NATIVE=ON \
  -DGGML_OPENMP=ON \
  -DGGML_VULKAN=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_TOOLS=ON
```
| Flag | Purpose | Measured Impact |
|---|---|---|
| `GGML_NATIVE=ON` | Enables CPU-specific ISA extensions (AVX2/AVX512) | +10–15% prompt throughput on Zen 3+ cores |
| `GGML_OPENMP=ON` | Parallelizes prompt processing across available cores | Required for batched CPU inference |
| `GGML_VULKAN=ON` | GPU acceleration backend | Mandatory for the Rembrandt iGPU; ROCm unsupported, CUDA inapplicable |
| `CMAKE_INTERPROCEDURAL_OPTIMIZATION=ON` | Link-time optimization | Reduces binary size, improves instruction cache locality |
| `LLAMA_BUILD_SERVER=ON` | Compiles the HTTP server with an OpenAI-compatible API | Enables remote UI and agent routing |
| `LLAMA_BUILD_TOOLS=ON` | Enables structured function calling | Required for agentic task execution |
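The configure step above assumes an out-of-tree build directory. A surrounding sequence would look roughly like the sketch below; the repository URL, directory layout, and job count are assumptions rather than the exact commands used here, and a Vulkan build additionally needs the Vulkan headers/loader and `glslc` installed.

```bash
# Fetch the sources and build out-of-tree with the configuration above
# (repository URL, directory layout, and job count are illustrative, not the exact setup)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && mkdir -p build && cd build

# ...run the cmake configuration shown above from this build directory...

cmake --build . --config Release -j"$(nproc)"

# The resulting binaries, including llama-server, land under bin/
./bin/llama-server --version
```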
Server Launch & Routing Architecture
The server is invoked with strict resource controls to prevent memory thrashing on constrained hardware:
```bash
llama-server --port 8080 --host 0.0.0.0 \
    --models-preset /mnt/data/ai/models.ini \
    --models-max 1 \
    --tools all
```
- `--models-max 1`: Enforces single-model residency. Prevents concurrent RAM/GTT allocation spikes.
- `--models-preset`: Loads declarative INI configuration for deterministic parameter application.
- `--tools all`: Activates full OpenAI-compatible tool/schema support for agent workflows.
- Port 8080 bound to all interfaces for integration with local UIs (OpenWebUI, Helium) and routing scripts.
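A quick end-to-end check of the routing is to hit the OpenAI-compatible endpoint directly. The request below is a minimal sketch: it assumes the preset section name (here `gemma-4-e4b`) is what the server expects in the `model` field, and uses the standard chat-completions schema.

```bash
# Minimal smoke test against the OpenAI-compatible API
# (assumes the models.ini section name doubles as the model identifier)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-4-e4b",
        "messages": [{"role": "user", "content": "Summarize the Vulkan backend in one sentence."}],
        "max_tokens": 128
      }'
```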
Configuration Architecture (models.ini)
The preset system uses a global-defaults + per-model-override structure. This eliminates runtime flag management, ensures baseline stability across all workloads, and allows precise parameter alignment per model architecture.
```ini
version = 1

[*]
; Global defaults - CPU-optimized baseline
seed = -1
top-p = 0.95
top-k = 20
min-p = 0.05
presence-penalty = 0.0
repeat-penalty = 1.1
jinja = true
batch-size = 256
ubatch-size = 256
threads = 8
threads-batch = 8
cpu-range = 0-7
cpu-strict = 1
kv-offload = false
defrag-thold = 0.1
poll = 25
poll-batch = 50
cpu-moe = true
gpu-layers = 0
ctx-size = 16384
```
Global defaults prioritize CPU affinity, strict thread binding, MoE routing on CPU, and conservative KV cache management. Per-model sections override only the parameters required for their specific workload profile.
Per-Model Profiles & Parameter Rationale
Quick Reasoning: gemma-4-e4b
```ini
[gemma-4-e4b]
model = /mnt/data/models/daily/google_gemma-4-E4B-it-Q4_K_M.gguf
temperature = 0.7
reasoning-budget = 256
gpu-layers = 32
ctx-size = 32768
```
- Purpose: Low-latency code completion, rapid drafting, lightweight Q&A.
- Rationale: The 4B MoE fits entirely within GPU offload limits, and the extended 32K context enables long-file navigation. `reasoning-budget = 256` constrains chain-of-thought to prevent token waste; `temperature = 0.7` maintains creative variance for ideation tasks.
General Purpose: gemma-4-26b (Daily Driver)
```ini
[gemma-4-26b]
model = /mnt/data/models/daily/google_gemma-4-26B-A4B-it-IQ4_NL.gguf
temperature = 0.65
repeat-penalty = 1.05
reasoning-budget = 512
batch-size = 512
ubatch-size = 512
defrag-thold = 0.05
threads = 6
gpu-layers = 18
```
- Purpose: Primary conversational, analytical, and long-form generation workload.
- Rationale: Heavily optimized for sustained throughput and thermal stability. Detailed parameters documented in the following section.
Agentic Router: qwen3.5-9b
```ini
[qwen3.5-9b]
model = /mnt/data/models/daily/Qwen_Qwen3.5-9B-Q4_K_M.gguf
temperature = 0.65
top-k = 25
repeat-penalty = 1.05
```
- Purpose: Function calling, tool selection, structured API routing.
- Rationale: `top-k = 25` narrows the sampling distribution to improve tool-call determinism, the reduced `repeat-penalty` prevents schema repetition loops, and the global CPU defaults apply to minimize latency during routing. A representative request is sketched below.
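For illustration, a tool-call request against this profile might look like the sketch below; the `get_weather` function is a hypothetical placeholder, not part of this setup, and the expected result is a structured `tool_calls` entry rather than free-form text.

```bash
# Hypothetical tool-call request routed to the agentic profile
# (get_weather is a placeholder function, defined only for this example)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.5-9b",
        "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": { "city": { "type": "string" } },
              "required": ["city"]
            }
          }
        }]
      }'
```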
Complex Reasoning: qwen3.6-35b
```ini
[qwen3.6-35b]
model = /mnt/data/models/daily/Qwen_Qwen3.6-35B-A3B-Q3_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
ctx-size = 8192
```
- Purpose: Deep analysis, multi-step reasoning, constrained exploration.
- Rationale: The 35B MoE requires memory safety limits. `ctx-size = 8192` prevents GTT saturation, `presence-penalty = 0.8` forces lexical diversity during long-form generation, and `reasoning-budget = 256` maintains structured output without unbounded context accumulation.
Experimental: lfm2-24b
```ini
[lfm2-24b]
model = /mnt/data/models/experimental/LFM2-24B-A2B-Q4_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
```
- Purpose: Architecture evaluation, quantization testing, parameter isolation.
- Rationale: Mirrors 35B safety guardrails. Kept separate from daily workflows to prevent context contamination or parameter bleed during testing.
Primary Model Optimization: Gemma-4-26B
The 26B MoE profile represents the core optimization target. Parameter selection resulted from systematic empirical testing across offload depth, batch sizing, cache management, and thermal behavior.
| Parameter | Value | Rationale |
|---|---|---|
| `gpu-layers` | 18 | Measured efficiency sweet spot. Beyond 18 layers, GTT usage exceeds 9.8 GB with diminishing returns (+0.15 t/s per layer). |
| `batch-size` / `ubatch-size` | 512 | Increased from the global 256. Matches prompt throughput requirements without exceeding KV cache limits. |
| `defrag-thold` | 0.05 | Aggressive KV cache defragmentation prevents memory fragmentation during long sessions. |
| `threads` | 6 (override) | Reduced from the global 8. Maintains baseline CPU activity to trigger firmware fan curves during GPU-heavy inference. |
| `reasoning-budget` | 512 | Enforces structured chain-of-thought. Improves cache locality and prevents context bloat. |
| `temperature` / `repeat-penalty` | 0.65 / 1.05 | Balances coherence with lexical variation. The lower repeat penalty prevents over-penalization in technical prose. |
Measured Performance:
- CPU-only (0 layers): 9.9 t/s generation | 0 GB GTT
- 18-layer offload: 16.9 t/s generation | 9.8 GB GTT | Stable >50 hrs
- 24-layer offload: 18.6 t/s generation | 12.6 GB GTT | Marginal stability
- Real-world 2,090-token response: 116s → 68s (40% reduction)
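Numbers like these can be reproduced with `llama-bench` by sweeping the offload depth. The invocation below is a sketch using standard `llama-bench` flags with values from this profile; exact flag support may vary with the llama.cpp version.

```bash
# Offload-depth sweep for the 26B model (values mirror the measurements above)
./build/bin/llama-bench \
  -m /mnt/data/models/daily/google_gemma-4-26B-A4B-it-IQ4_NL.gguf \
  -ngl 0,6,18,24 \
  -t 6 -b 512 -ub 512 \
  -p 512 -n 128
```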
Hardware Constraints & Empirical Findings
- VRAM Limitation: BIOS locks dedicated VRAM to 512 MB, so GPU offloading immediately spills into GTT (system RAM mapped as VRAM). `amdgpu_top` confirms usable VRAM caps at ~450 MB (see the monitoring snippet after this list).
- Offloading Diminishing Returns: Layers 0–6 yield +0.73 t/s per layer. Layers 6–18 yield +0.15–0.33 t/s per layer. Layers 18–24 yield +0.15–0.28 t/s per layer at a cost of >0.5 GB GTT per layer. Stability degrades past 20 layers.
- Thermal Firmware Constraint: ASUS fan curves respond exclusively to CPU load, so GPU-only inference bypasses thermal regulation. `threads = 6` ensures consistent CPU activity to maintain airflow.
- Context Scaling: Generation throughput drops ~27% at 40% context fill due to O(n) attention scanning per generated token. 16K context is the practical ceiling for 32 GB RAM with 20B+ models.
- Reasoning Budget as Cache Optimizer: Enforcing explicit reasoning tokens structures KV cache layout, reduces attention fragmentation, and prevents unbounded context accumulation during long sessions.
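The GTT and VRAM figures above come from the amdgpu driver. One low-overhead way to watch them during an offload sweep is to poll the amdgpu sysfs counters, as sketched below; the paths assume the iGPU is exposed as `card0`, which may differ per system, and `amdgpu_top` reports the same data interactively.

```bash
# Poll GTT and VRAM usage once per second while inference is running
# (assumes the iGPU is card0; adjust the path if it enumerates differently)
watch -n 1 '
  printf "GTT used:  %s MiB\n" $(( $(cat /sys/class/drm/card0/device/mem_info_gtt_used) / 1048576 ))
  printf "VRAM used: %s MiB\n" $(( $(cat /sys/class/drm/card0/device/mem_info_vram_used) / 1048576 ))
'
```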
Deployment Parameters
This configuration is locked for sustained single-user deployment. No dynamic context routing, no concurrent model loading, no over-engineered orchestration.
Locked Baseline:
- Vulkan + OpenMP + Native ISA compilation
- Global CPU defaults with per-model parameter overrides
- `gemma-4-26b` at 18 layers (16.9 t/s, 9.8 GB GTT)
- `gemma-4-e4b` at 32 layers (full GPU offload)
- `ctx-size = 16384` default, with model-specific reductions where required
- `threads = 6` on the 26B profile for thermal regulation
- Single-model residency enforced via `--models-max 1`
The stack delivers deterministic throughput, stable memory residency, and predictable thermal behavior within hardware constraints. Configuration changes are restricted to model quantization updates or hardware replacement.


I have multiple models configured, but only one model is loaded at any given time:
```bash
llama-server --port 8080 --host 0.0.0.0 \
    --models-preset /mnt/data/ai/models.ini \
    --models-max 1 \
    --tools all
```

Look at `--models-max`: since my hardware is a potato, I don't want to load more than one model at a time, but if your hardware allows it, you can load more than one.
From the llama-server web interface, I can switch models based on what I need to get done (no need for llama-swap).
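The same switching can be scripted against the API instead of the web interface; assuming the standard endpoint, listing the models the server knows about looks like this:

```bash
# List configured models (only one is resident at a time on this setup)
curl -s http://localhost:8080/v1/models
```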
hope this helps!