System & Software Stack
- Hardware: ASUS Zenbook 15 UM3504DA | AMD Ryzen 7 7735U (8C/16T) | Radeon 680M iGPU (512 MB BIOS-limited VRAM) | 32 GB LPDDR5 RAM
- OS: CachyOS (Arch Linux) | Wayland + Niri compositor
- Runtime: llama.cpp custom Vulkan build | llama-server with preset routing
- Deployment Scope: Single-user local inference | 2–3 year static configuration window
Build Configuration
The binary is compiled with hardware-aware optimizations and server/tooling support. Each flag addresses a specific constraint or capability of the target platform.
cmake .. \
-DGGML_NATIVE=ON \
-DGGML_OPENMP=ON \
-DGGML_VULKAN=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_BUILD_TOOLS=ON
| Flag | Purpose | Measured Impact |
|---|---|---|
| GGML_NATIVE=ON | Enables CPU-specific ISA extensions (AVX2; AVX-512 where available) | +10–15% prompt throughput on Zen 3+ cores |
| GGML_OPENMP=ON | Parallelizes prompt processing across available cores | Required for batched CPU inference |
| GGML_VULKAN=ON | GPU acceleration backend | Mandatory for the Rembrandt iGPU. ROCm unsupported. CUDA inapplicable. |
| CMAKE_INTERPROCEDURAL_OPTIMIZATION=ON | Link-time optimization | Reduces binary size, improves instruction cache locality |
| LLAMA_BUILD_SERVER=ON | Compiles the HTTP server with an OpenAI-compatible API | Enables remote UI and agent routing |
| LLAMA_BUILD_TOOLS=ON | Enables structured function calling | Required for agentic task execution |
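For reference, the remaining build step and a quick backend check might look like the following; the vulkaninfo call (from the vulkan-tools package) and the binary paths are illustrative, not part of the locked configuration:

# from the build directory, after the cmake configuration above
cmake --build . -j"$(nproc)"

# confirm the Radeon 680M is visible to the Vulkan loader before first launch
vulkaninfo --summary | grep -i radeon

# server and tool binaries land in build/bin
ls bin/llama-server bin/llama-bench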
Server Launch & Routing Architecture
The server is invoked with strict resource controls to prevent memory thrashing on constrained hardware:
llama-server --port 8080 --host 0.0.0.0 \
--models-preset /mnt/data/ai/models.ini \
--models-max 1 \
--tools all
- --models-max 1: Enforces single-model residency; prevents concurrent RAM/GTT allocation spikes.
- --models-preset: Loads the declarative INI configuration for deterministic parameter application.
- --tools all: Activates full OpenAI-compatible tool/schema support for agent workflows.
- Port 8080 is bound to all interfaces for integration with local UIs (OpenWebUI, Helium) and routing scripts; a quick verification example follows this list.
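Because llama-server speaks the OpenAI-compatible API, the routing setup can be sanity-checked with plain curl; /health and /v1/models are standard llama-server endpoints, and localhost assumes a same-machine session:

# liveness check
curl -s http://localhost:8080/health

# list the models registered by the preset file
curl -s http://localhost:8080/v1/models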
Configuration Architecture (models.ini)
The preset system uses a global-defaults + per-model-override structure. This eliminates runtime flag management, ensures baseline stability across all workloads, and allows precise parameter alignment per model architecture.
version = 1
[*]
; Global defaults - CPU-optimized baseline
seed = -1
top-p = 0.95
top-k = 20
min-p = 0.05
presence-penalty = 0.0
repeat-penalty = 1.1
jinja = true
batch-size = 256
ubatch-size = 256
threads = 8
threads-batch = 8
cpu-range = 0-7
cpu-strict = 1
kv-offload = false
defrag-thold = 0.1
poll = 25
poll-batch = 50
cpu-moe = true
gpu-layers = 0
ctx-size = 16384
Global defaults prioritize CPU affinity, strict thread binding, MoE routing on CPU, and conservative KV cache management. Per-model sections override only the parameters required for their specific workload profile.
Per-Model Profiles & Parameter Rationale
Quick Reasoning: gemma-4-e4b
[gemma-4-e4b]
model = /mnt/data/models/daily/google_gemma-4-E4B-it-Q4_K_M.gguf
temperature = 0.7
reasoning-budget = 256
gpu-layers = 32
ctx-size = 32768
- Purpose: Low-latency code completion, rapid drafting, lightweight Q&A.
- Rationale: 4B MoE fits entirely within GPU offload limits. Extended context (32K) enables long-file navigation.
- reasoning-budget = 256 constrains chain-of-thought to prevent token waste.
- temperature = 0.7 maintains creative variance for ideation tasks (a sample request follows this list).
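A minimal request against this profile is an ordinary chat completion; the model field is assumed here to match the preset section name:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e4b",
    "messages": [{"role": "user", "content": "Draft a short docstring for a CSV parser."}],
    "max_tokens": 256
  }'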
General Purpose: gemma-4-26b (Daily Driver)
[gemma-4-26b]
model = /mnt/data/models/daily/google_gemma-4-26B-A4B-it-IQ4_NL.gguf
temperature = 0.65
repeat-penalty = 1.05
reasoning-budget = 512
batch-size = 512
ubatch-size = 512
defrag-thold = 0.05
gpu-layers = 18
threads = 6
- Purpose: Primary conversational, analytical, and long-form generation workload.
- Rationale: Heavily optimized for sustained throughput and thermal stability. Detailed parameters documented in the following section.
Agentic Router: qwen3.5-9b
[qwen3.5-9b]
model = /mnt/data/models/daily/Qwen_Qwen3.5-9B-Q4_K_M.gguf
temperature = 0.65
top-k = 25
repeat-penalty = 1.05
- Purpose: Function calling, tool selection, structured API routing.
- Rationale:
- top-k = 25 narrows the sampling distribution to improve tool-call determinism.
- Reduced repeat-penalty prevents schema repetition loops.
- Global CPU defaults apply to minimize latency during routing (a tool-call request sketch follows this list).
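A representative routing request uses the standard OpenAI tools schema, which llama-server honors when jinja templating is enabled; the get_weather function below is purely illustrative:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-9b",
    "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'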
Complex Reasoning: qwen3.6-35b
[qwen3.6-35b]
model = /mnt/data/models/daily/Qwen_Qwen3.6-35B-A3B-Q3_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
ctx-size = 8192
- Purpose: Deep analysis, multi-step reasoning, constrained exploration.
- Rationale: 35B MoE requires memory safety limits.
- ctx-size = 8192 prevents GTT saturation.
- presence-penalty = 0.8 forces lexical diversity during long-form generation.
- reasoning-budget = 256 maintains structured output without unbounded context accumulation.
Experimental: lfm2-24b
[lfm2-24b]
model = /mnt/data/models/experimental/LFM2-24B-A2B-Q4_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
- Purpose: Architecture evaluation, quantization testing, parameter isolation.
- Rationale: Mirrors 35B safety guardrails. Kept separate from daily workflows to prevent context contamination or parameter bleed during testing.
Primary Model Optimization: Gemma-4-26B
The 26B MoE profile represents the core optimization target. Parameter selection resulted from systematic empirical testing across offload depth, batch sizing, cache management, and thermal behavior.
| Parameter | Value | Rationale |
|---|---|---|
| gpu-layers | 18 | Measured efficiency sweet spot. Beyond 18 layers, GTT usage exceeds 9.8 GB with diminishing returns (+0.15 t/s per layer). |
| batch-size / ubatch-size | 512 | Increased from global 256. Matches prompt throughput requirements without exceeding KV cache limits. |
| defrag-thold | 0.05 | Aggressive KV cache defragmentation prevents memory fragmentation during long sessions. |
| threads | 6 (override) | Reduced from global 8. Maintains baseline CPU activity to trigger firmware fan curves during GPU-heavy inference. |
| reasoning-budget | 512 | Enforces structured chain-of-thought. Improves cache locality and prevents context bloat. |
| temperature / repeat-penalty | 0.65 / 1.05 | Balances coherence with lexical variation. Lower repeat penalty prevents over-penalization in technical prose. |
Measured Performance:
- CPU-only (0 layers): 9.9 t/s generation | 0 GB GTT
- 18-layer offload: 16.9 t/s generation | 9.8 GB GTT | Stable >50 hrs
- 24-layer offload: 18.6 t/s generation | 12.6 GB GTT | Marginal stability
- Real-world 2,090-token response: 116s → 68s (40% reduction)
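The offload sweep behind these numbers can be reproduced with llama-bench from the same build; the loop below is a sketch, and flag spellings may differ slightly across llama.cpp revisions:

# sweep GPU offload depth; -p/-n set prompt and generation token counts, -t matches the thermal thread override
for ngl in 0 6 12 18 24; do
  ./bin/llama-bench \
    -m /mnt/data/models/daily/google_gemma-4-26B-A4B-it-IQ4_NL.gguf \
    -ngl "$ngl" -t 6 -p 512 -n 128
done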
Hardware Constraints & Empirical Findings
- VRAM Limitation: BIOS locks dedicated VRAM to 512 MB, so GPU offloading immediately utilizes GTT (system RAM mapped as VRAM). amdgpu_top confirms usable VRAM caps at ~450 MB (a monitoring sketch follows this list).
- Offloading Diminishing Returns: Layers 0–6 yield +0.73 t/s per layer. Layers 6–18 yield +0.15–0.33 t/s per layer. Layers 18–24 yield +0.15–0.28 t/s per layer at >0.5 GB GTT cost per layer. Stability degrades past 20 layers.
- Thermal Firmware Constraint: ASUS fan curves respond exclusively to CPU load, so GPU-only inference bypasses thermal regulation. threads = 6 ensures consistent CPU activity to maintain airflow.
- Context Scaling: Generation throughput drops ~27% at 40% context fill due to O(n) attention scanning. 16K context is the practical ceiling for 32 GB RAM with 20B+ models.
- Reasoning Budget as Cache Optimizer: Enforcing explicit reasoning tokens structures KV cache layout, reduces attention fragmentation, and prevents unbounded context accumulation during long sessions.
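GTT and VRAM residency can also be watched outside amdgpu_top via the amdgpu sysfs counters; card0 is an assumption and may differ per machine:

# report GTT and VRAM usage in MiB (the kernel exposes these counters in bytes)
for f in mem_info_gtt_used mem_info_vram_used; do
  printf '%s: %s MiB\n' "$f" "$(( $(cat /sys/class/drm/card0/device/$f) / 1024 / 1024 ))"
done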
Deployment Parameters
This configuration is locked for sustained single-user deployment. No dynamic context routing, no concurrent model loading, no over-engineered orchestration.
Locked Baseline:
- Vulkan + OpenMP + Native ISA compilation
- Global CPU defaults with per-model parameter overrides
- gemma-4-26b at 18 layers (16.9 t/s, 9.8 GB GTT)
- gemma-4-e4b at 32 layers (full GPU offload)
- ctx-size = 16384 default, with model-specific reductions where required
- threads = 6 on the 26B profile for thermal regulation
- Single-model residency enforced via --models-max 1
The stack delivers deterministic throughput, stable memory residency, and predictable thermal behavior within hardware constraints. Configuration changes are restricted to model quantization updates or hardware replacement.


No, not really. Smaller models tend to be worse with more complex topics, so you would have to check their output more closely if you are programming with them. You could use a single small model in a coding assistant, and that can work for some tasks; the coding agent would behave the same, but the results might be worse. The context itself does not have to be narrowed for a smaller model: it can usually process the same context size even faster than a larger model, and because the model is smaller there is room for more context. However, it may need smaller steps and more checking in between to make sure it does not introduce errors that compound over time. Smaller models also tend to be less reliable at calling tools (they more often make mistakes in parameter formats and the like). You can still have a lot of fun with smaller models, and depending on the task they may even be sufficient.