GLM-4.5-Air is the lightweight variant of our latest flagship model family, also purpose-built for agent-centric applications. Like GLM-4.5, it adopts the Mixture-of-Experts (MoE) architecture but with a more compact parameter size. GLM-4.5-Air also supports hybrid inference modes, offering a “thinking mode” for advanced reasoning and tool use, and a “non-thinking mode” for real-time interaction. Users can control the reasoning behaviour with the reasoning enabled boolean (a quick sketch of toggling it follows the links below). Learn more in our docs.
Blog post: https://z.ai/blog/glm-4.5
Hugging Face:
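Toggling between the two modes is just a request flag. Here is a minimal sketch using an OpenAI-compatible client; the exact field shape (`reasoning: {"enabled": ...}`), the endpoint URL, and the model id are my assumptions based on the "reasoning enabled boolean" wording above, so check the docs for whichever provider you actually use.

```python
from openai import OpenAI

# Assumed endpoint and model id; substitute whatever your provider documents.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="glm-4.5-air",
    messages=[{"role": "user", "content": "Plan a 3-step refactor of this module."}],
    # Hypothetical shape of the "reasoning enabled" boolean mentioned above;
    # set False for faster, non-thinking responses.
    extra_body={"reasoning": {"enabled": True}},
)
print(resp.choices[0].message.content)
```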
If you want CPU offload, ik_llama.cpp is explicitly designed for that and is your go-to. It keeps the “dense” part of the model on the GPUs and offloads the MoE expert weights to CPU.
vLLM and ExLlama are GPU-only. vLLM’s niche is that it’s very fast with short-context parallel calls (i.e. serving dozens of users at once with small models), while ExLlama uses state-of-the-art quantization for squeezing large models onto GPUs with minimal quality loss.
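For context on the vLLM point: its offline API takes a whole batch of prompts and schedules them with continuous batching, which is where the “many parallel calls” throughput comes from. A minimal sketch; the HF repo id and whether GLM-4.5-Air actually fits on your GPUs are assumptions, swap in whatever checkpoint you really run.

```python
from vllm import LLM, SamplingParams

# Assumed repo id; pick a model/quant that fits your VRAM.
llm = LLM(model="zai-org/GLM-4.5-Air", tensor_parallel_size=2)
params = SamplingParams(temperature=0.7, max_tokens=128)

# Dozens of short requests submitted at once; vLLM's continuous batching
# keeps the GPU saturated, which is exactly its niche.
prompts = [f"Answer briefly: what does user #{i} need?" for i in range(64)]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```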
IK sounds promising! Will check it out to see if it can run in a container.
I’m just gonna try vLLM; seems like ik_llama.cpp doesn’t have a quick Docker method.
It should work in any generic CUDA container, but yeah, it’s more of a hobbyist engine. Honestly I just run it raw since it’s dependency-free apart from system CUDA.
vLLM absolutely cannot CPU-offload AFAIK, but small models will fit in your VRAM with room to spare.
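If you want to sanity-check the “fits with room to spare” part, a back-of-envelope estimate (my own rule of thumb, not anything the engines compute for you) is parameter count times bytes per weight, plus some headroom for KV cache and activations:

```python
def est_vram_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough weight-memory estimate with ~20% headroom for KV cache/activations."""
    return params_billions * bits_per_weight / 8 * overhead

# e.g. a 32B dense model at 4-bit: roughly 19 GB, fine on a 24 GB card
print(round(est_vram_gb(32, 4), 1))
```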