GLM-4.5-Air is the lightweight variant of our latest flagship model family, also purpose-built for agent-centric applications. Like GLM-4.5, it adopts the Mixture-of-Experts (MoE) architecture but with a more compact parameter size. GLM-4.5-Air also supports hybrid inference modes, offering a “thinking mode” for advanced reasoning and tool use, and a “non-thinking mode” for real-time interaction. Users can control the reasoning behaviour with the `reasoning` `enabled` boolean. Learn more in our docs.
Blog post: https://z.ai/blog/glm-4.5
Hugging Face:
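For illustration, a minimal sketch of flipping that boolean through an OpenAI-compatible endpoint. The base URL, model id, and passing the flag via `extra_body` are assumptions here; check the provider's docs for the exact field name and shape:

```python
# Hedged sketch: toggling GLM-4.5-Air's thinking mode over an
# OpenAI-compatible API. The base_url, model id, and the exact shape of
# the reasoning field are assumptions -- consult the provider's docs.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

# Thinking mode on: the model reasons before answering (slower, stronger).
resp = client.chat.completions.create(
    model="glm-4.5-air",
    messages=[{"role": "user", "content": "Plan a 3-step web-scraping agent."}],
    extra_body={"reasoning": {"enabled": True}},  # assumed field name
)
print(resp.choices[0].message.content)
```

Setting `"enabled": False` would request the non-thinking mode for lower-latency, real-time use.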
I’m just gonna try vLLM; it seems like ik_llama.cpp doesn’t have a quick Docker method.
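A rough sketch of what the vLLM route might look like with its offline generation API. The Hugging Face id (zai-org/GLM-4.5-Air) and the GPU count are assumptions to adapt to your hardware; a 100B-class MoE needs several GPUs or a heavy quant:

```python
# Hedged sketch: running GLM-4.5-Air with vLLM's offline API.
# Assumptions: the HF repo id and 8-way tensor parallelism -- adjust both.
from vllm import LLM, SamplingParams

llm = LLM(model="zai-org/GLM-4.5-Air", tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain MoE routing in two sentences."], params)
print(outputs[0].outputs[0].text)
```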
It should work in any generic CUDA container, but yeah, it’s more of a hobbyist engine. Honestly I just run it raw since it’s dependency-free, except for system CUDA.
vLLM absolutely cannot CPU offload AFAIK, but small models will fit in your VRAM with room to spare.
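A quick back-of-the-envelope for that "fits in VRAM" judgment. The 106B total-parameter figure is GLM-4.5-Air's published size; the rest is arithmetic on illustrative quantization widths:

```python
# Weights-only VRAM estimate at a given quantization width. KV cache,
# activations, and framework overhead come on top, so leave headroom.
def weight_footprint_gib(params_billion: float, bits_per_weight: float) -> float:
    """GiB needed just to hold the weights."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

print(f"{weight_footprint_gib(106, 16):.0f} GiB")  # GLM-4.5-Air @ bf16: ~197 GiB
print(f"{weight_footprint_gib(106, 4):.0f} GiB")   # @ 4-bit quant:     ~49 GiB
print(f"{weight_footprint_gib(7, 4):.1f} GiB")     # a 7B model @ 4-bit: ~3.3 GiB
```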