GLM-4.5-Air is the lightweight variant of our latest flagship model family, also purpose-built for agent-centric applications. Like GLM-4.5, it adopts the Mixture-of-Experts (MoE) architecture but with a more compact parameter size. GLM-4.5-Air also supports hybrid inference modes, offering a “thinking mode” for advanced reasoning and tool use, and a “non-thinking mode” for real-time interaction. Users can control the reasoning behaviour with the reasoning enabled boolean. Learn more in our docs

Blog post: https://z.ai/blog/glm-4.5

Hugging Face:

https://huggingface.co/zai-org/GLM-4.5

https://huggingface.co/zai-org/GLM-4.5-Air

  • brucethemoose@lemmy.world
    link
    fedilink
    English
    arrow-up
    2
    ·
    edit-2
    7 days ago

    It should work in any generic cuda container, but yeah it’s more of a hobbyist engine. Honestly I just run it raw since it’s dependency free, except for system CUDA.

    Vllm absolutely cannot CPU offload AFAIK, but small models will fit in your vram with room to spare.