GLM-4.5-Air is the lightweight variant of our latest flagship model family, also purpose-built for agent-centric applications. Like GLM-4.5, it adopts the Mixture-of-Experts (MoE) architecture but with a more compact parameter size. GLM-4.5-Air also supports hybrid inference modes, offering a “thinking mode” for advanced reasoning and tool use, and a “non-thinking mode” for real-time interaction. Users can control the reasoning behaviour with the reasoning enabled boolean; learn more in our docs.
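
As a rough sketch only (the endpoint URL and the exact name/shape of the reasoning toggle are assumptions here and vary by provider), flipping that boolean over an OpenAI-compatible chat API could look something like this:

```python
import os
import requests

# Hypothetical OpenAI-compatible endpoint serving GLM-4.5-Air; adjust to your provider.
API_URL = "https://example-provider.com/v1/chat/completions"
API_KEY = os.environ["API_KEY"]

def ask(prompt: str, thinking: bool) -> str:
    """Send one chat request, toggling thinking vs. non-thinking mode.

    The "reasoning" field below mirrors the "reasoning enabled boolean"
    mentioned in the announcement; the exact field name depends on the provider.
    """
    payload = {
        "model": "zai-org/GLM-4.5-Air",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning": {"enabled": thinking},  # assumed parameter shape
    }
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Thinking mode for an agentic/tool-use task, non-thinking mode for quick chat.
print(ask("Plan the API calls needed to book a flight.", thinking=True))
print(ask("Say hi.", thinking=False))
```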

Blog post: https://z.ai/blog/glm-4.5

Hugging Face:

https://huggingface.co/zai-org/GLM-4.5

https://huggingface.co/zai-org/GLM-4.5-Air

  • doodlebob@lemmy.world · 8 days ago

    I’ll take a look at both tabby and vllm tomorrow.

    Hopefully there’s CPU offload in the works so I can test those crazy models without too much fiddling in the future (the server also has 128 GB of RAM).

    • brucethemoose@lemmy.world · 8 days ago

      If you want CPU offload, ik_llama.cpp is explicitly designed for that and is your go-to. It keeps the “dense” part of the model on the GPUs and offloads the lightweight MoE bits to CPU.
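
      A rough sketch of what that split can look like (the binary and model paths and the tensor-name regex are assumptions; check ik_llama.cpp’s --help for the authoritative flags):

```python
import subprocess

# Hypothetical paths; point these at your build and your GGUF quant of GLM-4.5-Air.
SERVER_BIN = "./ik_llama.cpp/build/bin/llama-server"
MODEL_PATH = "./models/GLM-4.5-Air-Q4_K_M.gguf"

cmd = [
    SERVER_BIN,
    "-m", MODEL_PATH,
    "-c", "32768",   # context length
    "-ngl", "99",    # start by putting every layer on the GPU...
    # ...then override the big, sparsely-activated expert tensors back to CPU,
    # so only the dense attention/shared weights occupy VRAM.
    "-ot", r"\.ffn_.*_exps\.=CPU",  # regex is an assumption; match your GGUF's tensor names
    "--host", "127.0.0.1",
    "--port", "8080",
]

print("Launching:", " ".join(cmd))
subprocess.run(cmd, check=True)
```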

      vLLM and exllama are GPU only. vLLM’s niche is that it’s very fast with short-context parallel calls (i.e. serving dozens of users at once with small models), while exllama uses SOTA quantization for squeezing large models onto GPUs with minimal loss.
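
      For contrast, a minimal vLLM sketch of that “many short parallel requests” niche (model choice and sizes are arbitrary; assumes the whole model fits in VRAM):

```python
from vllm import LLM, SamplingParams

# vLLM keeps the entire model on GPU and batches many requests together
# (continuous batching), which is where its throughput advantage comes from.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=128)

# Dozens of short prompts served concurrently, as in a multi-user deployment.
prompts = [f"Summarize in one line: topic {i}" for i in range(32)]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text.strip())
```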

          • brucethemoose@lemmy.world · 7 days ago

            It should work in any generic CUDA container, but yeah, it’s more of a hobbyist engine. Honestly I just run it raw since it’s dependency-free, except for system CUDA.

            vLLM absolutely cannot CPU offload AFAIK, but small models will fit in your VRAM with room to spare.