• panda_abyss@lemmy.ca
    5 days ago

    I was unable to get GLM 4.5 UD in that quant running through LM Studio, so I’ll try llama.cpp directly instead.

    edit: Runs fine in llama.cpp at 5.1–5.6 tok/s on CPU, but I can’t seem to fit the whole model in GPU memory. Still experimenting.

    llama-cli -ngl 93 --ctx-size 12288 --no-mmap -t 16
    

    A 12k context seems to be the largest I can get (uses 118 GB). You could probably push it further without a GUI, but in a desktop environment the GNOME daemon starts killing processes.

    Prompt processing at 12.3 t/s, inference at 10.7–11.1 t/s.

    Between the speed and the context window, I’d say this verges on unusable. By the time the thinking tokens are through, you’ve already burned a lot of your usable context.

    edit: Implementing Conway’s Game of Life in NumPy worked; it used 3k of the 12k context and took 7 minutes.
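
    For reference, the kind of program I was prompting for looks roughly like this (my own sketch of a one-step NumPy Game of Life, not the model’s exact output):

    import numpy as np

    def life_step(grid):
        """Advance a 0/1 grid by one generation (edges wrap around)."""
        # Sum the eight neighbours by rolling the grid in every direction.
        neighbours = sum(
            np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1)
            for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0)
        )
        # A live cell survives with 2-3 neighbours; a dead cell is born with exactly 3.
        return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(np.uint8)

    # Example: a glider on a 10x10 board, advanced a few generations.
    board = np.zeros((10, 10), dtype=np.uint8)
    board[1, 2] = board[2, 3] = board[3, 1] = board[3, 2] = board[3, 3] = 1
    for _ in range(4):
        board = life_step(board)
    print(board)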

    • Domi@lemmy.secnd.me
      3 days ago

      Prompt processing at 12.3 t/s, inference at 10.7–11.1 t/s.

      Is that still on CPU or did you get it working on GPU?

      I have seen a few people recommend GLM 4.5 at lower quants, primarily for more intricate writing; it might be worth the lower speed and smaller context for shorter texts.

      Thanks for testing!

      • panda_abyss@lemmy.ca
        3 days ago

        That was on GPU; CPU was around 5 t/s.

        I’ve also tested the image processing more: a 512x512 image takes about a minute, a 1400x900 takes about 7–10 minutes, and image-to-image takes about 10 minutes.

        Most of the time is spent in the encoder/decoder layers for image-to-image, and decoding is what scales worst with image size.