• Domi@lemmy.secnd.me
    6 days ago

    Good to hear that people are starting to get their hands on them, still have to wait ~2 weeks for mine.

    Besides the benchmarks already listed on kyuz0’s Github, the things that are interesting to me:

    • GLM-4.5 (not Air) at something like IQ2_XXS or IQ2_M. Not sure if it can even fit with any reasonable context size and if it’s even useful at all at that size but I have not seen anyone try yet: https://huggingface.co/unsloth/GLM-4.5-GGUF
    • Image generation with FLUX.1 (fp8 and fp16). Just got a new video with image generation on this chip but it’s only with Qwen Image: https://youtu.be/7-E0a6sGWgs
    • Power usage with a large model loaded but idle
    • What’s the cold start time from no model loaded to first token? Is it doable to run something like llama-swap and swap models on the fly without having to wait?
    • panda_abyss@lemmy.ca
      5 days ago

      I can’t run GLM 4.5 on those quants, I’ve been unable to get beyond 96gb vram (I know you can get 112, but I’m still a linux noob)

      GPT OSS 120b (60gb) loads into clear memory in 37-45s (tested 3 times), but I think it can take up to 60s if there are other models in memory. I’m not sure what’s going on there, it should take ~10s to read the model from disk, but I do get a lock error in lmstudio and an alloc failure.
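For scale, here is the arithmetic behind those load times as a small sketch. The 60 GB size and the 37-45 s and ~10 s figures are the ones quoted above; the function name is just illustrative.

```python
def read_throughput_gb_s(size_gb: float, seconds: float) -> float:
    """Average throughput in GB/s when size_gb is loaded in the given time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_gb / seconds


MODEL_GB = 60.0  # model size quoted above

# Observed load times versus the ~10 s expectation.
for label, secs in [("fast load", 37.0), ("slow load", 45.0), ("expected", 10.0)]:
    print(f"{label}: {read_throughput_gb_s(MODEL_GB, secs):.2f} GB/s")
```

So the observed loads average roughly 1.3-1.6 GB/s, while a 10 s load would need about 6 GB/s sustained.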

      I don’t know how to measure idle power with a model in memory (linux noob), but it’s been on my desk all day with either GPT120b or Qwen Code and has been pretty quiet (just PSU fan running off and on). With Framework the fan seems to start at around 55C, the system idles with a model in memory at 45-50C.
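One possible way to get a rough idle-power number on Linux, as a sketch only: the amdgpu driver normally exposes a hwmon power sensor (`power1_average` or `power1_input`, in microwatts). The sysfs glob pattern, sample count, and function names below are assumptions, and this reads SoC power as reported by the driver, not power at the wall.

```python
import glob
import time
from pathlib import Path


def find_power_sensor(pattern="/sys/class/drm/card*/device/hwmon/hwmon*/power1_*"):
    """Return the first power1_average or power1_input file found, or None."""
    for path in sorted(glob.glob(pattern)):
        if Path(path).name in ("power1_average", "power1_input"):
            return Path(path)
    return None


def read_watts(sensor: Path) -> float:
    """hwmon power files report microwatts; convert to watts."""
    return int(sensor.read_text().strip()) / 1_000_000


def average_watts(sensor: Path, samples: int = 10, interval_s: float = 1.0) -> float:
    """Take several readings and return their mean, in watts."""
    readings = []
    for _ in range(samples):
        readings.append(read_watts(sensor))
        time.sleep(interval_s)
    return sum(readings) / len(readings)


if __name__ == "__main__":
    sensor = find_power_sensor()
    if sensor is None:
        print("No amdgpu hwmon power sensor found")
    else:
        print(f"{sensor}: {average_watts(sensor):.1f} W")
```

Running it once with the model loaded and once without should at least show whether holding a model in memory changes the idle draw.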

      I’ll try and figure out comfyui or a nice way to run image models then get back to you. They’re not really something I need/use, so I’m starting from zero on how to run them.

        • panda_abyss@lemmy.ca
          5 days ago

          I’ve got memory set up to use the full 128gb, downloading the IQ2_XXS GLM 4.5, and I’m also downloading the strix-halo qwen image/video containers.

          It’s going to be a couple hours, so I’ll check in tomorrow morning and update you

          Edit: the qwen image toolbox is not working at all, literally zero iteration speed and none of my memory is being allocated. It seems to think I have 512MB of vram instead of the shared 128GB, this is probably a bug.

          • Domi@lemmy.secnd.me
            5 days ago

            > It seems to think I have 512MB of vram instead of the shared 128GB, this is probably a bug.

            I think that is normal, it shows 512MB in his video as well.

            Not sure what’s going on with the zero iterations though.
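If the 512MB figure is the dedicated VRAM carve-out and the rest is reached as shared (GTT) memory, which is an assumption on my part, the two pools can be compared directly. A minimal sketch reading amdgpu's `mem_info_vram_total` and `mem_info_gtt_total` sysfs nodes; the `card0` path is a guess and may be `card1` on some systems, and the function names are illustrative.

```python
from pathlib import Path


def read_gib(path: Path) -> float:
    """amdgpu reports these sysfs values in bytes; convert to GiB."""
    return int(path.read_text().strip()) / 2**30


def memory_pools(device_dir="/sys/class/drm/card0/device"):
    """Return the dedicated VRAM and GTT pool sizes that exist, in GiB."""
    device = Path(device_dir)
    pools = {}
    for name in ("mem_info_vram_total", "mem_info_gtt_total"):
        node = device / name
        if node.exists():
            pools[name] = read_gib(node)
    return pools


if __name__ == "__main__":
    for name, gib in memory_pools().items():
        print(f"{name}: {gib:.1f} GiB")
```

A small VRAM total next to a large GTT total would match what the video shows.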

                • panda_abyss@lemmy.ca
                  3 days ago

                  Yeah, I’ve discovered if you lower resolution and steps to ~4 you can prototype a prompt on blurry images.

                  Once you’ve got something with a good layout 40 steps works better

                  I’m a bit disappointed in Qwen image’s prompt processing, it’s good, but it does not have good knowledge and will swap people out if you ask for real people.

            • panda_abyss@lemmy.ca
              5 days ago

              Was unable to get GLM 4.5 UD in that quant through LM studio, I’ll try just llama.cpp instead

              edit: Runs fine in llama.cpp, 5.1-5.6 tok/s on CPU, but I can’t seem to fit the whole model in GPU memory. Still experimenting.

              llama-cli -ngl 93 -c 12288 --no-mmap -t 16
              

              12k context seems like the largest I can get (uses 118GB). You can probably push it further without a GUI, but on a desktop environment gnome daemon starts killing processes.

              Prompt processing at 12.3t/s, inference at 10.7-11.1 t/s.

              I would say this verges on not-usable between the speed and context window. After the thinking tokens are through you’ve burned a lot of your usable context.

              edit: implementing Conway’s Game of Life in numpy worked, used 3k/12k context, and took 7 minutes.

              • Domi@lemmy.secnd.me
                3 days ago

                > Prompt processing at 12.3t/s, inference at 10.7-11.1 t/s.

                Is that still on CPU or did you get it working on GPU?

                I have seen a few people recommending GLM 4.5 at lower quants primarily for more intricate writing, might be worth the lower speed and context size for shorter texts.

                Thanks for testing!

                • panda_abyss@lemmy.ca
                  3 days ago

                  That was GPU, CPU was 5.

                  I’ve also tested the image generation more: a 512x512 takes about a minute, 1400x900 takes about 7-10 minutes, and image to image takes about 10 minutes.

                  Most of the time is spent on the encoder/decoder layers for image to image, and decoding is what scales worst with image size.

            • panda_abyss@lemmy.ca
              5 days ago

              Edit: image Gen works through comfyui, but is slow. Exact same experience as the video, works as he says. Not rendering text literally halves the compute time, but Qwen Image works.

              So I had neglected to download the models. Chalk that up to trying to do this after coming home from the bar.

              Qwen image is not working in that toolbox container.

              First attempt had bad memory access when trying to merge the model and LoRA, second attempt it kept trying to use CUDA so failed to generate, third attempt it reached 100% of denoising generation but started running into strong GPU lag on the rest of the desktop environment and never produced the image.

              Fourth attempt failed to merge LoRA again due to HIP memory errors – this is after a fresh reboot, so no resource contention.

              Rebooted qwen, fifth attempt it does merge LoRA and start generation, but never actually finishes an iteration. For some reason this time it appears to be trying to run on 4 CPUs. This run again in the logs said it was trying CUDA, so… I suspect it’s the same failure.

              It looks to me like a torch configuration issue.
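A quick way to check that theory, sketched under the assumption that the container's Python has PyTorch installed. ROCm builds of PyTorch expose the GPU through the `torch.cuda` API and the `"cuda"` device name, so "CUDA" in the logs is not a failure by itself; the more telling signs are whether `torch.version.hip` is set and whether a device is actually visible. The function name is illustrative.

```python
def describe_torch_backend():
    """Summarize which GPU backend the installed PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    info = {
        "installed": True,
        "version": torch.__version__,
        # A version string on ROCm builds, None on CUDA or CPU-only builds.
        "hip": getattr(torch.version, "hip", None),
        # A version string on CUDA builds, None otherwise.
        "cuda": getattr(torch.version, "cuda", None),
        # True on ROCm builds too, as long as a HIP device is visible.
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_backend().items():
        print(f"{key}: {value}")
```

If `hip` comes back as None or `gpu_available` is False inside the toolbox, that would point at the torch build or device passthrough rather than at Qwen Image itself.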