• panda_abyss@lemmy.ca
    5 days ago

    I was unable to get GLM 4.5 UD in that quant running through LM Studio, so I’ll try llama.cpp directly instead.

    edit: Runs fine in llama.cpp at 5.1–5.6 tok/s on CPU, but I can’t seem to fit the whole model in GPU memory. Still experimenting.

    llama-cli -ngl 93 --ctx-size 12288 --no-mmap -t 16
    

    A 12k context seems to be the largest I can get (uses 118 GB). You could probably push it further without a GUI, but in a desktop environment the GNOME daemon starts killing processes.

    Prompt processing at 12.3 t/s, inference at 10.7–11.1 t/s.

    Between the speed and the context window, I’d say this verges on unusable. By the time the thinking tokens are through, you’ve already burned a lot of your usable context.

    edit: Implementing Conway’s Game of Life in NumPy worked; it used 3k of the 12k context and took 7 minutes.
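
    For reference, the kind of program I was prompting for looks roughly like this (my own sketch of a one-step NumPy Game of Life, not the model’s exact output):

    import numpy as np

    def life_step(grid):
        """Advance a 0/1 grid by one generation (edges wrap around)."""
        # Sum the eight neighbours by rolling the grid in every direction.
        neighbours = sum(
            np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1)
            for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0)
        )
        # A live cell survives with 2-3 neighbours; a dead cell is born with exactly 3.
        return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(np.uint8)

    # Example: a glider on a 10x10 board, advanced a few generations.
    board = np.zeros((10, 10), dtype=np.uint8)
    board[1, 2] = board[2, 3] = board[3, 1] = board[3, 2] = board[3, 3] = 1
    for _ in range(4):
        board = life_step(board)
    print(board)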

    • Domi@lemmy.secnd.me
      3 days ago

      Prompt processing at 12.3 t/s, inference at 10.7–11.1 t/s.

      Is that still on CPU or did you get it working on GPU?

      I have seen a few people recommend GLM 4.5 at lower quants, primarily for more intricate writing; it might be worth the lower speed and smaller context for shorter texts.

      Thanks for testing!

      • panda_abyss@lemmy.ca
        3 days ago

        That was on GPU; CPU was around 5 t/s.

        I’ve also tested the image processing more: a 512x512 image takes about a minute, a 1400x900 takes about 7–10 minutes, and image-to-image takes about 10 minutes.

        Most of the time is spent in the encoder/decoder layers for image-to-image, and decoding is what scales worst with image size.