I would suggest trying exllamav3 once; I have no idea what kind of black magic they use, but it's very memory efficient.
I can't load Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 with even 16K of context using vLLM,
but using exllamav3 I can SOMEHOW load ArtusDev/Qwen_Qwen3-Coder-30B-A3B-Instruct-EXL3:8.0bpw_H8 at its full context of 262,144 with 2 GiB still to spare.
I really feel like this is too good to be true and I'm doing something wrong, but it just works, so I don't know.
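For reference, this is roughly the vLLM setup that fails for me (a minimal sketch; the gpu_memory_utilization value is just what I happened to try, not a recommendation):

```python
from vllm import LLM, SamplingParams

# FP8 checkpoint with the context capped at 16K; on my card this
# still can't find enough room for the weights plus KV cache.
llm = LLM(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    max_model_len=16384,          # 16K context
    gpu_memory_utilization=0.95,  # leave a little headroom for the driver
)

params = SamplingParams(temperature=0.7, max_tokens=256)
print(llm.generate(["Write a haiku about VRAM."], params)[0].outputs[0].text)
```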
I guess there’s some automatic VRAM paging going on. How many tokens per second do you get while generating?
I’m not sure if it’s just me, but that’s a static image. I assume you posted the one where they throw a brick into it.
Also, if this post was serious, how does a highly quantized model compare to something less quantized but with fewer parameters? I haven’t seen benchmarks other than perplexity, which isn’t a good measure of capability.
Unsloth did a test, and their dynamic quants were competitive even at 1-bit on the Aider Polyglot benchmark: https://docs.unsloth.ai/new/unsloth-dynamic-ggufs-on-aider-polyglot
Holy cow!
It’s a WebP animation. Maybe your client doesn’t display it correctly; I’ll replace it with a GIF.
Regarding your other question, I tend to see better results with higher params + lower precision versus lower params + higher precision. That’s just based on “vibes” though; I haven’t done any real testing. Based on what I’ve seen, Q4 is the lowest safe quantization, and beyond that the performance really starts to drop off. Unfortunately, even at 1-bit quantization I can’t run GLM 4.6 on my system.
That fixed it.
I am a fan of this quant cook. He often posts perplexity charts.
https://huggingface.co/ubergarm
All of his quants require ik_llama, which works best with Nvidia CUDA, but they can do a lot with RAM+VRAM or even hard drive + RAM. I don’t know if 8 GB is enough for everything.
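In case it helps, the RAM+VRAM split just means offloading only part of the layers to the GPU. A minimal sketch with the mainline llama-cpp-python bindings (ik_llama is a separate fork with its own build; the model path and layer count here are only placeholders):

```python
from llama_cpp import Llama

# Partial offload: put some transformer layers on the GPU and keep the
# rest in system RAM; how many fit in 8 GB of VRAM depends on the quant.
llm = Llama(
    model_path="./models/some-quant.gguf",  # placeholder path
    n_gpu_layers=20,   # layers offloaded to VRAM; 0 = CPU only, -1 = all
    n_ctx=8192,        # context length also costs memory
)

out = llm("Q: Why is partial offload slower than full GPU?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```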
What’s higher precision for you? What I remember from the old ggml measurements is that going lower than Q3 rarely makes sense, and at roughly Q3 you’d start thinking about switching to a smaller variant. On the other hand, everything above Q6 only shows marginal differences in perplexity, so Q6, Q8 and full precision are basically the same thing.
As a memory-poor user (hence the 8 GB VRAM card), I consider Q8+ to be higher precision, Q4-Q5 to be mid-low precision (what I typically use), and anything below that to be low precision.
Thanks. That sounds reasonable. Btw, you’re not the only poor person around; I don’t even own a graphics card… I’m not a gamer, so I never saw any reason to buy one before I took an interest in AI. I do inference on my CPU, which is connected to more than 8 GB of memory. It’s just slow 😉 But I guess I’m fine with that. I don’t rely on AI, it’s just tinkering, and I’m patient. And a few times a year I’ll rent some cloud GPU by the hour. Maybe one day I’ll buy one myself.
I think perplexity is still central to evaluating models. It’s notoriously difficult to come up with other ways to measure these things.
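For what it’s worth, perplexity itself is cheap to compute: it’s just the exponential of the average per-token cross-entropy loss. A minimal sketch with the transformers library (the model name is only a small placeholder so it runs on CPU):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Perplexity = exp(mean negative log-likelihood per token).
model_name = "gpt2"  # small placeholder; swap in whatever you want to measure
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "The quick brown fox jumps over the lazy dog."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss.
    loss = model(ids, labels=ids).loss

print(f"perplexity: {torch.exp(loss).item():.2f}")
```

In practice you’d run this over a long, fixed text corpus with a sliding window rather than a single sentence, but the core quantity is the same.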


