The Qwen3.5 models are still the best local models I’ve used, so I’m excited to see how this updated version performs.

What kind of system requirements do you need to run this new model decently?
I’m running it with the UD_Q4_K_XL quant on a 24GB VRAM 7900 XTX at ~85 tokens/s. Since it’s a MoE model, CPU inference with 32 GB of RAM should be doable, but I won’t make any promises on speed.
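For hybrid GPU/CPU setups with MoE models, llama.cpp’s tensor-override flag (`-ot`/`--override-tensor`) can pin the large expert tensors to CPU RAM while the attention and shared weights stay on the GPU. A rough sketch, assuming a working llama-server build; the model filename, context size, and tensor regex are illustrative placeholders, not verified for this specific model:

```shell
# Keep all layers on the GPU except the MoE expert FFN tensors,
# which the -ot override routes to CPU RAM.
# Filename and regex are placeholders; adjust to your quant/arch.
llama-server \
  -m ./Qwen3.5-UD_Q4_K_XL.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 16384 \
  --port 8080
```

This trades some speed for fitting a much larger model than VRAM alone would allow.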
Thanks! That sounds expensive. Hopefully 24GB VRAM gets cheaper or models get more efficient soon.
You’d want to wait until smaller models for 3.6 are released; I’d assume that’ll be soon.
Thanks! I’m hoping to run at least a 20B model. Idk if I can do that fast enough without 24GB; it seems to be the sweet spot.
Probably 24 GB VRAM and 32-64 GB RAM for minimum specs with 4-bit quantization. This is a beefy boi.
Thanks! Not for me yet. Hope to save up enough to get 24GB VRAM in near future.
How do you install it? I’m using AI Lab on Podman, which can import new models, but I haven’t been able to load any beyond the ones it comes with. The ones I tried loading just crash when I start the service.
I agree with the other commenters’ suggestions; just wanted to add that I personally run llama.cpp directly with the built-in llama-server. For a single-user server this works great, and llama.cpp is almost always at the forefront of new model support.
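In case it helps, running llama-server directly is just a build plus one command. A sketch, assuming a Linux box with cmake and a compiler installed (model path is a placeholder):

```shell
# Build llama.cpp from source and serve a local GGUF.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Exposes an OpenAI-compatible API plus a simple web UI on the given port.
./build/bin/llama-server -m /path/to/model.gguf --port 8080
```

Prebuilt binaries are also published on the project’s releases page if you’d rather skip compiling.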
Not sure about AI Lab (although I also use Podman; I prefer llama-swap), but pretty much everything uses llama.cpp under the hood, which usually takes a day or three to set up support for a new architecture. Although I seem to recall them being ready for Qwen3.5 on day one due to collaboration.
I find it’s worth giving it a week or so for the dust to settle (even if it ‘works’, the best parameters, quantization bugs, etc. take a while to shake out) unless there’s a huge motivation.
Also, benches are more like guidelines than actual rules; it’s best to run your own on your own use cases.
Okay, thanks for the help! I’ll give llama swap a shot.
AI lab
Had a quick squiz at it, and if it meets your needs, I’d just wait, unless you want to get into the lower levels of things (and run Linux). Note that llama-swap is just an inference server; you’ll need something like Open WebUI on top for chat, etc.
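For reference, a minimal llama-swap setup is a small YAML file mapping model names to the llama-server commands it spawns on demand. A hypothetical sketch; the model name and paths are placeholders, and the exact config keys should be checked against the llama-swap README:

```yaml
# llama-swap config.yaml sketch: a request for model "qwen" starts this
# llama-server process; ${PORT} is filled in by llama-swap at launch.
models:
  "qwen":
    cmd: |
      llama-server --port ${PORT}
        -m /models/Qwen3.5-UD_Q4_K_XL.gguf
        -ngl 99
```

llama-swap then swaps models in and out automatically as different model names are requested, which is handy on a single GPU.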
I have no experience with AI Lab or what formats it supports, but you can download raw model files from Hugging Face, or use a tool like Ollama that can pull them (either from its own model repos or from Hugging Face directly; there’s a copy-paste Ollama command on Hugging Face model pages). There are lots of formats, but GGUF is usually the easiest and comes as a single file.
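Both download routes, sketched with placeholder repo and quant names (substitute the actual repo you want):

```shell
# 1) Download GGUF files from a Hugging Face repo with the official CLI:
huggingface-cli download some-org/Some-Model-GGUF --include "*Q4_K_M*.gguf"

# 2) Or let Ollama pull a GGUF straight from a Hugging Face repo:
ollama run hf.co/some-org/Some-Model-GGUF:Q4_K_M
```

The `--include` filter matters for GGUF repos, since they often host every quant level and a full clone can be hundreds of gigabytes.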