The Qwen3.5 models are still the best local models I’ve used, so I’m excited to see how this updated version performs.

What kind of system requirements do you need to run this new model decently?
I’m running it with the UD_Q4_K_XL quant on a 24GB VRAM 7900 XTX at ~85 tokens/s. Since it’s a MoE model, CPU inference with 32 GB of RAM should be doable, but I won’t make any promises on speed.
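For hybrid GPU/CPU setups with MoE models, llama.cpp’s tensor-override flag (`-ot`/`--override-tensor`) can pin the large expert tensors to CPU RAM while the attention and shared weights stay on the GPU. A rough sketch, assuming a working llama-server build; the model filename, context size, and tensor regex are illustrative placeholders, not verified for this specific model:

```shell
# Keep all layers on the GPU except the MoE expert FFN tensors,
# which the -ot override routes to CPU RAM.
# Filename and regex are placeholders; adjust to your quant/arch.
llama-server \
  -m ./Qwen3.5-UD_Q4_K_XL.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 16384 \
  --port 8080
```

This trades some speed for fitting a much larger model than VRAM alone would allow.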
Thanks! That sounds expensive. Hopefully 24GB VRAM gets cheaper or models get more efficient soon.
You’d want to wait until smaller models for 3.6 are released; I’d assume that’ll be soon.
Thanks! I’m hoping to run at least a 20B model. Idk if I can do that fast enough without 24GB; it seems to be the sweet spot.
Probably 24 GB VRAM and 32-64 GB RAM for minimum specs with 4-bit quantization. This is a beefy boi.
Thanks! Not for me yet. Hope to save up enough to get 24GB VRAM in near future.
How do you install it? I’m using AI Lab on Podman, which can import new models, but I haven’t been able to load any beyond the ones it comes with. The ones I tried loading just crash when I start the service.
I agree with the other commenters’ suggestions; just wanted to add that I personally run llama.cpp directly with the built-in llama-server. For a single-user server this works great, and llama.cpp is almost always at the forefront of new model support.
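In case it helps, running llama-server directly is just a build plus one command. A sketch, assuming a Linux box with cmake and a compiler installed (model path is a placeholder):

```shell
# Build llama.cpp from source and serve a local GGUF.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Exposes an OpenAI-compatible API plus a simple web UI on the given port.
./build/bin/llama-server -m /path/to/model.gguf --port 8080
```

Prebuilt binaries are also published on the project’s releases page if you’d rather skip compiling.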
Not sure about AI Lab (although I also use Podman; I prefer llama-swap), but pretty much everything uses llama.cpp under the hood, which usually takes a day or three to set up support for a new architecture. Although I seem to recall them being ready for Qwen3.5 on day one due to collaboration.
I find it’s worth giving it a week or so for the dust to settle (even if it ‘works’, the best parameters, quantization bugs, etc. take a while to shake out) unless there’s a huge motivation.
Also, benches are more like guidelines than actual rules; it’s best to run your own on your own use cases.
Okay, thanks for the help! I’ll give llama swap a shot.
AI lab
Had a quick squiz at it, and if it meets your needs, I’d just wait, unless you want to get into the lower levels of things (and run Linux). Note that llama-swap is just an inference server; you’ll need something like Open WebUI on top for chat, etc.
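For reference, a minimal llama-swap setup is a small YAML file mapping model names to the llama-server commands it spawns on demand. A hypothetical sketch; the model name and paths are placeholders, and the exact config keys should be checked against the llama-swap README:

```yaml
# llama-swap config.yaml sketch: a request for model "qwen" starts this
# llama-server process; ${PORT} is filled in by llama-swap at launch.
models:
  "qwen":
    cmd: |
      llama-server --port ${PORT}
        -m /models/Qwen3.5-UD_Q4_K_XL.gguf
        -ngl 99
```

llama-swap then swaps models in and out automatically as different model names are requested, which is handy on a single GPU.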
I have no experience with AI Lab or what formats it supports, but you can download raw model files from Hugging Face, or use a tool like Ollama that can pull them (either from its own model repos or from Hugging Face directly; there’s a copy-paste Ollama command on Hugging Face model pages). There are lots of formats, but GGUF is usually the easiest and comes as a single file.
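Both download routes, sketched with placeholder repo and quant names (substitute the actual repo you want):

```shell
# 1) Download GGUF files from a Hugging Face repo with the official CLI:
huggingface-cli download some-org/Some-Model-GGUF --include "*Q4_K_M*.gguf"

# 2) Or let Ollama pull a GGUF straight from a Hugging Face repo:
ollama run hf.co/some-org/Some-Model-GGUF:Q4_K_M
```

The `--include` filter matters for GGUF repos, since they often host every quant level and a full clone can be hundreds of gigabytes.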