- cross-posted to:
- artificial_intel@lemmy.ml
Crossposted from https://lemmy.ml/post/47429470
I was just about to make a post asking for the best small model after finding out Qwen3-27B was way too slow, so Orthrus-Qwen3-8B looks like a pretty appealing option.
They said they’re working on Orthrus for Qwen 3.5. It’ll be amazing!
Yeah, unfortunately it seems this can’t be converted to a llama.cpp-compatible format yet, and that’s a pretty big tradeoff right now. Not surprising given how new it is, but we’ll have to wait before we can combine it with other improvements. Pretty exciting for the future, though.
My oversimplified and possibly wrong understanding: this is like speculative decoding, but instead of a separate draft model (which has to do its own prompt processing), they strap a diffusion module on top of the main model, so the draft reuses the main model’s high-quality prompt processing instead of redoing it.
The 7.8x faster claim sounds almost too good to be true, but even if we only get something like 3x in practice, that’s still a revolution for local LLMing.