My oversimplified and possibly wrong understanding: this is like speculative decoding, but instead of a separate draft model (which has to do its own prompt processing), they strap a diffusion-based drafter on top of the main model. The diffusion drafter reuses the main model's high-quality prompt-processing result, so the drafting step is nearly free.
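Rough sketch of what I mean, in Python. This is just my mental model of the speculative verify/accept loop, not the paper's actual method: the function names (`draft_head_propose`, `main_model_logits`) are made up, and I'm showing simple greedy acceptance only.

```python
import numpy as np

def speculative_step(main_model_logits, draft_head_propose, kv_cache, k=4):
    """Propose k tokens cheaply, then verify them in one main-model pass."""
    # 1. The drafter proposes k tokens from the main model's cached state
    #    (no separate prompt processing, unlike a standalone draft model).
    draft_tokens = draft_head_propose(kv_cache, num_tokens=k)

    # 2. One batched forward pass of the main model scores all k positions.
    logits = main_model_logits(kv_cache, draft_tokens)  # shape (k, vocab)

    # 3. Accept draft tokens until the first disagreement with the main model.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if int(np.argmax(logits[i])) == tok:
            accepted.append(tok)
        else:
            # Replace the first mismatch with the main model's own pick.
            accepted.append(int(np.argmax(logits[i])))
            break
    return accepted
```

If most drafted tokens get accepted, you generate several tokens for roughly the cost of one main-model forward pass, which is where the claimed speedup would come from.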
The 7.8x faster claim sounds almost too good to be true. But even if it only works out to ~3x in practice, that's still a huge deal for local LLM inference.