My oversimplified and possibly wrong understanding: this is like speculative decoding, but instead of a separate draft model (which has to do its own prompt processing), they strap a diffusion-based drafter on top of the main model. The diffusion drafter reuses the main model's high-quality prompt-processing result, so the drafting step is nearly free.
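Rough sketch of what I mean, in Python. This is just my mental model of the speculative verify/accept loop, not the paper's actual method: the function names (`draft_head_propose`, `main_model_logits`) are made up, and I'm showing simple greedy acceptance only.

```python
import numpy as np

def speculative_step(main_model_logits, draft_head_propose, kv_cache, k=4):
    """Propose k tokens cheaply, then verify them in one main-model pass."""
    # 1. The drafter proposes k tokens from the main model's cached state
    #    (no separate prompt processing, unlike a standalone draft model).
    draft_tokens = draft_head_propose(kv_cache, num_tokens=k)

    # 2. One batched forward pass of the main model scores all k positions.
    logits = main_model_logits(kv_cache, draft_tokens)  # shape (k, vocab)

    # 3. Accept draft tokens until the first disagreement with the main model.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if int(np.argmax(logits[i])) == tok:
            accepted.append(tok)
        else:
            # Replace the first mismatch with the main model's own pick.
            accepted.append(int(np.argmax(logits[i])))
            break
    return accepted
```

If most drafted tokens get accepted, you generate several tokens for roughly the cost of one main-model forward pass, which is where the claimed speedup would come from.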
The 7.8x faster claim sounds almost too good to be true. But even if it only works out to ~3x in practice, that's still a huge deal for local LLM inference.