• BB84@mander.xyzOP
    15 hours ago

    My oversimplified and possibly wrong understanding: this is like speculative decoding, but instead of a separate draft model (which has to do its own prompt processing), they strap a diffusion head on top of the main model. The diffusion draft reuses the main model's high-quality prompt-processing result.
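    For anyone unfamiliar with the draft-and-verify idea this builds on, here's a toy sketch of the speculative-decoding loop (not the paper's actual method; `draft_propose` and `target_next` are made-up stand-ins for a cheap drafter and the main model):

    ```python
    import random

    random.seed(0)

    def draft_propose(prefix, k=4):
        # cheap drafter: guesses the next k tokens (here: random picks)
        return [random.choice("abc") for _ in range(k)]

    def target_next(prefix):
        # expensive main model: the "correct" next token (toy: repeats a-b-c)
        return "abc"[len(prefix) % 3]

    def speculative_step(prefix, k=4):
        """One round: the drafter proposes k tokens, the main model verifies them
        (in a real system, all k in one parallel forward pass). We keep the
        longest matching prefix, then the main model contributes one more token."""
        proposal = draft_propose(prefix, k)
        accepted = []
        for tok in proposal:
            if tok == target_next(prefix + "".join(accepted)):
                accepted.append(tok)
            else:
                break
        accepted.append(target_next(prefix + "".join(accepted)))
        return accepted

    out = ""
    while len(out) < 12:
        out += "".join(speculative_step(out))
    print(out)
    ```

    The output always matches what the main model alone would produce; the speedup comes from verifying several drafted tokens per expensive forward pass instead of generating one token at a time.
    
    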

    The 7.8x faster claim sounds almost too good to be true. But even if we only get something like 3x in practice, this is still a huge revolution in localLLMing.