• tristynalxander@mander.xyz · 13 hours ago

    I was just about to make a post asking for the best small model after finding out Qwen3-27B was way too slow, so Orthrus-Qwen3-8B looks like a pretty appealing option.

    • BB84@mander.xyz (OP) · edited · 12 hours ago

      They said they’re working on Orthrus for Qwen 3.5. It’ll be amazing!

      • tristynalxander@mander.xyz · 11 hours ago

        Yeah, unfortunately it seems this can’t be converted to a llama.cpp-compatible format yet, and that’s a pretty big tradeoff right now. Not surprising given how new it is, but we’ll have to wait before we can combine it with other improvements. Pretty exciting for the future, though.

  • BB84@mander.xyz (OP) · 15 hours ago

    My oversimplified and possibly wrong understanding: this is like speculative decoding, but instead of a separate draft model (which has to do its own prompt processing), they use some diffusion thing strapped on top of the main model. The diffusion drafter reuses the main model’s high-quality prompt-processing results.
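
    If it helps picture it, here’s a toy sketch of the generic draft-then-verify loop that speculative decoding is built on (greedy variant). To be clear, this is my own illustration, not their code: `TinyLM` and everything else in it are made-up stand-ins, and the actual diffusion drafter is much more involved.

    ```python
    import numpy as np

    # Toy illustration of one speculative-decoding step (greedy variant).
    # Everything here is a made-up stand-in: TinyLM plays both the big model
    # and the drafter, and the real method's diffusion drafter is not shown;
    # only the generic "draft cheaply, verify in one big pass" idea is.

    VOCAB = 100

    class TinyLM:
        """Fake 'model': a fixed next-token score table, so runs are deterministic."""
        def __init__(self, seed):
            rng = np.random.default_rng(seed)
            self.table = rng.random((VOCAB, VOCAB))  # row t = scores for token after t

        def forward(self, tokens):
            # One score row per position; row j predicts the token at position j+1.
            return self.table[np.asarray(tokens) % VOCAB]

    def speculative_step(main, drafter, tokens, k=4):
        # 1. The drafter proposes k tokens greedily. This is the cheap part;
        #    in the paper's setup the drafter sits on top of the main model
        #    and reuses its prompt processing instead of redoing it.
        ctx = list(tokens)
        for _ in range(k):
            ctx.append(int(drafter.forward(ctx)[-1].argmax()))
        draft = ctx[len(tokens):]

        # 2. A single main-model pass scores all k candidates at once.
        #    That's the speedup: up to k accepted tokens per expensive pass.
        logits = main.forward(tokens + draft)

        # 3. Accept the longest prefix the main model agrees with. The first
        #    disagreement is replaced by the main model's own pick, so the
        #    result is identical to plain greedy decoding with the main model.
        out = list(tokens)
        for i, t in enumerate(draft):
            verified = int(logits[len(tokens) + i - 1].argmax())
            out.append(verified)
            if verified != t:
                break
        return out

    # Identical toy models agree perfectly, so all k drafts get accepted.
    print(speculative_step(TinyLM(0), TinyLM(0), [1, 2, 3], k=4))
    ```

    The key bit is step 2: you pay for a single main-model forward pass but can accept up to k tokens from it, and the output is exactly what plain greedy decoding would have produced.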

    The 7.8x faster claim sounds almost too good to be true. But even if we get something like 3x, this is still a huge revolution in localLLMing.