• TheMightyCat@ani.social
    16 hours ago

    I would suggest trying exllamav3 once. I have no idea what kind of black magic they use, but it's very memory efficient.

    I can't load Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 with even a 16K context using vLLM,

    but using exllamav3 I can SOMEHOW load ArtusDev/Qwen_Qwen3-Coder-30B-A3B-Instruct-EXL3:8.0bpw_H8 at its full context of 262,144 tokens with 2 GiB still to spare.

    I really feel like this is too good to be true and I'm doing something wrong, but it just works, so I don't know.
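For reference, a back-of-the-envelope KV-cache calculation suggests why cache handling matters so much at 262,144 context. The layer/head numbers below are assumptions about the model's architecture, not taken from the thread or the model card; verify against the model's config.json before trusting the exact figures:

```python
# Rough KV-cache size estimate. LAYERS, KV_HEADS, and HEAD_DIM are
# assumed values for illustration -- check the model's config.json.
SEQ_LEN = 262_144   # full context length from the thread
LAYERS = 48         # assumed number of transformer layers
KV_HEADS = 4        # assumed GQA key/value heads
HEAD_DIM = 128      # assumed dimension per head

# K and V: 2 cached tensors per layer, one element per head-dim slot per token.
elems = 2 * LAYERS * KV_HEADS * HEAD_DIM * SEQ_LEN

for name, bytes_per_elem in [("FP16", 2.0), ("FP8", 1.0), ("Q4", 0.5)]:
    gib = elems * bytes_per_elem / 2**30
    print(f"{name} KV cache: {gib:.1f} GiB")
```

Under these assumptions, an FP16 KV cache alone would be about 24 GiB at full context, while a 4-bit quantized cache (which the exllama family of backends supports) would be around 6 GiB, which could explain how the full context fits without anything too magical going on.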

    • ffhein@lemmy.world
      5 hours ago

      I guess there's some automatic VRAM paging going on. How many tokens per second do you get while generating?