I would suggest trying exllamav3 once; I have no idea what kind of black magic they use, but it's very memory efficient.
I can't load Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 with even 16K context using vLLM,
but using exllamav3 I can SOMEHOW load ArtusDev/Qwen_Qwen3-Coder-30B-A3B-Instruct-EXL3:8.0bpw_H8 at its full context of 262,144 tokens with still 2 GiB to spare.
I really feel like this is too good to be true and I'm doing something wrong, but it just works, so I don't know.
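For anyone who wants to reproduce the vLLM side of the comparison, it's roughly the snippet below, using the standard vLLM offline API. The model name and the 16K figure are from my setup above; the `gpu_memory_utilization` value is just an assumption, not something special. KV-cache VRAM scales with `max_model_len`, which is what blows the budget at longer contexts.

```python
# Minimal sketch of the failing vLLM config from the post.
# Only the model name and 16K context come from the post itself;
# gpu_memory_utilization=0.95 is an assumed setting (vLLM defaults to 0.9).
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    max_model_len=16384,  # the 16K figure; KV-cache VRAM grows with this
    gpu_memory_utilization=0.95,
)

# Quick smoke test once it loads.
print(llm.generate("Hello")[0].outputs[0].text)
```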
I guess there’s some automatic VRAM paging going on. How many tokens per second do you get while generating?
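If you want a quick number, one way is to time a streaming request against the OpenAI-compatible endpoint (TabbyAPI is the usual server front end for exllamav3). This is only a sketch: the port, API key, and served model name below are guesses for your setup, and counting one token per streamed chunk is an approximation.

```python
# Rough tokens-per-second measurement against an OpenAI-compatible server.
# Assumptions: server at localhost:5000 (TabbyAPI's usual port) and the
# served model name below; adjust both for your setup.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="none")

start = time.perf_counter()
tokens = 0
stream = client.chat.completions.create(
    model="Qwen3-Coder-30B-A3B-Instruct-EXL3",  # hypothetical served name
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries roughly one token of output.
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1
elapsed = time.perf_counter() - start

print(f"~{tokens / elapsed:.1f} tok/s over {tokens} tokens")
```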