All the runtimes except the Intel ones are llama.cpp Q4_K_M quants, so the Ampere ones aren’t anything special.
…The Intel ones kinda are, though. They actually ship runtimes for CPU, GPU, and NPU, and AFAIK the CPU one may be able to use AMX if you’re on a server CPU.
It’s still not great for a lot of reasons, but one could do worse.
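If you want to sanity-check the AMX part on your own machine, the CPU advertises it through feature flags. Here’s a minimal sketch (my addition, not from the runtimes themselves) that just reads the Linux `/proc/cpuinfo` flags, looking for `amx_tile`, `amx_int8`, and `amx_bf16`, which server-class Intel chips (Sapphire Rapids and later) expose:

```python
# Minimal sketch: check whether this CPU advertises AMX feature flags on Linux.
# These flag names come from /proc/cpuinfo; whether a given runtime actually
# uses AMX is a separate question.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

amx_flags = {"amx_tile", "amx_int8", "amx_bf16"}
present = amx_flags & flags
print("AMX:", ", ".join(sorted(present)) if present else "not detected")
```

If nothing shows up there, the CPU runtime is going to fall back to AVX-512/AVX2 paths regardless of what the vendor claims.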
Does this mean they optimized for CPU instead of GPU? I doubt they target Intel GPUs, tbh, so it sounds like they really did optimize for CPU… interesting!