I wrote about running LLMs locally on the Intel Arc Pro B60 GPU previously, where I used Intel’s official software stack (llm-scaler / vLLM).
This time I focus on the impactful open-source project llama.cpp:
https://marvin.damschen.net/post/intel-arc-llama.cpp/
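For anyone wanting to try it on their own Arc card: a minimal sketch of how llama.cpp can be built with its Vulkan backend and benchmarked (model path is a placeholder; check the llama.cpp build docs for your exact setup):

```shell
# Build llama.cpp with the Vulkan backend enabled
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# Benchmark a GGUF model with all layers offloaded to the GPU
# (replace model.gguf with your actual model file)
./build/bin/llama-bench -m model.gguf -ngl 99
```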
Taking this opportunity to test federation with lemmy 😊:
@localllama


I had not tried koboldcpp before, but gave it a try now.
I am not sure how to run the OpenCL backend; the wiki says “Vulkan is a newer option that provides a good balance of speed and utility compared to the OpenCL backend”, but the CLI arguments do not mention it.
For GLM-4.7-Flash, koboldcpp’s Vulkan backend seems marginally slower than llama.cpp’s: 19.8 tps with FlashAttention, 29.8 without.
koboldcpp is a fork of llama.cpp, so maybe some Vulkan optimisations have not made it there yet.
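For reference, this is roughly how I would invoke koboldcpp with Vulkan (model path is a placeholder; flag names are from koboldcpp’s help output, so double-check against your version):

```shell
# Run koboldcpp with the Vulkan backend, all layers on the GPU
# (replace model.gguf with your actual model file)
python koboldcpp.py --model model.gguf --usevulkan --gpulayers 99

# FlashAttention can be toggled with --flashattention; in my runs
# it was slower on this card, so leaving it off may be faster
```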
@localllama