I dunno about better, but different. The API and model management it offers has been nice when building things that want to use different-sized models for different tasks, since it will manage the given resources and schedule runners on GPU/CPU. My hardware combo is Intel/Nvidia, so I've not had to futz with getting AMD stuff running. If you don't need any of that, and llama.cpp works for you, no reason to use ollama
That is something I wish was easier with llama.cpp
I’m using llama-swap for that, but you have to manually specify your models in a YAML config; then you can set up groups of models that can run at the same time.
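Roughly what mine looks like (from memory, so the model names/paths are made up and the exact keys may differ a bit from the current llama-swap README):

    models:
      "qwen-small":
        # llama-swap substitutes ${PORT} with the port it assigns to this runner
        cmd: llama-server --port ${PORT} -m /models/qwen2.5-3b-instruct-q4_k_m.gguf
      "llama-big":
        cmd: llama-server --port ${PORT} -m /models/llama-3.1-70b-instruct-q4_k_m.gguf

    groups:
      "concurrent":
        swap: false   # members of this group can stay loaded at the same time
        members:
          - "qwen-small"
          - "llama-big"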
I also have to manually download models, which is more cumbersome.
Yeah for sure. Ollama makes all of this way easier, including downloading models at runtime (assuming your query can wait that long, lol). I’ve been very pleased so far with the functionality it gives me. That said, if I was building a very tight integration or a desktop app, I would probably use llama.cpp directly. It just depends on the use case and scale. I do wish they (EDIT: ollama) would be better netizens and upstream their changes to llama.cpp. Also, it is unfortunate that at some point ollama will get enshittified (no more easy model downloads from their library without an account, etc.) if only because they are building a company around it. So I am really thankful that llama.cpp continues to be such a foundational piece of FOSS LLM infra.
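Re the runtime downloads: a model that isn’t on disk yet just gets pulled the first time it’s asked for (the model name below is only an example):

    ollama run llama3.2 "write a haiku about GPUs"   # pulls the model first if it isn't local
    # same thing over the HTTP API it serves on localhost:11434
    curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "write a haiku about GPUs"}'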
Doesn’t llama.cpp have a -hf flag to download models from huggingface instead of doing it manually?
It does, but I’ve never tried it; I just use the hf CLI.
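For reference, roughly the two approaches side by side (repo and file names here are just examples, not something I’ve verified):

    # llama.cpp pulling a GGUF straight from Hugging Face at startup via -hf
    llama-server -hf bartowski/Qwen2.5-7B-Instruct-GGUF

    # what I do instead: download with the Hugging Face CLI, then point llama.cpp at the file
    # (older huggingface_hub versions call this command huggingface-cli download)
    hf download bartowski/Qwen2.5-7B-Instruct-GGUF --include "*Q4_K_M*" --local-dir ./models
    llama-server -m ./models/Qwen2.5-7B-Instruct-Q4_K_M.gguf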