• mierdabird@lemmy.dbzer0.com · 3 days ago

    The update is giving me a performance uplift on my 3060 that’s WAY more than 7%, using qwen2.5-coder:14b-instruct-q5_K_M. Here’s the result of rerunning the exact same prompt before and after:
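
    (If anyone wants to run the same kind of before/after comparison, ollama run with --verbose prints prompt-eval and generation speed in tokens/s after each reply; same model tag as above, prompt of your choice:)

        # --verbose reports prompt eval and generation rates after each response
        ollama run qwen2.5-coder:14b-instruct-q5_K_M --verbose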

  • panda_abyss@lemmy.ca · 4 days ago

    Does ollama work better than vanilla llama.cpp?

    I’ve just migrated from LM Server to llama.cpp to try and liberate my stack a bit, and also heard Ollama gave up support for AMD chips.

    Edit: fixed very bad autocorrect

    • afk_strats@lemmy.world · 4 days ago

      It depends on what you mean.

      To me, Ollama feels like it’s designed to be a developer-first, local LLM server with just enough functionality to get you to a POC, from where you’re intended to use someone else’s compute resources.

      llama.cpp actually supports more backends, with continuous performance improvements and support for more models.
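
      (Both also speak the same OpenAI-style chat endpoint, so a POC can move between them with little more than a base-URL change. Rough sketch; the ports are the defaults and the model names are placeholders:)

        # Ollama's built-in server (default port 11434)
        curl http://localhost:11434/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{"model": "qwen2.5-coder:14b", "messages": [{"role": "user", "content": "hello"}]}'

        # llama.cpp's llama-server (default port 8080) takes the same request
        curl http://localhost:8080/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{"model": "qwen2.5-coder", "messages": [{"role": "user", "content": "hello"}]}'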

    • vividspecter@aussie.zone · 4 days ago

      Ollama uses ROCm whereas llama.cpp uses Vulkan compute. Which one will perform better depends on many factors, but Vulkan compute should be easier to set up.
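
      (For context, llama.cpp’s Vulkan backend is a build-time option; roughly, assuming the Vulkan SDK and drivers are already installed:)

        cmake -B build -DGGML_VULKAN=ON
        cmake --build build --config Release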

      • afk_strats@lemmy.world · 4 days ago

        Ollama does use ROCm; however, so does llama.cpp. Vulkan happens to be another available backend supported by llama.cpp.

        GitHub: llama.cpp Supported Backends

        There are old PRs which attempted to bring Vulkan support to Ollama - a logical and helpful move, given that the Ollama engine is based on llama.cpp - but the Ollama maintainers weren’t interested.

        As for performance vs ROCm, it does fine. Against CUDA, it also does well unless you’re in a multi-GPU setup. Its magic trick is compatibility. Pretty much everything runs Vulkan, and Vulkan is compatible across generations of cards, architectures AND vendors. That’s how I’m running a single PC with Nvidia and AMD cards together.
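
        (Concretely, on a Vulkan build that mixed-vendor setup is just layer offload plus a split ratio; the model path and ratio below are made-up examples:)

          # offload everything and split roughly 60/40 across the two cards
          llama-server -m /models/qwen2.5-coder-14b-q5_k_m.gguf \
            -ngl 99 --tensor-split 60,40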

        • hendrik@palaver.p3x.de · 4 days ago

          I think llama.cpp merged ROCm support in 2023 already. It’s called HIP on their Readme, but I’m not super educated on all the acronyms and compute frameworks and instruction sets.

          • afk_strats@lemmy.world · 4 days ago

            ROCm is a software stack which includes a bunch of SDKs and APIs.

            HIP is a subset of ROCm which lets you program AMD GPUs with a focus on portability from Nvidia’s CUDA.
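
            (In llama.cpp terms that’s the backend you enable at build time, something like the lines below; the GPU target is only an example, so check the current build docs for your card:)

              cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100
              cmake --build build --config Release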

    • okwhateverdude@lemmy.world · 4 days ago

      I dunno about better, but different. The API and model management that it offers has been nice when building things that want to use different-sized models for different tasks, since it will manage the given resources and schedule runners on GPU/CPU. My hardware combo is Intel/Nvidia so I’ve not had to futz with getting AMD stuff running. If you don’t need any of that, and llama.cpp works for you, there’s no reason to use Ollama.
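
      (Rough idea of what that looks like against Ollama’s native API; the model tags are examples, and keep_alive controls how long a model stays resident before it can be swapped out:)

        # small model for a quick task...
        curl http://localhost:11434/api/generate \
          -d '{"model": "qwen2.5-coder:7b", "prompt": "Write a one-line docstring for a sort function.", "keep_alive": "5m"}'

        # ...bigger model for a heavier one; Ollama loads/unloads to fit the GPU
        curl http://localhost:11434/api/generate \
          -d '{"model": "qwen2.5-coder:14b", "prompt": "Refactor this function: ...", "keep_alive": "5m"}'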

      • panda_abyss@lemmy.ca · 4 days ago

        That is something I wish was easier with llama.cpp

        I’m using llama-swap for that, but you have to manually specify your models in a YAML config; then you can set up groups of models that can run at the same time.

        I also have to manually download models, which is a bit more cumbersome.
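
        (For anyone who hasn’t tried it, the config looks roughly like this; the paths and names are examples and the exact keys are documented in the llama-swap README:)

          models:
            "qwen-coder-14b":
              cmd: llama-server --port ${PORT} -m /models/qwen2.5-coder-14b-q5_k_m.gguf -ngl 99
            "qwen-coder-3b":
              cmd: llama-server --port ${PORT} -m /models/qwen2.5-coder-3b-q4_k_m.gguf -ngl 99

          groups:
            coding:
              swap: false
              members:
                - "qwen-coder-14b"
                - "qwen-coder-3b"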

        • okwhateverdude@lemmy.world · 4 days ago

          Yeah for sure. Ollama makes all of this way easier, including downloading models at runtime (assuming your query can wait that long, lol). I’ve been very pleased so far with the functionality it gives me. That said, if I was building a very tight integration or a desktop app, I would probably use llama.cpp directly. It just depends on the usecase and scale. I do wish they (EDIT: ollama) would be better netizens and upstream their changes to llama.cpp. Also, it is unfortunate that at some point ollama will get enshittified (no more easy model downloads from their library without an account, etc) if only because they are building a company around it. So I am really thankful that llama.cpp continues to be such a foundational piece for FOSS LLM infra.
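
          (The runtime download part really is just the registry tag; you can pull ahead of time, or let ollama run fetch it on first use:)

            ollama pull qwen2.5-coder:14b-instruct-q5_K_M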

        • afaix@lemmy.world · 3 days ago

          Doesn’t llama.cpp have a -hf flag to download models from huggingface instead of doing it manually?
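
          Something like this, if I remember right (the repo and quant tag are just an example; it downloads and caches the GGUF on first run):

            llama-server -hf bartowski/Qwen2.5-Coder-14B-Instruct-GGUF:Q5_K_M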