When I first got into local LLMs nearly three years ago, in mid-2023, the frontier closed models were of course impressively capable.

I then tried my hand at running 7B-size local models, primarily one called Zephyr-7B (what happened to these models?? Dolphin, anyone??), on my gaming PC with an 8GB AMD RX 580 GPU. Fair to say it was just a curiosity exercise (in terms of model performance).

Fast forward to this month, and I'm revisiting local LLMs. (Although I no longer have the gaming PC; cost-of-living crisis, anyone? 😫)

And the ~30B-class models now look more than sufficient. #Qwen has taken the helm in that class. They're still quite expensive to set up locally, although within grasp.

I'm rooting for the edge-computing models now - the ~2B-size models. Thanks to their low footprint, they're practical for many people to run 24/7 on an SBC at home.
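
A minimal sketch of what I mean, assuming a llama.cpp build and a quantized ~2B GGUF (the model file, thread count and port here are just placeholders, not a specific recommendation):

    # hypothetical example: serve a small quantized model 24/7 on an SBC's CPU cores
    llama-server -m qwen2.5-1.5b-instruct-q4_k_m.gguf -c 4096 -t 4 --host 0.0.0.0 --port 8080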

But these edge models are today's ‘curiosity category’.

  • SuspciousCarrot78@lemmy.world

    As I recall, there are some new tricks that allow models of up to 8B to run on a Raspberry Pi 5 at around 10-15 tokens per second with --ctx 32768. I haven't kept across it because I don't visit Reddit, but that was my last recollection. If you fossick over there, you may be able to find it. Or use kagi.com to find it, heh.

    One of the goals of the harness that I built was to reduce memory pressure, particularly the KV cache, so that you could run larger models on more constrained hardware, but I’m not here to spruik myself. I’m just letting you know that there are ways and means to get it done on SBCs.
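
    If you just want the stock llama.cpp route rather than my harness, a minimal sketch of trimming KV-cache memory is to quantize the cache (flag spellings vary a bit between llama.cpp versions, and the model path below is only a placeholder):

        # hypothetical example: a q8_0 KV cache roughly halves cache memory vs f16
        # flash attention (-fa) is typically needed for the quantized V cache
        llama-cli -m model.gguf -c 32768 -t 4 -fa --cache-type-k q8_0 --cache-type-v q8_0 -p "hello" -n 64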

    EDIT: I “kagi’ed” it for you. Here

    | model           |     size | params | backend | threads | ngl | test  |          t/s |
    | --------------- | -------: | -----: | ------- | ------: | --: | ----- | -----------: |
    | qwen3.5 9B Q8_0 | 8.86 GiB | 8.95 B | CPU     |       4 |   0 | pp512 | 18.20 ± 0.23 |
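
    A row like that is what llama.cpp's llama-bench prints; as a rough sketch (the model filename and thread count here are assumptions, not from that result), an equivalent run would look like:

        # hypothetical llama-bench invocation for a prompt-processing (pp512) test on 4 CPU threads
        llama-bench -m qwen3.5-9b-q8_0.gguf -t 4 -p 512 -n 0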