When I first got into local LLMs nearly 3 years ago, in mid-2023, the frontier closed models were of course impressively capable.
I then tried my hand at running 7B-size local models, primarily one called Zephyr-7b (what happened to these models?? Dolphin anyone??), on my gaming PC with an 8GB AMD RX580 GPU. Fair to say it was just a curiosity exercise (in terms of model performance).
Fast forward to this month, and I revisited local LLMs. (Although I no longer have the gaming PC, cost-of-living crisis anyone 😫)
And the 31B-size models look more than sufficient. #Qwen has taken the helm in this size class. These are still very expensive to set up locally, although within grasp.
I’m rooting for the edge-computing models now - the ~2B-size models. Due to their low footprint, they are practical for many people to run on an SBC 24/7 at home.
But these edge models are the ‘curiosity category’ now.


There are several models at 3B or smaller that are surprisingly good. If you’re talking about a general chat model, you can get a lot of bang for your buck with Qwen3-1.7b. Granite-3B is also quite good (and obedient at tool calls, IIRC).
My everyday driver is an ablit (abliterated variant) of Qwen3-4B 2507 Instruct called Qwen HIVEMIND. I find it excellent…but again…black magic and clever tricks.
I’ve actually been scoping out the possibility of using ECA.dev and having something cheap / cloud based (say, GPT-5.4 mini) as the “brains” and SERA-8B as the “hands”.
GPT-5.4 mini is $0.75/M input tokens and $4.50/M output tokens…and if it marries up with SERA-8B…well…that could go a long way indeed.
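To put rough numbers on "a long way", here is a back-of-the-envelope sketch in Python. The two prices are the ones quoted above; the workload (200 calls a day, 2,000 input and 300 output tokens per call) is entirely made up for illustration, and `estimate_cost_usd` is just a helper name I picked, not part of any SDK.

```python
# Prices quoted above for GPT-5.4 mini, in USD per million tokens.
INPUT_USD_PER_M = 0.75
OUTPUT_USD_PER_M = 4.50


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in USD for the given token counts."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical workload (assumed numbers, not measurements):
# 200 "brains" calls a day, each with 2,000 input and 300 output tokens.
calls_per_day = 200
daily = estimate_cost_usd(calls_per_day * 2_000, calls_per_day * 300)
print(f"~${daily:.2f}/day, ~${daily * 30:.2f}/month")
```

Under those assumed numbers it comes out to well under a dollar a day, since the cloud model only does the planning while the local model does the token-heavy work. Swap in your own call counts and token sizes to see where it lands for you.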
Small models can be made useful as part of a swarm architecture…but that’s not an apples-to-apples comparison.