When I first got into local LLMs nearly 3 years ago, in mid 2023, the frontier closed models were of course impressively capable.
I then tried my hand at running 7B-size local models, primarily one called Zephyr-7B (what happened to those models?? Dolphin, anyone??), on my gaming PC with an 8GB AMD RX 580. Fair to say it was just a curiosity exercise (in terms of model performance).
Fast forward to this month: I’m revisiting local LLMs. (Although I no longer have the gaming PC; cost-of-living crisis, anyone? 😫)
And the ~32B-size models now look more than sufficient. #Qwen has taken the helm in this tier, which is still quite expensive to set up locally, though within grasp.
I’m rooting for the edge-computing models now - the ~2B-size models. Thanks to their low footprint, they’re practical for many people to run 24/7 on an SBC at home.
But these edge models are in the ‘curiosity category’ for now.


hey, thanks for your response… yeah, that’s what I meant: the 2B models aren’t usable in their current state, but they’d be more practical for everyday use if they work out…
I actually meant that the ~32B models are useful for my purposes. I don’t do full-on agentic coding, just interactive chat/prompting. For example, I make good use of it for writing Linux shell scripts (as I don’t know how to myself). Currently I use qwen3.5-flash via cloud. It’s as good as the frontier models back then, if not better…
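(If anyone’s curious, that workflow is literally one chat call. A minimal sketch, assuming your provider exposes an OpenAI-compatible endpoint for qwen3.5-flash; the base URL and key here are placeholders:)

```python
# Minimal sketch: asking a cloud model to draft a shell script.
# The base_url, api_key, and even model availability are assumptions
# about whatever provider you happen to use.
from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_KEY")

reply = client.chat.completions.create(
    model="qwen3.5-flash",
    messages=[{
        "role": "user",
        "content": "Write a commented bash script that archives ~/Documents "
                   "into a dated .tar.gz under ~/backups.",
    }],
)
print(reply.choices[0].message.content)
```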
There are several models at 3B or under that are surprisingly good. If you’re talking about a general chat model, you can get a lot of bang for your buck with Qwen3-1.7B. Granite-3B is also quite good (and obedient at tool calls, IIRC).
My everyday driver is an abliterated (‘ablit’) variant of Qwen3-4B-Instruct-2507 called Qwen HIVEMIND. I find it excellent… but again… black magic and clever tricks.
I’ve actually been scoping out the possibility of using ECA.dev and having something cheap / cloud-based (say, GPT-5.4 mini) as the “brains” and SERA-8B as the “hands”.
GPT-5.4 mini is $0.75/M input tokens and $4.50/M output tokens… and if it marries up with SERA-8B… well… that could go a long way indeed.
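(Rough, hypothetical arithmetic for scale: a session that burns 100k input and 20k output tokens on the cloud side would cost about 0.1M × $0.75/M + 0.02M × $4.50/M ≈ $0.08 + $0.09 ≈ $0.17.)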
Small models can be made useful as part of a swarm architecture… but that’s not an apples-to-apples comparison.
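To make the brains/hands idea concrete, here’s a rough sketch of the kind of loop I have in mind. To be clear, this is not ECA.dev’s actual protocol: it just assumes both models speak an OpenAI-compatible chat API (the cloud planner at its normal endpoint, SERA-8B behind a local llama.cpp/Ollama-style server), and every model name, URL, and the plain-text plan format is a placeholder.

```python
# Sketch of a "brains" (cheap cloud planner) + "hands" (local executor) loop.
# Not ECA.dev's real protocol: endpoints, model names, and the numbered
# plan format are all assumptions for illustration.
from openai import OpenAI

brains = OpenAI()  # cloud planner; reads OPENAI_API_KEY from the environment
hands = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # local server

def solve(task: str) -> str:
    # 1. The cloud model only writes the plan, so its pricey tokens stay few.
    plan = brains.chat.completions.create(
        model="gpt-5.4-mini",  # placeholder for the cheap cloud model
        messages=[
            {"role": "system",
             "content": "Break the task into short, numbered, self-contained steps."},
            {"role": "user", "content": task},
        ],
    ).choices[0].message.content

    # 2. The local 8B model grinds through each step; those tokens cost nothing.
    results = []
    for step in plan.splitlines():
        if not step.strip():
            continue
        out = hands.chat.completions.create(
            model="sera-8b",  # whatever name the local server registers
            messages=[{
                "role": "user",
                "content": f"Overall task:\n{task}\n\nCarry out this step:\n{step}",
            }],
        ).choices[0].message.content
        results.append(out)
    return "\n\n".join(results)

print(solve("Write and document a script that backs up ~/notes to a dated tarball."))
```

The economics are the whole appeal: the per-token-expensive model emits a few hundred planning tokens, and the bulk of the generation happens on hardware you already own.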
I wanted to use smaller models but then do more work on the “thinking” process. I didn’t get far, because it gets so slow on normal hardware and too expensive on dedicated hardware. Time-consuming (I’m also not a programmer) but a fun project; in the end I just decided to satisfy the privacy angle with Proton’s AI, Lumo.
Proton has AI? Damn, that’s gotta be bleeding their coffers
Probably not; the models they use all tend to be quite lightweight and inexpensive, tbh.
EDIT:
https://proton.me/support/lumo-privacy
Open-source language models
Lumo is powered by open-source large language models (LLMs) which have been optimized by Proton to give you the best answer based on the model most capable of dealing with your request. The models we’re using currently are Nemo, OpenHands 32B, OLMO 2 32B, GPT-OSS 120B, Qwen, Ernie 4.5 VL 28B, Apertus, and Kimi K2. These run exclusively on servers Proton controls so your data is never stored on a third-party platform.
Lumo’s code is open source, meaning anyone can see it’s secure and does what it claims to. We’re constantly improving Lumo with the latest models that give the best user experience.
Quite a lightweight swarm for a cloud service, barring Kimi K2.
They have been working on this. Only 3 months ago it was pretty terrible; today it’s almost on par with ChatGPT. A bit worse at RAG, slower… but good enough for normal use.
I was playing around with it a tiny amount earlier today (I use ProtonMail, so I figured why not).
I can’t tell much about it. It seems very…safety theater / personality removed.
Any idea what models they use now? I get the feeling the main brain is a 14B (based on how it responds to questions / drops nuance).