When I first got into local LLMs nearly 3 years ago, in mid 2023, the frontier closed models were of course impressively capable.

I then tried my hand at running 7B-size local models, primarily one called Zephyr-7B (what happened to these models?? Dolphin, anyone??), on my gaming PC with an 8 GB AMD RX 580 GPU. Fair to say it was just a curiosity exercise (in terms of model performance).

Fast forward to this month, and I'm revisiting local LLMs. (Although I no longer have the gaming PC; cost-of-living crisis, anyone? 😫)

And the 31B-size models look very sufficient. Qwen has taken the lead in this class, which is still quite expensive to set up locally, although within grasp.

I’m rooting for the edge-computing models now: the ~2B-size models. Thanks to their low footprint, they’re practical for many people to run 24/7 on an SBC at home.

But these edge models are in the ‘curiosity category’ now.

  • ☂️-@lemmy.ml · 16 hours ago

is it just me, or are the smaller models that fit in my VRAM very dumb?

    • SuspciousCarrot78@lemmy.world · 8 hours ago

      It’s not just you. But while they may be natively “dumb”, they can be augmented quite significantly. Even adding a simple web-search tool can help a lot.

So, there are levels of “dumb”. Some - like Qwen3-4B 2507 instruct - may not have the world knowledge of a SOTA model, but their reasoning abilities can be quite impressive. See HERE for an example of a self-made test suite. You can run something similar yourself.

      I guess it depends what you mean by “dumb” and how that affects what you’re trying to do with them. Some are dumb at tool use, some have poor world knowledge etc. You can find small models that are good at what’s important to you if you dig around. Except for coding - that’s rough. Probably the smallest stand-alone that might make you sit up and pay attention is something like Qwen2.5-Coder-14B-Instruct or FrogMini-14B-2510…but I wouldn’t trust them to go spelunking a code base.
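The tool-augmentation idea above can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: `web_search` is a stub (a real setup would wire it to SearXNG, Brave, etc.), and the JSON tool-call shape is an assumption standing in for whatever your local model emits.

```python
import json

def web_search(query: str) -> str:
    """Stub search backend; swap in a real search API."""
    return f"Top snippet for: {query}"

# Registry of tools the model is allowed to call.
TOOLS = {"web_search": web_search}

def run_tool_call(raw: str) -> str:
    """Dispatch a model-emitted JSON tool call like
    {"tool": "web_search", "arguments": {"query": "..."}}
    and return the result to feed back into the context."""
    call = json.loads(raw)
    fn = TOOLS[call["tool"]]
    return fn(**call["arguments"])

if __name__ == "__main__":
    reply = run_tool_call(
        '{"tool": "web_search", "arguments": {"query": "Qwen3 release date"}}'
    )
    print(reply)
```

The point is that the loop, not the model, supplies the knowledge: even a "dumb" small model only has to emit well-formed JSON for this to work.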

      • ☂️-@lemmy.ml · 4 hours ago

what are some other ways to make it better beyond just adding a search tool? is 16 GB VRAM sufficient for usable results?

where do you think is the best place to start down this rabbit hole?

  • fozid@feddit.uk · 14 hours ago

    For me, anything less than gpt oss 20b (a2b) is just for messing around with or for basic categorisation and basic text or data processing with highly structured prompts.
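The "basic categorisation with highly structured prompts" use case looks roughly like this in practice: constrain the model to a fixed label set, demand JSON only, and validate the reply. This is a generic sketch; `call_model` is a stub standing in for whatever local endpoint you run, and the label set is made up.

```python
import json

LABELS = ["bug", "feature", "question"]

# Highly structured prompt: fixed labels, strict output format.
PROMPT = (
    "Classify the ticket into exactly one of {labels}. "
    'Reply with JSON only: {{"label": "<label>"}}.\n\nTicket: {text}'
)

def call_model(prompt: str) -> str:
    """Stub for a local model call; returns a canned reply here."""
    return '{"label": "bug"}'

def classify(text: str) -> str:
    raw = call_model(PROMPT.format(labels=LABELS, text=text))
    label = json.loads(raw)["label"]
    # Small models drift; always validate against the allowed set.
    if label not in LABELS:
        raise ValueError(f"model returned unknown label: {label}")
    return label

if __name__ == "__main__":
    print(classify("App crashes when I tap save"))
```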

  • inconel@lemmy.ca · 14 hours ago

For small models, the Bonsai series seems to be getting the spotlight. Natively trained at 1-bit and ternary 1.58-bit, the 8B runs in ~1 GB of memory. I’m curious about local models but haven’t tried them for lack of a gaming rig; it seems they’d work well enough on a regular PC.

    • SuspciousCarrot78@lemmy.world · 8 hours ago

      I’m glad to see 1.58Bs finally starting to appear.

I got GPT to put the benchmarks side by side (for what they’re worth). Bonsai-8B seems to be cooked down from Qwen3-8B. If they can squeeze an 8B into 1 GB… then perhaps we can get a 20-30B in 4 GB soon.

      | Category | Bonsai-8B-gguf | Qwen3-4B-Instruct-2507 |
      | --- | --- | --- |
      | Base / lineage | Compressed Qwen3-8B dense architecture in 1-bit GGUF Q1_0 form (Hugging Face) | Official Qwen3 4B instruct release from Alibaba/Qwen (Hugging Face) |
      | Params | 8.19B total, ~6.95B non-embedding (Hugging Face) | 4.0B total, 3.6B non-embedding (Hugging Face) |
      | Layers / heads | 36 layers, GQA 32 Q / 8 KV (Hugging Face) | 36 layers, GQA 32 Q / 8 KV (Hugging Face) |
      | Context length | 65,536 tokens (Hugging Face) | 262,144 tokens native (Hugging Face) |
      | Format | GGUF Q1_0, end-to-end 1-bit weights (Hugging Face) | Standard full model release; quantized variants exist elsewhere, but the official card here is the base instruct model (Hugging Face) |
      | Deployed size / memory | 1.15 GB deployed; Prism says 14.2x smaller than FP16 (Hugging Face) | Card does not list one deployed size on-page; it is a normal 4B model, so materially larger than Bonsai in practice (Hugging Face) |
      | Stated goal | Extreme compression, speed, and efficiency while staying “competitive” with 8B-class models (Hugging Face) | Strong general-purpose instruct model with gains in reasoning, coding, writing, tool use, and long-context handling (Hugging Face) |
      | Published benchmark bundle | EvalScope bundle across MMLU-R, MuSR, GSM8K, HE+, IFEval, BFCL with 70.5 avg (Hugging Face) | Broader Qwen benchmark suite including MMLU-Pro, GPQA, AIME25, ZebraLogic, LiveBench, LiveCodeBench, IFEval, Arena-Hard v2, BFCL-v3, plus agent/multilingual tasks (Hugging Face) |
      | Knowledge benchmark | MMLU-R 65.7 (Hugging Face) | MMLU-Pro 69.6, MMLU-Redux 84.2, GPQA 62.0, SuperGPQA 42.8 (Hugging Face) |
      | Reasoning benchmark | MuSR 50, GSM8K 88 (Hugging Face) | AIME25 47.4, HMMT25 31.0, ZebraLogic 80.2, LiveBench 63.0 (Hugging Face) |
      | Coding benchmark | HumanEval+ 73.8 (Hugging Face) | LiveCodeBench 35.1, MultiPL-E 76.8, Aider-Polyglot 12.9 (Hugging Face) |
      | Instruction following / alignment | IFEval 79.8 (Hugging Face) | IFEval 83.4, Arena-Hard v2 43.4, Creative Writing v3 83.5, WritingBench 83.4 (Hugging Face) |
      | Tool / agent metrics | BFCL 65.7 (Hugging Face) | BFCL-v3 61.9, TAU1-Retail 48.7, TAU1-Airline 32.0, TAU2-Retail 40.4 (Hugging Face) |
      | Speed claims | Prism reports 368 tok/s on RTX 4090 vs 59 tok/s FP16 baseline, plus strong gains on other hardware (Hugging Face) | The model card here emphasizes capability and deployment support, not a comparable on-page throughput table (Hugging Face) |
      | Energy claims | Prism reports 4.1x better energy/token on RTX 4090 and 5.1x on M4 Pro vs FP16 baselines (Hugging Face) | No equivalent on-page energy table in this card (Hugging Face) |
      | Best practical use | Tiny footprint, fast local inference, “how is this running here?” deployments (Hugging Face) | Better bet for raw reasoning, writing, long context, and general instruction-following (Hugging Face) |
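The "8B in ~1 GB" figure in the table checks out on the back of an envelope: weight storage scales linearly with bits per weight, so 8.19B parameters at 1 bit is just under 1 GiB before runtime overhead. A quick sketch (the 1.15 GB deployed size from the table includes extras like embeddings and metadata, which this ignores):

```python
# Weight-only footprint at different quantization widths for an 8.19B model.
params = 8.19e9

for bits in (16, 4, 1.58, 1):
    gib = params * bits / 8 / 2**30  # bits -> bytes -> GiB
    print(f"{bits:>5} bits/weight -> {gib:6.2f} GiB")
```

At 16 bits this lands around 15.3 GiB, which is consistent with the table's "14.2x smaller than FP16" claim once overhead is included.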
  • NoiseColor@lemmy.world · 19 hours ago

    For what stuff do you want to use them? I don’t think they come remotely close to today’s commercial models. Maybe for a specific purpose?

    • ntn888@lemmy.ml (OP) · 19 hours ago

hey, thanks for your response… yeah, that’s what I meant: the 2B models aren’t usable in their current state, but they’d be more practical for everyday use if they work out…

I actually meant that the 31B models are useful for my purpose. I don’t do full-on agentic coding, just interactive chat/prompting. For example, I make good use of it for writing Linux shell scripts (as I don’t know how to myself). Currently I use qwen3.5-flash via the cloud. It’s as good as the frontier models back then, if not better…

      • SuspciousCarrot78@lemmy.world · 14 hours ago

        There are several 3B or less models that are surprisingly good. If you’re talking about a general chat model, you can get a lot of bang for your buck with Qwen3-1.7b. Granite-3B is also quite good (and obedient at tool calls, IIRC).

My everyday driver is an abliterated (“ablit”) build of Qwen3-4B 2507 instruct called Qwen HIVEMIND. I find it excellent…but again…black magic and clever tricks.

        I’ve actually been scoping out the possibility of using ECA.dev and having something cheap / cloud based (say, GPT-5.4 mini) as the “brains” and SERA-8B as the “hands”.

GPT-5.4 mini is $0.75/M input tokens and $4.50/M output tokens…and if it marries up with SERA-8B…well…that could go a long way indeed.

Small models can be made useful as part of a swarm architecture…but that’s not an apples-to-apples comparison.
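The cloud-"brains" / local-"hands" split above can be sketched as a simple router. Everything here is illustrative: the keyword heuristic is made up, and the two branch names stand in for real endpoints (a paid API for planning, a local 8B for rote execution).

```python
# Route planning-heavy prompts to an expensive cloud model and
# mechanical tasks to a cheap local one.

PLANNING_HINTS = ("plan", "design", "debug", "architect")

def route(task: str) -> str:
    """Return which tier should handle the task."""
    needs_planning = any(k in task.lower() for k in PLANNING_HINTS)
    return "cloud-brain" if needs_planning else "local-hands"

if __name__ == "__main__":
    print(route("plan a refactor of the auth module"))    # -> cloud-brain
    print(route("rename variable foo to bar in main.py")) # -> local-hands
```

In a real swarm the router itself could be a small model; the economics work because most tokens flow through the cheap branch.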

      • NoiseColor@lemmy.world · 18 hours ago

I wanted to use smaller models, but then do more work on the “thinking” process. I didn’t get far, because it gets so slow on normal hardware and too expensive on dedicated hardware. Time-consuming (I’m also not a programmer) but a fun project; in the end I just decided to satisfy the privacy angle with Proton’s AI, Lumo.

        • inari@piefed.zip · 16 hours ago

          Proton has AI? Damn, that’s gotta be bleeding their coffers

          • SuspciousCarrot78@lemmy.world · 14 hours ago

            Probably not; the models they use all tend to be quite lightweight and inexpensive, tbh.

            EDIT:
            https://proton.me/support/lumo-privacy


            Open-source language models

            Lumo is powered by open-source large language models (LLMs) which have been optimized by Proton to give you the best answer based on the model most capable of dealing with your request. The models we’re using currently are Nemo, OpenHands 32B, OLMO 2 32B, GPT-OSS 120B, Qwen, Ernie 4.5 VL 28B, Apertus, and Kimi K2. These run exclusively on servers Proton controls so your data is never stored on a third-party platform.

            Lumo’s code is open source, meaning anyone can see it’s secure and does what it claims to. We’re constantly improving Lumo with the latest models that give the best user experience.


Quite a lightweight swarm for a cloud service, barring Kimi K2.

            • NoiseColor@lemmy.world · 13 hours ago

They have been working on this. Only 3 months ago it was pretty terrible. Today it’s almost on par with ChatGPT. A bit worse on RAG, slower… good enough for normal use.

              • SuspciousCarrot78@lemmy.world · 8 hours ago

I was playing around with it a bit earlier today (I use ProtonMail, so I figured why not).

                I can’t tell much about it. It seems very…safety theater / personality removed.

                Any idea of what models they use now? I get a feeling that the main brain is 14B (based on how it responds to questions / drops nuance).

  • SuspciousCarrot78@lemmy.world · 15 hours ago

As I recall, there are some new tricks that allow up-to-8B models to run on a Raspberry Pi 5 at around 10-15 tokens per second with --ctx 32768. I haven’t kept across it because I don’t visit Reddit, but that was my last recollection. If you fossick over there, you may be able to find it. Or use kagi.com to find it, heh.

    One of the goals of the harness that I built was to reduce memory pressure, particularly KV cache, so that you could run larger models on more constrained hardware, but I’m not here to spruik myself. I’m just letting you know that there are ways and means to get it done on SBCs.
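To see why KV cache is the pressure point on an SBC, here is a rough sizing sketch. The layer/head counts are borrowed from the Qwen3-8B-class shape upthread (36 layers, 8 KV heads); the head dimension of 128 and FP16 cache are assumptions, and quantized KV caches shrink this further.

```python
def kv_cache_gib(ctx, layers=36, kv_heads=8, head_dim=128, bytes_per=2):
    """Approximate KV-cache size in GiB for a dense transformer.

    Two tensors (K and V) are stored per layer per token, each of
    shape (kv_heads, head_dim) at `bytes_per` bytes per element.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per * ctx / 2**30

if __name__ == "__main__":
    for ctx in (4096, 32768):
        print(f"ctx={ctx:>6}: {kv_cache_gib(ctx):.2f} GiB")
```

At --ctx 32768 that is ~4.5 GiB of cache alone at FP16, on top of the weights, which is exactly the kind of memory pressure a harness has to attack before an 8B fits on a Pi-class board.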

    EDIT: I “kagi’ed” it for you. Here

    qwen3.5 9B Q8_0 | 8.86 GiB | 8.95 B | CPU | 4 | 0 | pp512 | 18.20 ± 0.23 tok/s