When I first got into local LLMs nearly 3 years ago, in mid 2023, the frontier closed models were of course impressively capable.

I then tried my hand at running 7B-size local models, primarily one called Zephyr-7B (what happened to these models?? Dolphin, anyone??), on my gaming PC with an 8GB AMD RX 580 GPU. Fair to say it was just a curiosity exercise (in terms of model performance).

Fast forward to this month: I'm revisiting local LLMs. (Although I no longer have the gaming PC. Cost-of-living crisis, anyone? 😫)

And the ~30B-size models now look very sufficient. #Qwen has taken the helm in this tier. They're still very expensive to set up locally, though within grasp.

I'm rooting for the edge-computing models now: the ~2B-size models. Thanks to their low footprint, they are practical for many people to run on an SBC 24/7 at home.
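To put "low footprint" in concrete terms, here's a rough back-of-envelope sketch: quantized weights plus an FP16 KV cache, against an SBC's RAM. All the model numbers below are hypothetical for a generic ~2B model, not taken from any particular model card.

```python
def model_ram_gb(n_params, bits_per_weight, n_ctx, n_layers,
                 n_kv_heads, head_dim, kv_bytes=2):
    """Rough RAM estimate: quantized weights + FP16 KV cache.

    KV cache = 2 tensors (K and V) * layers * KV heads * head dim
               * context length * bytes per element.
    Ignores activations, runtime overhead, and any higher-precision
    embedding/output layers, so treat this as a floor.
    """
    weights_bytes = n_params * bits_per_weight / 8
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * kv_bytes
    return (weights_bytes + kv_cache_bytes) / 1e9

# Hypothetical ~2B model, 4-bit quant, 4K context (illustrative numbers only):
need = model_ram_gb(n_params=2e9, bits_per_weight=4, n_ctx=4096,
                    n_layers=24, n_kv_heads=8, head_dim=128)
print(f"{need:.2f} GB")  # ~1.4 GB: comfortably inside an 8 GB SBC
```

Even doubling this for runtime overhead leaves plenty of headroom on a common 8 GB board, which is why the 2B class is the one that fits the always-on home-SBC use case.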

But these edge models are in the 'curiosity category' now.

  • SuspciousCarrot78@lemmy.world · 9 hours ago

    I'm glad to see 1.58-bit models finally starting to appear.

    I got GPT to put the benchmarks side by side (for what they're worth). Bonsai-8B seems to be cooked down from Qwen3-8B. If they can squeeze an 8B into 1 GB…then perhaps we can get a 20-30B in 4 GB soon.
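    The arithmetic behind that hope checks out at the napkin level (my own sketch, not from either model card):

```python
def quantized_weights_gb(n_params, bits_per_weight):
    """File size from weights alone: params * bits -> gigabytes.

    Real GGUF files are somewhat larger (metadata, higher-precision
    embedding/output layers), so treat this as a floor."""
    return n_params * bits_per_weight / 8 / 1e9

# 8.19B params at 1 bit -> ~1.02 GB, close to Bonsai's reported 1.15 GB file
print(round(quantized_weights_gb(8.19e9, 1.0), 2))
# 30B params at 1 bit -> 3.75 GB, so "20-30B in 4 GB" is plausible
print(round(quantized_weights_gb(30e9, 1.0), 2))
```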

    (All figures below are from the respective Hugging Face model cards.)

    | Category | Bonsai-8B-gguf | Qwen3-4B-Instruct-2507 |
    | --- | --- | --- |
    | Base / lineage | Compressed Qwen3-8B dense architecture in 1-bit GGUF Q1_0 form | Official Qwen3 4B instruct release from Alibaba/Qwen |
    | Params | 8.19B total, ~6.95B non-embedding | 4.0B total, 3.6B non-embedding |
    | Layers / heads | 36 layers, GQA 32 Q / 8 KV | 36 layers, GQA 32 Q / 8 KV |
    | Context length | 65,536 tokens | 262,144 tokens native |
    | Format | GGUF Q1_0, end-to-end 1-bit weights | Standard full model release; quantized variants exist elsewhere, but the official card is the base instruct model |
    | Deployed size / memory | 1.15 GB deployed; Prism says 14.2x smaller than FP16 | No deployed size listed on the card; as a normal 4B model it is materially larger than Bonsai in practice |
    | Stated goal | Extreme compression, speed, and efficiency while staying "competitive" with 8B-class models | Strong general-purpose instruct model with gains in reasoning, coding, writing, tool use, and long-context handling |
    | Published benchmark bundle | EvalScope bundle across MMLU-R, MuSR, GSM8K, HE+, IFEval, BFCL, with 70.5 avg | Broader Qwen suite including MMLU-Pro, GPQA, AIME25, ZebraLogic, LiveBench, LiveCodeBench, IFEval, Arena-Hard v2, BFCL-v3, plus agent/multilingual tasks |
    | Knowledge benchmarks | MMLU-R 65.7 | MMLU-Pro 69.6, MMLU-Redux 84.2, GPQA 62.0, SuperGPQA 42.8 |
    | Reasoning benchmarks | MuSR 50, GSM8K 88 | AIME25 47.4, HMMT25 31.0, ZebraLogic 80.2, LiveBench 63.0 |
    | Coding benchmarks | HumanEval+ 73.8 | LiveCodeBench 35.1, MultiPL-E 76.8, Aider-Polyglot 12.9 |
    | Instruction following / alignment | IFEval 79.8 | IFEval 83.4, Arena-Hard v2 43.4, Creative Writing v3 83.5, WritingBench 83.4 |
    | Tool / agent metrics | BFCL 65.7 | BFCL-v3 61.9, TAU1-Retail 48.7, TAU1-Airline 32.0, TAU2-Retail 40.4 |
    | Speed claims | Prism reports 368 tok/s on RTX 4090 vs 59 tok/s FP16 baseline, plus strong gains on other hardware | Card emphasizes capability and deployment support; no comparable throughput table |
    | Energy claims | Prism reports 4.1x better energy/token on RTX 4090 and 5.1x on M4 Pro vs FP16 baselines | No equivalent energy table on the card |
    | Best practical use | Tiny footprint, fast local inference, "how is this running here?" deployments | Better bet for raw reasoning, writing, long context, and general instruction-following |