Recently I’ve been experimenting with Claude and feeling the burn of premium API usage. I wanted to know how much cheaper my local LLM is in terms of cost per output token.

Claude Sonnet is a good reference point at $15 per 1 million output tokens, so I wanted to know how many tokens $15 worth of electricity powering my rig would generate by comparison.

(These calculations cover simple raw token generation, by the way; in the real world there’s the cost of the initial hardware, ongoing maintenance as parts fail, and the human time to set everything up, which is much harder to factor into the equation.)

So how does one even calculate such a thing? Well, you need to know:

  1. how many watts your inference rig consumes at load
  2. how many tokens per second it can generate on average during inference (with the context relatively filled up, since we want conservative estimates)
  3. the electricity rate you pay on your utility bill, in dollars per kilowatt-hour

Once you have those constants you can work out how many kilowatt-hours of runtime $15 of electricity buys, then figure out the total number of tokens you’d expect to generate over that time at your TPS.
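For anyone who wants to plug in their own numbers, here’s a minimal sketch of that arithmetic. The wattage, tokens-per-second, and electricity rate below are placeholder assumptions, not measurements from my rig:

```python
# Rough cost-per-token sketch for a local inference rig.
# All constants below are assumed placeholders -- substitute your own.

WATTS_AT_LOAD = 180      # assumed rig power draw while inferencing, in watts
TOKENS_PER_SECOND = 20   # assumed conservative generation speed with context filled
PRICE_PER_KWH = 0.15     # assumed utility rate, in $ per kilowatt-hour
BUDGET = 15.00           # dollars of electricity, mirroring Claude Sonnet's $15/M output tokens

kwh_bought = BUDGET / PRICE_PER_KWH                  # kWh that $15 buys
runtime_hours = kwh_bought / (WATTS_AT_LOAD / 1000)  # hours of inference that energy powers
total_tokens = runtime_hours * 3600 * TOKENS_PER_SECOND

print(f"{kwh_bought:.1f} kWh -> {runtime_hours:.1f} h of runtime")
print(f"~{total_tokens / 1e6:.1f} million tokens for ${BUDGET:.2f} of electricity")
print(f"local cost: ${BUDGET / (total_tokens / 1e6):.2f} per million output tokens")
```

With those assumed numbers, $15 buys 100 kWh, which powers roughly 555 hours of generation and about 40 million tokens, i.e. on the order of $0.38 per million output tokens.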

The numbers shown in the screenshot are for a model fully loaded into VRAM on the ol’ 1070 Ti 8 GB. But even with the partially offloaded numbers for 22-32B models at 1-3 tps, it’s still a better deal overall.

I plan to offer the calculator as a tool on my site and release it under an open-source license like the GPL if anyone is interested.

  • rebelsimile@sh.itjust.works · 2 days ago
    I do all my local LLM-ing on an M1 Max macbook pro with a power draw of around 40-60 Watts (which for my use cases is probably about 10 minutes a day in total). I definitely believe we can be more efficient running these models at home.

    • wise_pancake@lemmy.ca · 2 days ago
      I wish I’d sprung for the Max when I bought my M1 Pro, but I am glad I splurged on memory. Even aside from LLM workloads, this thing is still excellent.

      Agree we can be doing a lot more, the recent generation of local models are fantastic.

      Gemma 3n and Phi 4 (non-reasoning) are my local workhorses lately.