Recently I’ve been experimenting with Claude and feeling the burn on the premium API usage. I wanted to know how much cheaper my local LLM was in terms of cost per output token.

Claude Sonnet is a good reference at $15 per 1 million output tokens, so I wanted to know how many tokens $15 worth of electricity powering my rig would generate by comparison.

(These calculations only cover raw token generation, by the way. In the real world there’s also the cost of the initial hardware, ongoing maintenance as parts fail, and the human time for setup, all of which are much harder to factor into the equation.)

So how does one even calculate such a thing? Well, you need to know

  1. how many watts your inference rig consumes at load
  2. how many tokens on average it can generate per second while inferencing (with context relatively filled up, we want conservative estimates)
  3. the price of electricity on your utility bill, in dollars per kilowatt-hour (kWh)

Once you have those constants you can work out how many kilowatt-hours $15 of electricity buys, convert that into hours of runtime at your rig’s wattage, then figure out the total number of tokens you would expect to generate over that time given the TPS.
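As a rough sketch of that arithmetic in Python: the function names and the example inputs (200 W, 20 tokens/s, $0.15 per kWh) are made up for illustration and are not the figures from the screenshot, so plug in your own.

```python
def tokens_for_budget(watts, tokens_per_second, price_per_kwh, budget=15.0):
    """Estimate how many tokens a given electricity budget buys."""
    kwh_bought = budget / price_per_kwh          # energy the budget pays for
    hours = kwh_bought / (watts / 1000.0)        # runtime at the rig's load draw
    return hours * 3600.0 * tokens_per_second    # seconds of runtime times TPS


def cost_per_million_tokens(watts, tokens_per_second, price_per_kwh):
    """Electricity cost of generating one million tokens."""
    cost_per_hour = (watts / 1000.0) * price_per_kwh
    tokens_per_hour = tokens_per_second * 3600.0
    return cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical rig: 200 W at load, 20 tokens/s, $0.15 per kWh.
print(f"{tokens_for_budget(200, 20, 0.15):,.0f} tokens for $15")
print(f"${cost_per_million_tokens(200, 20, 0.15):.2f} per million tokens")
```

With those assumed inputs, $15 buys 100 kWh, which is 500 hours of runtime and roughly 36 million tokens, or about $0.42 of electricity per million tokens.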

The numbers shown in the screenshot are for a model fully loaded into VRAM on the ol’ 1070 Ti 8GB. But even with partially offloaded numbers for 22-32B models at 1-3 TPS it’s still a better deal overall.
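One way to put a number on "still a better deal" is the break-even speed: the slowest generation rate at which electricity per token still undercuts the API price. This is only a sketch; `break_even_tps` is a hypothetical helper, and the 200 W and $0.15 per kWh inputs are illustrative assumptions rather than measurements from this rig.

```python
def break_even_tps(watts, price_per_kwh, api_price_per_million=15.0):
    """Tokens/s below which local electricity costs more per token than the API."""
    cost_per_second = (watts / 1000.0) * price_per_kwh / 3600.0
    api_price_per_token = api_price_per_million / 1_000_000
    return cost_per_second / api_price_per_token


# Hypothetical rig: 200 W at load, $0.15 per kWh, compared against $15 per million tokens.
print(f"{break_even_tps(200, 0.15):.2f} tokens/s")
```

Under those assumptions the break-even lands at roughly 0.56 tokens/s, so 1-3 TPS would still come out ahead on electricity alone.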

I plan to offer the calculator as a tool on my site and release it under a permissive license like gpl if anyone is interested.

  • slacktoid@lemmy.ml
    4 months ago

    Not to be that guy (he says as he becomes that guy) but the GPL is not a permissive license, BSD and MIT are. Tho imo GPL is the better and probably best license.

    Also what models and use cases did you run it for? And what was your context window?

    • SmokeyDope@lemmy.world (OP, mod)
      4 months ago

      Thanks for being that guy, good to know. Those specific numbers shown were just done tonight with DeepHermes 8b q6km (finetuned from llama 3.1 8b) with max context at 8192. In the past, before I reinstalled, I managed to squeeze ~10k context with the 8b by booting without a desktop environment. I happen to know that DeepHermes 22b iq3 (finetuned from mistral small) runs at like 3 tps partially offloaded with 4-5k context.
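For a rough sense of why context is so tight on 8GB, here is a back-of-the-envelope KV cache estimate. It assumes the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and an unquantized 16-bit cache; `kv_cache_bytes` is a hypothetical helper, and real backends add their own overhead on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128.
for ctx in (8192, 10240):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx} tokens: {gib:.2f} GiB")
```

Under those assumptions the cache alone is about 1 GiB at 8192 tokens and 1.25 GiB at 10240, on top of the model weights.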

      DeepHermes 8b is the fast and efficient model I use for general conversation, basic web search, RAG, data table formatting/basic markdown generation, and simple computations with deepseek r1 distill reasoning CoT turned on.

      DeepHermes 22b is the local powerhouse model I use for more complex tasks requiring either more domain knowledge or reasoning ability. For example, to help break down legacy code and boilerplate simple functions for game creation.

      I have a vision model + TTS pipeline for OCR scanning and narration using qwen 2.5vl 7b + outetts + wavtokenizer, which I was considering trying to calculate, though I’d need to add up both the LLM tps and the audio TTS tps.

      I plan to load up a stable diffusion model and see how image generation compares but the calculations will probably be slightly different.

      I hear there’s one or two local models floating around that work with roo-cline for advanced tool usage. If I can find a local model in the 14b range that works with roo, even if just for basic stuff, it will be incredible.

      Hope that helps inform you, sorry if I missed something.

      • slacktoid@lemmy.ml
        4 months ago

        You’re good. I’m trying to get larger context windows on my models so trying to figure that out and balance token throughput. I do appreciate your insights into the different use cases.

        Have you tried larger 70b models? Or compared against larger MoE models?

        • SmokeyDope@lemmy.world (OP, mod)
          4 months ago

          I have not tried any models larger than a very low quant qwen 32b. My personal limit for partial offloading speeds is 1 tps and the 32b models encroach on that. Once I get my VRAM upgraded from 8gb to 16-24gb I’ll test the waters with higher parameter counts and hit some new limits to benchmark :) I haven’t tried out MoE models either, though I keep hearing about them. AFAIK they’re popular with people because you can do advanced partial offloading strategies between different experts to really bump the token generation. So playing around with them has been on my ML bucket list for a while.

          • slacktoid@lemmy.ml
            4 months ago

            Dude! That’s so dope. I would really like your insights in how you tuned MoE. That would be a game changer as you can swap out unnecessary layers from the GPU and still get the benefit of using a bigger model and stuff.

            Yeah it’s a little hard to do inference with these limited VRAM situations and larger contexts. That’s a massive pain

            • SmokeyDope@lemmy.world (OP, mod)
              4 months ago

              I don’t have a lot of knowledge on the topic but I’m happy to point you in a good direction for reference material. I first heard about tensor layer offloading from here a few months ago. That post links to another one on the MoE expert layer offloading it was based off. I highly recommend you read through both posts.

              The gist of the tensor offloading strategy is that instead of offloading entire layers with --gpulayers, you use --overridetensors to keep specific large tensors (particularly FFN tensors) on CPU while moving everything else to GPU.

              This works because:

              • Attention tensors: Small, benefit greatly from GPU parallelization
              • FFN tensors: Large, can be efficiently processed on CPU with basic matrix multiplication
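To make the size difference concrete, here is a quick parameter count for a single transformer layer, assuming the Llama 3.1 8B shape (hidden size 4096, FFN size 14336, 32 attention heads, 8 KV heads). The helper names are made up for this example, and other architectures will give different ratios.

```python
def attention_params(hidden, n_heads, n_kv_heads):
    """Weights in the Q, K, V and output projections of one layer (no biases)."""
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim
    return hidden * hidden * 2 + hidden * kv_dim * 2   # Q and O, then K and V


def ffn_params(hidden, ffn_dim):
    """Weights in the gate, up and down projections of one gated FFN layer."""
    return 3 * hidden * ffn_dim


# Assumed Llama 3.1 8B shape.
attn = attention_params(4096, 32, 8)
ffn = ffn_params(4096, 14336)
print(f"attention: {attn / 1e6:.1f}M, FFN: {ffn / 1e6:.1f}M, ratio: {ffn / attn:.1f}x")
```

For that shape the FFN weights come to about 176M per layer against about 42M for attention, so the FFN tensors are where most of the VRAM goes.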

              You need to figure out exactly which tensors need to be offloaded for your model by looking at the weights and cooking up a regex according to the post.

              Here’s an example of kobold startup flags for doing this. The key part is the override tensors flag and the regex contained in it:

              python ~/koboldcpp/koboldcpp.py --threads 10 --usecublas --contextsize 40960 --flashattention --port 5000 --model ~/Downloads/MODELNAME.gguf --gpulayers 65 --quantkv 1 --overridetensors "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU"
              ...
              [18:44:54] CtxLimit:39294/40960, Amt:597/2048, Init:0.24s, Process:68.69s (563.34T/s), Generate:56.27s (10.61T/s), Total:124.96s
              

              The exact specifics of how you determine which tensors to pick for each model, and the associated regex, are a little beyond my knowledge, but the people who wrote the tensor post did a good job trying to explain that process in detail. Hope this helps.
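If it helps to see what that regex actually selects, here is a small sketch that runs the pattern from the command above (the part before `=CPU`) against tensor names. It assumes GGUF-style names like `blk.<layer>.ffn_up.weight` and that the pattern is applied as a search anywhere in the name; the function name and the 64-layer count are made up for the example and do not describe the model in the log.

```python
import re

# Regex portion of the --overridetensors value, without the "=CPU" target.
PATTERN = re.compile(r"\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up")


def cpu_ffn_up_layers(n_layers):
    """Return the layer indices whose ffn_up tensor the pattern would match."""
    hits = []
    for i in range(n_layers):
        name = f"blk.{i}.ffn_up.weight"   # assumed GGUF-style tensor name
        if PATTERN.search(name):
            hits.append(i)
    return hits


# Hypothetical 64-layer model.
print(cpu_ffn_up_layers(64))
print(bool(PATTERN.search("blk.1.attn_q.weight")))
```

In this sketch the pattern picks the `ffn_up` tensors of the odd-numbered layers from 1 through 39 and leaves attention tensors, and every layer from 40 up, untouched.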

              • slacktoid@lemmy.ml
                4 months ago

                Damn! Thank you so much. This is very helpful and a great starting point for me to mess about to make the most of my LLM setup. Appreciate it!!

                • brucethemoose@lemmy.world
                  3 months ago

                  Late reply, but if you are looking into this, ik_llama.cpp is explicitly optimized for expert offloading. I can get like 16 t/s with a Hunyuan 70B on a 3090.

                  If you want long context for models that fit in VRAM, your last stop is TabbyAPI. I can squeeze in 128K context from a 32B in 24GB VRAM, easy… I could probably do 96K with 2 parallel slots, though unfortunately most models are pretty terrible past 32K.

                  • slacktoid@lemmy.ml
                    3 months ago

                    I need to mess with tabbyapi. Doesn’t help that there’s like 2 tabbys, one is tabbyapi and the other is tabbyml. I am guessing tool support is still in its infancy.