Recently I’ve been experimenting with Claude and feeling the burn on the premium API usage. I wanted to know how much cheaper my local LLM is in terms of cost per output token.
Claude Sonnet is a good reference at $15 per 1 million output tokens, so I wanted to know how many tokens $15 worth of electricity powering my rig would generate by comparison.
(These calculations only cover raw token generation, by the way. In the real world there's the initial hardware cost, ongoing maintenance as parts fail, and the human time to set it all up, which are much harder to factor into the equation.)
So how does one even calculate such a thing? Well, you need to know:
- how many watts your inference rig consumes at load
- how many tokens on average it can generate per second while inferencing (with context relatively filled up, we want conservative estimates)
- the price of electricity on your utility bill, in dollars per kilowatt-hour (kWh)
Once you have those constants you can work out how many kilowatt-hours $15 of electricity buys, convert that into hours of runtime at your rig's power draw, then figure out the total number of tokens you would expect to generate over that time given the TPS.
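The arithmetic above can be sketched in a few lines of Python. This is just a minimal sketch, not the actual calculator: the function names are made up, and the 300 W, 10 tokens/s and $0.15/kWh figures in the example are placeholder values, not the numbers from my screenshot.

```python
def tokens_for_budget(budget_usd, watts_at_load, tokens_per_second, usd_per_kwh):
    """Estimate how many tokens a given electricity budget buys."""
    kwh_bought = budget_usd / usd_per_kwh               # energy the budget pays for
    hours_of_runtime = kwh_bought / (watts_at_load / 1000.0)  # kWh / kW = hours
    return hours_of_runtime * 3600.0 * tokens_per_second


def usd_per_million_tokens(watts_at_load, tokens_per_second, usd_per_kwh):
    """Electricity cost of generating one million tokens."""
    kwh_per_token = (watts_at_load / 1000.0) / (tokens_per_second * 3600.0)
    return kwh_per_token * usd_per_kwh * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs (assumptions, swap in your own measurements).
    tokens = tokens_for_budget(15, watts_at_load=300, tokens_per_second=10, usd_per_kwh=0.15)
    cost = usd_per_million_tokens(watts_at_load=300, tokens_per_second=10, usd_per_kwh=0.15)
    print(f"$15 of electricity buys roughly {tokens:,.0f} tokens")
    print(f"That works out to about ${cost:.2f} per million tokens")
```

With those placeholder inputs, $15 buys 100 kWh, which is about 333 hours at 300 W, or roughly 12 million tokens (about $1.25 per million). Plug in your own wattage, TPS and utility rate to get your real figure.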
The numbers shown in the screenshot are for a model fully loaded into VRAM on the ol’ 1070 Ti 8GB. But even with partially offloaded 22-32B models running at 1-3 TPS, it's still a better deal overall.
I plan to offer the calculator as a tool on my site and release it under an open-source license like the GPL if anyone is interested.
Dude! That’s so dope. I would really like your insights in how you tuned MoE. That would be a game changer as you can swap out unnecessary layers from the GPU and still get the benefit of using a bigger model and stuff.
Yeah it’s a little hard to do inference with these limited VRAM situations and larger contexts. That’s a massive pain
I don’t have a lot of knowledge on the topic, but I'm happy to point you in a good direction for reference material. I first heard about tensor layer offloading from a post a few months ago. That post links to another one on MoE expert layer offloading, which it was based on. I highly recommend you read through both posts.
The gist of the tensor offloading strategy is this: instead of offloading entire layers with --gpulayers, you use --overridetensors to keep specific large tensors (particularly FFN tensors) on the CPU while moving everything else to the GPU.
This works because:
You need to figure out exactly which tensors need to be offloaded for your model by looking at its weights and cooking up a regex, as described in the post.
Here's an example of koboldcpp startup flags for doing this. The key part is the --overridetensors flag and the regex it contains:
python ~/koboldcpp/koboldcpp.py --threads 10 --usecublas --contextsize 40960 --flashattention --port 5000 --model ~/Downloads/MODELNAME.gguf --gpulayers 65 --quantkv 1 --overridetensors "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU"
...
[18:44:54] CtxLimit:39294/40960, Amt:597/2048, Init:0.24s, Process:68.69s (563.34T/s), Generate:56.27s (10.61T/s), Total:124.96s
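If you want to sanity-check a regex before launching, here is a rough Python sketch that lists which tensor names the pattern above would pin to the CPU. It rests on a few assumptions: that the part before =CPU is treated as a regular expression searched against each tensor name, that tensors are named like blk.N.ffn_up.weight (the usual GGUF convention), and that the model has 64 blocks, which is a made-up number for illustration. Check your own model's tensor names before relying on it.

```python
import re

# The pattern from the --overridetensors flag, without the "=CPU" suffix.
PATTERN = re.compile(r"\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up")


def tensors_kept_on_cpu(tensor_names):
    """Return the tensor names the pattern matches anywhere in the name."""
    return [name for name in tensor_names if PATTERN.search(name)]


# Hypothetical model: 64 blocks, GGUF-style names (an assumption).
N_BLOCKS = 64
PARTS = ("attn_q", "ffn_gate", "ffn_down", "ffn_up")
names = [f"blk.{i}.{part}.weight" for i in range(N_BLOCKS) for part in PARTS]

cpu = tensors_kept_on_cpu(names)

if __name__ == "__main__":
    print(f"{len(cpu)} of {len(names)} tensors would stay on CPU:")
    for name in cpu:
        print(" ", name)
```

On this made-up layout the pattern picks out only the ffn_up tensors of the odd-numbered blocks from 1 through 39, and leaves every other tensor for the GPU.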
The exact specifics of how you determine which tensors to offload for each model, and the regex to go with them, are a little beyond my knowledge, but the people who wrote the tensor post did a good job of explaining that process in detail. Hope this helps.
Damn! Thank you so much. This is very helpful and a great starting point for me to mess about to make the most of my LLM setup. Appreciate it!!