Replaced $40/month in AI API subscriptions with self-hosted Ollama + n8n

quickbitesdev@discuss.tchncs.de · 1 month ago

Replaced $40/month in AI API subscriptions with self-hosted Ollama + n8n

TheMightyCat@ani.social · edit-2 1 month ago

Depending what OP was using before but going from something like GPT5.2 to LLama 3 8B will be a massive difference (Although OP says to use it only for basic tasks so that does offset it)

LLama 3 already being a very old model doesn’t help either

I run Qwen3.5-35B-A3B-AWQ-4bit which while leagues ahead of LLama 3 8B still is a very noticeable difference.

This is not to say open source is bad, if one had the resources to run something like Qwen3.5-397B-A17B it would also be up there.

Valmond@lemmy.dbzer0.com · 1 month ago

What kind of hardware do you need to run those models?

TheMightyCat@ani.social · edit-2 1 month ago

I’m running 2x4090, the 35B fits very comfortable in that.

For large models like the 397B without a ton of money there are several ways, ive seen posts of people using arrays of used 3090s with good results.

The other option is CPU inference although with current RAM prices that is less cost effective.

I was looking at maybe an array of Milk-V JUPITER2 since vllm added riscv support which could be very cost effective.

Jakeroxs@sh.itjust.works · 1 month ago

Depends on how much quantization, but still fairly beefy, couldn’t run it on my homelab with a 3080ti for example.

I generally use smaller 8-12b models and they’re alright depending on the task.

suicidaleggroll@lemmy.world · edit-2 1 month ago

In general, you take the model size in billions of parameters (eg: 397B), divide it by 2 and add a bit for overhead, and that’s how much RAM/VRAM it takes to run it at a “normal” quantization level. For Qwen3.5-397B, that’s about 220 GB. Ideally that would be all VRAM for speed, but you can offload some or all of that to normal RAM on the CPU, you’ll just take a speed hit.

So for something like Qwen3.5-397B, it takes a pretty serious system, especially if you’re trying to do it all in VRAM.