

We recommend setting temperature=0.8 and top_p=0.9 in the sampling parameters.
Try that. I believe those params are available in Kobold. If that doesn’t work, send me a sample of what you’re doing and I’ll try it out.
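For intuition, here’s a rough sketch (my own, not Kobold’s actual implementation) of what those two knobs do to the next-token distribution, using plain numpy:

```python
import numpy as np

def sample_probs(logits, temperature=0.8, top_p=0.9):
    """Turn raw logits into a next-token distribution with
    temperature scaling and nucleus (top_p) filtering."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Nucleus filter: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, zero out the rest.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = sample_probs([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_p=0.9)
```

Lower temperature sharpens the distribution; lower top_p trims the long tail of unlikely tokens before sampling.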


How are you running it? Would you be able to post your run arguments?


This is really neat. Thank you. I would love a script or a more newb-friendly guide, not just for me, but for a lot of other users.
Can I make a suggestion? Post your script on GitHub or similar with a proper (open) license so people can make suggestions or versions they find useful.


I’ve been on the internet a long time and this made me say “what the fuck” out loud
Edit: not sure whether I should ask what this all is or if I should compliment you on your “output”


3090 24GB ($800 USD)
3060 12GB x 2 if you have 2 PCIe slots (<$400 USD)
Radeon MI50 32GB with Vulkan (<$300 USD) if you have more time, space, and will to tinker


Holy cow!


That fixed it.
I am a fan of this quant cook. He often posts perplexity charts.
https://huggingface.co/ubergarm
All of his quants require ik_llama, which works best with Nvidia CUDA, but they can do a lot with RAM+VRAM or even hard drive + RAM. I don’t know if 8GB is enough for everything.


You are not alone. It blew my mind how good it is per billion parameters. As an example, I can’t think of another model that will give you working code at 4B or less. I haven’t tried it on agentic tasks, but that would be interesting.


I’m not sure if it’s a me issue, but that’s a static image. I figure you meant to post the part where they throw a brick into it.
Also, if this post was serious, how does a highly quantized model compare to something less quantized but with fewer parameters? I haven’t seen benchmarks other than perplexity, which isn’t a good measure of capability.
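For anyone following along, perplexity itself is simple to compute: it’s the exponential of the average negative log-probability the model assigned to the true next tokens. A toy sketch with made-up probabilities (my own illustration, not from any benchmark):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the
    probabilities the model assigned to the true next tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns every true token probability 0.25 has
# perplexity 4: it's as "surprised" as a uniform 4-way guess,
# which says nothing about whether its answers are useful.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```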


I don’t know if this is still useful for you, but I tried this out, mostly because I wanted to make sure I wasn’t crazy. Here’s my gpt-oss setup running on cheap AMD Instinct VRAM:
./llama-server \
  --model {model}.gguf \
  --alias "gpt-oss-120b-mxfp4" \
  --threads 16 \
  -fa on \
  --main-gpu 0 \
  --ctx-size 64000 \
  --n-cpu-moe 0 \
  --n-gpu-layers 999 \
  --temp 1.0 \
  -ub 1536 \
  -b 1536 \
  --min-p 0.0 \
  --top-p 1.0 \
  --top-k 0 \
  --jinja \
  --host 0.0.0.0 \
  --port 11343 \
  --chat-template-kwargs '{"reasoning_effort": "medium"}'
I trimmed the content because it wasn’t relevant but left roughly the shape of the replies to give a sense of the verbosity.
Test 1: With default system message
user prompt: how do i calculate softmax in python
What is softmax
1 python + numpy
...
quick demo
...
2 SciPy
...
...
...
8 full script
...
running the script
...
results
...
TL;DR
...
followup prompt: how can i GPU-accelerate the function with torch
1 why pytorch is fast
...
...
**[Headers 2,3,4,5,6,7,8,9]**
...
...
TL;DR
...
Recap
...
Table Recap
...
Common pitfalls
...
Going beyond float32
...
10 Summary
...
Overall: 6393 tokens, including reasoning
Test 2: With this system prompt: You are a helpful coding assistant. Provide concise, to-the-point answers. No fluff. Provide straightforward explanations when necessary. Do not add emoji and only provide tl;drs or summaries when asked.
user prompt: how do i calculate softmax in python
Softmax calculation in Python
...
Key points
...
followup prompt: how can i GPU-accelerate the function with torch
GPU‑accelerated Softmax with PyTorch
...
What the code does
...
Tips for larger workloads
...
Overall: 1103 tokens, including reasoning
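For reference, since the actual reply bodies are trimmed above, here’s my own minimal version of the answer both prompts are fishing for: a numerically stable numpy softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the max before
    exponentiating so large logits don't overflow."""
    x = np.asarray(x, dtype=np.float64)
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

out = softmax([1.0, 2.0, 3.0])
```

Both test runs are explaining roughly this; the only thing the system prompt changed was how many tokens of framing got wrapped around it.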


Totally. I think gpt-oss is outright annoying with its verbosity. A system prompt will get around that.


Qwen 3 or Qwen 3 Coder? Qwen 3 comes in 235B, 30B, and smaller sizes. Qwen 3 Coder comes in a 30B or 480B size.
OpenRouter has multiple quant options and, for coding, I’d try to only use 8-bit int or higher.
Claude also has a ton of sizes and deployment options with different capabilities.
As far as reasoning, the newest DeepSeek V3.1 Terminus should be pretty good.
Honestly, all of these models should be able to help you up to a certain level with Docker. I would double-check how you connect to OpenRouter, making sure your hyperparams are good and thinking/reasoning is enabled. Maybe try duck.ai and see if the models there match up to whatever you’re doing in OpenRouter.
Finally, not being a hater, but LLMs are not intelligent. They cannot actually reason or think. They can probabilistically align with answers you want to see. Sometimes your issue might be too weird or new for them to give you a good answer. Even today, models will give you docker compose files with a version number at the top, a feature which has been deprecated for over a year.
Edit: gpt-oss-120b should be cheap and capable enough. Available on duck.ai
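On the 8-bit-or-higher point: a rough sketch (my own toy example, not how any particular GGUF quant format actually works) of why fewer bits means more rounding error in the weights:

```python
import numpy as np

def fake_quantize(weights, bits):
    """Round weights to a symmetric integer grid with the given
    bit width, then map back to floats (quantize + dequantize)."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for int8
    scale = np.abs(weights).max() / levels
    return np.round(weights / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)
err8 = np.abs(fake_quantize(w, 8) - w).mean()
err4 = np.abs(fake_quantize(w, 4) - w).mean()
# err4 comes out much larger than err8: halving the bit
# width shrinks the grid from 255 levels to 15.
```

Real formats mitigate this with per-block scales and smarter rounding, which is exactly why perplexity charts like ubergarm’s are worth checking before picking a quant.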


I’m sure someone will give a better answer, but this smells of a UEFI/Secure Boot problem. Look in your BIOS and turn those off, or set them to Legacy or “Other OS”.


Chiming in to say this is a very reasonable starting place, and wanted to highlight to OP that this solution is 100% self-hosted.


Yep. That was bleak.


I’m not close to web dev, so I don’t have context. Why is Tailwind bad?
Seems like a not-too-bloated alternative to React.
I’ve been using the new UI since release and it’s been good.


It depends on what you mean.
To me, Ollama feels like it’s designed to be a developer-first, local LLM server with just enough functionality to get you to a POC, from where you’re intended to use someone else’s compute resources.
llama.cpp actually supports more backends, with continuous performance improvements and support for more models.


ROCm is a software stack which includes a bunch of SDKs and APIs.
HIP is a subset of ROCm which lets you program AMD GPUs, with a focus on portability from Nvidia’s CUDA.
https://wiki.archlinux.org/title/Swap_on_video_RAM