

Yea… it’s not quite the same thing to actually run DeepSeek R1, a 671B model, as, for example, DeepSeek-R1-Distill-Qwen-1.5B.
As long as they are talking about normal things and not playing D&D 😃
You have to specify which quantization you find acceptable and which context size you require. I think the most affordable option for running large models locally is still multiple RTX 3090 cards, and you’d probably need 3 or 4 of them depending on quantization and context.
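As a rough back-of-the-envelope sketch (the bits-per-weight and cache numbers here are just my assumptions; real usage depends on the quant format and context length):

```python
# Very rough VRAM estimate: weights + KV cache + framework overhead.
# All constants are ballpark assumptions, not measurements.
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     kv_cache_gb: float = 8.0, overhead_gb: float = 2.0) -> float:
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + kv_cache_gb + overhead_gb

for bpw in (4.0, 6.0, 8.0):
    gb = estimate_vram_gb(70, bpw)
    cards = -(-gb // 24)  # ceil division: 24 GB per RTX 3090
    print(f"70B at {bpw} bpw: ~{gb:.0f} GB -> {cards:.0f}x RTX 3090")
```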
My paranoia level: even though I’m pretty good with computers in general, I would not trust myself to set up a safe public-facing service, which is the reason I don’t have any of those on my home server. If I needed something like that, I wouldn’t self-host it.
The question is probably more about what he has done than what he is doing right now, and he is kind of famous for having created Linux. To someone who doesn’t know anything about Linux licensing, I think it would be easy to suspect that Torvalds might have some kind of ownership of his creation.
If you’re using Pipewire, have you checked if you can re-route the audio sink using qpwgraph?
I was expecting more entries on a certain theme for version 420 ;)
Apex with EAC worked perfectly fine on Linux for the last two years; EA just decided to break it by replacing EAC with their own anti-cheat, which is Windows-only.
Yea, the examples are not explained well, but if they’re using some other software/code for inference, it might be configured wrong or not be compatible with qwen-coder FIM, and produce strange results for that reason.
Have you seen the examples for these models? https://github.com/QwenLM/Qwen2.5-Coder?tab=readme-ov-file#3-file-level-code-completion-fill-in-the-middle
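If I remember the format right, file-level FIM with these models boils down to wrapping the prompt in the special FIM tokens; a minimal sketch with transformers (the token names are from memory, so double-check against that README):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# FIM prompt: the model generates the code that belongs between prefix and suffix.
prompt = (
    "<|fim_prefix|>def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n"
    "<|fim_suffix|>\n    return quicksort(left) + [pivot] + quicksort(right)\n"
    "<|fim_middle|>"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```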
Add “site:reddit.com” to your google query.
The sad thing is that search engines have gotten so bad, and usually return so much garbage blog spam, that searching directly on Reddit is more likely to give useful results. I hope a similar amount of knowledge builds up on Lemmy over time.
We just had Windows Update brick itself due to a faulty update. The fix required updating the machines manually while connected to the office network, making them unusable for 2-3 hours. Another issue we’ve had is that Windows appears to monopolize virtualization HW acceleration for some memory integrity protection, which made our VMs slow and laggy. Fixing it required a combination of shell commands, settings changes and IT support remotely changing some permissions, and the issue also comes back after some updates.
Though I’ve also had quite a lot of Windows problems at home, back when I was still using it regularly. Not saying Linux usage has been problem-free, but there I can at least fix things. Windows has a tendency to give unusable error messages and make troubleshooting difficult, and even when you figure out what’s wrong, you’re at the mercy of Microsoft as to whether you’re allowed to change things on your own computer, due to the operating system’s proprietary nature.
Already? I’m still using Fedora 39 since that’s the only version supported by the CUDA Toolkit :S
The article is written in a bit of a confusing way, but you’ll most likely want to turn off Nvidia’s automatic VRAM swapping if you’re on Windows, so it doesn’t happen by accident. Partial offloading with llama.cpp is much faster AFAIK if you want to split the model between GPU and CPU, and it’s easier to find how many layers you can offload, since it simply fails to load when you set the number too high.
Also, if you want to experiment with partial offloading, maybe a 12B model around Q4 would be more interesting than the same 7B model at higher precision? I haven’t checked if anything new has come out in the last couple of months, but Mistral Nemo is fairly good IMO, though you might need to limit context to 4k or so.
Mixtral in particular runs great with partial offloading, I used a Q4_K_M quant while only having 12GB VRAM.
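For reference, a minimal partial-offload sketch with the llama-cpp-python bindings (the file name and layer count are just placeholders; with the plain llama.cpp CLI the equivalent knob is -ngl / --n-gpu-layers):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # placeholder local path
    n_gpu_layers=20,  # layers offloaded to the GPU; the rest run on the CPU
    n_ctx=4096,       # context also costs VRAM, so keep it modest
)
out = llm("Write a haiku about VRAM.", max_tokens=64)
print(out["choices"][0]["text"])
```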
To answer your original question, I think it depends on the model and use case. Complex logic such as programming seems to suffer the most from quantization, while RP/chat can take much heavier quantization while staying coherent. I think most people find quantization around 4-5 bpw gives the best value, and you really get diminishing returns above 6 bpw, so I know few who think it’s worth using 8 bpw.
Personally I always use as large a model as I can. With Q2 quantization the 70B models I’ve used occasionally give bad results, but they often feel smarter than 35B at Q4. Though it’s ofc. difficult to compare models from completely different families, e.g. command-r vs llama, and there are not that many options in the 30B range. I’d take a 35B Q4 over a 12B Q8 any day though, and a 12B Q4 over a 7B Q8, etc. In the end I think you’ll have to test for yourself and see which model and quant combination gives the best results at an inference speed you consider usable.
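The memory math behind that, as a rough illustration (the bpw figures per quant type are approximate):

```python
# Approximate model size in GB: parameters (billions) * bits per weight / 8.
def approx_size_gb(params_b: float, bpw: float) -> float:
    return params_b * bpw / 8

for name, params_b, bpw in [("70B Q2", 70, 2.6), ("35B Q4", 35, 4.8),
                            ("12B Q8", 12, 8.5), ("12B Q4", 12, 4.8),
                            ("7B Q8", 7, 8.5)]:
    print(f"{name}: ~{approx_size_gb(params_b, bpw):.0f} GB")
# A 70B at Q2 and a 35B at Q4 end up roughly the same size, which is why the
# comparison is between a bigger model squeezed harder vs a smaller one kept crisper.
```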
On Linux, AMD GPUs work significantly better than Nvidia ones. If you have a choice, choose an AMD one.
Unless you’re interested in AI stuff, then Nvidia is still the best choice. Some libraries are HW accelerated on AMD, and hopefully more will work in the future.
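For what it’s worth, the ROCm build of PyTorch exposes AMD GPUs through the same torch.cuda API, so a quick sanity check looks the same on either vendor (assuming you installed the matching build):

```python
import torch

if torch.cuda.is_available():  # also True on ROCm builds with a supported AMD GPU
    print("GPU:", torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device="cuda")
    print((x @ x).sum().item())  # runs on the GPU on both CUDA and ROCm builds
else:
    print("No GPU acceleration available; check which PyTorch build you installed.")
```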
Ofc I know it’s not meant to be literal, but talking about killing black people or not is too direct. The subjects people like this usually want to talk about tend to be more layered, e.g. “what should we do about the Jew problem” so that if you take the bait you’ll implicitly accept that “the Jew problem” exists to begin with.
Isn’t `(I|U)` equivalent to `([IU])`?
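As far as I can tell they match exactly the same thing; a quick sanity check:

```python
import re

for pattern in (r"(I|U)", r"([IU])"):
    results = [bool(re.fullmatch(pattern, ch)) for ch in "IUXA"]
    print(pattern, results)
# Both print [True, True, False, False]; the character class version is just the
# more conventional (and usually cheaper) way to write a one-character alternation.
```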
An Intel NUC running Linux. Not the cheapest solution, but it can play anything and I have full control over it. At first I tried to find some kind of programmable remote, but now we have a wireless keyboard with a built-in touchpad.
The biggest downside is that the hardware quality is kind of questionable; the first two broke after three years and a few months each, so we’re on our third now.