• 3 Posts
  • 143 Comments
Joined 2 years ago
Cake day: June 14th, 2023

  • Intel NUC running Linux. Not the cheapest solution, but it can play anything and I have full control over it. At first I tried to find some kind of programmable remote, but now we use a wireless keyboard with a built-in touchpad.

    The biggest downside is that the hardware quality is kind of questionable: the first two broke after 3 years and a few months, so we’re on our third now.














  • We just had Windows Update brick itself due to a faulty update. The fix required updating the affected machines manually while connected to the office network, which made them unusable for 2-3 hours. Another issue we’ve had is that Windows appears to monopolize virtualization hardware acceleration for some memory integrity protection, which made our VMs slow and laggy. Fixing it required a combination of shell commands, settings changes and IT support remotely changing a permission, and the issue comes back after some updates.

    I’ve also had quite a lot of Windows problems at home, back when I was still using it regularly. I’m not saying Linux has been problem-free, but there I can at least fix things. Windows has a tendency to give unusable error messages and make troubleshooting difficult, and even when you figure out what’s wrong, it’s up to Microsoft whether you’re allowed to change things on your own computer, because the operating system is proprietary.



  • The article is written in a somewhat confusing way, but you’ll most likely want to turn off Nvidia’s automatic VRAM swapping if you’re on Windows, so it doesn’t happen by accident. Partial offloading with llama.cpp is much faster AFAIK if you want to split the model between GPU and CPU. It’s also easier to find out how many layers you can offload when the model simply fails to load if you set the number too high, instead of silently swapping.

    Also, if you want to experiment with partial offloading, maybe a 12B model at around Q4 would be more interesting than the same 7B model at higher precision? I haven’t checked whether anything new has come out in the last couple of months, but Mistral Nemo is fairly good IMO, though you might need to limit the context to 4k or so.
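
If a too-high layer count makes the load fail outright, finding the limit can be treated as a search problem. Below is a rough sketch of that idea, not anything llama.cpp ships: `max_offload_layers` and the `loads_ok` callback are hypothetical names. In practice `loads_ok(n)` would be something you write yourself that tries to load the model with `n` GPU layers (the `-ngl` / `--n-gpu-layers` option in llama.cpp) and reports whether it succeeded. The sketch assumes that if `n` layers fit, any smaller number fits too.

```python
from typing import Callable


def max_offload_layers(total_layers: int, loads_ok: Callable[[int], bool]) -> int:
    """Binary-search the largest layer count in [0, total_layers] that loads.

    Assumption: loading is monotonic, i.e. if n layers fit in VRAM then
    every smaller count fits as well. Zero layers (CPU only) is assumed
    to always work and is never tested.
    """
    lo, hi = 0, total_layers  # lo is the highest count known to fit
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if loads_ok(mid):
            lo = mid        # mid fits, try more layers
        else:
            hi = mid - 1    # mid fails, the answer is below it
    return lo


if __name__ == "__main__":
    # Stand-in for a real load attempt: pretend only 23 layers fit in VRAM.
    pretend_limit = 23
    print(max_offload_layers(40, lambda n: n <= pretend_limit))
```

With 40 layers this needs about six load attempts instead of up to 40, which matters when each attempt means waiting for the model to load.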


  • Mixtral in particular runs great with partial offloading. I used a Q4_K_M quant while only having 12GB VRAM.

    To answer your original question, I think it depends on the model and use case. Complex logic such as programming seems to suffer the most from quantization, while RP/chat can take much heavier quantization and stay coherent. I think most people find that quantization around 4-5 bpw gives the best value, and you really get diminishing returns above 6 bpw, so I know few people who think it’s worth using 8 bpw.

    Personally I always use the largest model I can. With Q2 quantization, the 70B models I’ve used occasionally give bad results, but they often feel smarter than a 35B at Q4. It’s ofc. difficult to compare models from completely different families, e.g. command-r vs llama, and there aren’t that many options in the 30B range. I’d take a 35B Q4 over a 12B Q8 any day though, and a 12B Q4 over a 7B Q8, etc. In the end I think you’ll have to test for yourself and see which model and quant combination gives the best results at an inference speed you consider usable.
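
For comparing size and quant combinations like the ones above, a back-of-the-envelope estimate of the weight size is parameters times bits per weight divided by 8. The sketch below does just that arithmetic. The function names are made up for illustration, and the bpw values in the example (2.5 for a Q2-style quant, 4.5 for Q4, 8.0 for Q8) are round-number assumptions, not the exact figures of any particular quant format. It only counts the weights, so context/KV cache and runtime overhead come on top, and `layers_that_fit` additionally assumes all layers are the same size.

```python
import math


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def layers_that_fit(weights_gib: float, n_layers: int, vram_budget_gib: float) -> int:
    """Rough first guess for how many layers to offload.

    Assumes every layer takes an equal share of the weights and that
    vram_budget_gib is what is left after context and other overhead.
    """
    per_layer = weights_gib / n_layers
    return min(n_layers, math.floor(vram_budget_gib / per_layer))


if __name__ == "__main__":
    # (label, parameters in billions, assumed bits per weight)
    candidates = [
        ("70B @ ~2.5 bpw", 70, 2.5),
        ("35B @ ~4.5 bpw", 35, 4.5),
        ("12B @ ~8.0 bpw", 12, 8.0),
        ("12B @ ~4.5 bpw", 12, 4.5),
        ("7B  @ ~8.0 bpw", 7, 8.0),
    ]
    for label, params, bpw in candidates:
        print(f"{label}: {weight_gib(params, bpw):.1f} GiB of weights")
```

The result of `layers_that_fit` is only a starting point. The real limit still has to be found by trying to load the model.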