Benchmarks are bullshit

SuspciousCarrot78@lemmy.world · 8 hours ago

Benchmarks are bullshit

SuspciousCarrot78@lemmy.world · 7 hours ago

3.5 might have been an odd duck, what with the new hybrid attention architecture, but I’m with you. Happy to eyeball the benchmarks, but for my uses, I run my own.

Funnily enough, “vibe” benchmarks might actually be more useful. That’s one of the reasons why I enjoy Bijan Bowen’s content.

hendrik@palaver.p3x.de · edit-2 7 hours ago

Hmmh. I tried to do benchmarks early on, about when Llama 2 was a thing… Followed the Reddit discussions. And then at some point I wanted to replace Mistral-Nemo with something newer, but I disliked how every other model had turned to the ChatGPT / sycophant style of talking… But it’s a massively laborious undertaking. The official benchmarks don’t cover any of that. And there’s no good way to automate it either. So I spent half a day reading output manually and rating it in an Excel spreadsheet. With some success, but it’s way too complicated. So I mainly eyeball it these days. And sometimes there are some recommendations somewhere on the internet. And I learned to accept how chatbots always go on and on with redundant information unless I tell them: skip the bullshit, I have an appointment at the hairdresser in 10 minutes and you need to explain it in 3 sentences. 😄

I suppose for tasks like coding or factual knowledge, it’s way easier to come up with fully automated benchmarks.
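
To illustrate why the factual case is easier to automate, here is a rough sketch, not anything from an actual benchmark suite. The question list, the `score_model` helper and the `fake_model` stand-in are all made up for illustration; in practice `ask_model` would wrap whatever local model or API you run. Matching on a substring is crude (an expected "8" would also match "18"), but it shows the point: with a known right answer, no human has to read the output.

```python
from typing import Callable

# Hypothetical question set: (prompt, expected answer as a lowercase substring).
QUESTIONS = [
    ("What is the capital of France?", "paris"),
    ("How many bits are in a byte?", "8"),
    ("What is the chemical symbol for water?", "h2o"),
]


def score_model(ask_model: Callable[[str], str], questions=QUESTIONS) -> float:
    """Return the fraction of questions whose expected answer appears in the reply."""
    hits = 0
    for prompt, expected in questions:
        reply = ask_model(prompt).lower()
        # Crude check: the expected string only has to appear somewhere in the reply.
        if expected in reply:
            hits += 1
    return hits / len(questions)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: swap in your own client here)."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "How many bits are in a byte?": "A byte has 8 bits.",
    }
    return canned.get(prompt, "I don't know.")


if __name__ == "__main__":
    print(f"accuracy: {score_model(fake_model):.2f}")
```

Nothing comparable works for tone or sycophancy, since there is no expected string to match against, which is why that part ends up in a spreadsheet.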