I guess that’s been my general experience for a while. I download some new model with promising benchmarks, and once I try it, the results are kind of underwhelming. A few weeks ago, for example, I tried Qwen 3.5, which had “outstanding results across a full range of benchmark evaluations”, and I deleted it after it kept wasting thousands of tokens reasoning about how to respond to a simple “Hello” from the user. And sometimes I just don’t see any real performance improvement with new models. If I had to guess, I’d say they mainly trained (and improved) on the benchmarks, not on my use case.
3.5 might have been an odd duck, what with the new hybrid attention architecture, but I’m with you. Happy to eyeball the benchmarks, but for my own uses I run my own.
Funnily enough, “vibe” benchmarks might actually be more useful. That’s one of the reasons why I enjoy Bijan Bowen’s content.
Hmmh. I tried doing benchmarks early on, back when Llama 2 was a thing… Followed the Reddit discussions. Then at some point I wanted to replace Mistral-Nemo with something newer, but I disliked how every other model had shifted to the ChatGPT-style sycophantic way of talking… It’s a massively laborious undertaking, though. The official benchmarks don’t cover any of that, and there’s no good way to automate it either. So I’d spend half a day reading output manually and rating it in an Excel spreadsheet. With some success, but it’s way too complicated. So I mainly eyeball it these days, and sometimes there are recommendations somewhere on the internet. And I’ve learned to accept how chatbots always go on and on with redundant information unless I tell them to skip the bullshit, I have an appointment at the hairdresser in 10 minutes and you need to explain it in 3 sentences. 😄
I suppose for tasks like coding or factual knowledge it’s way easier to come up with fully automated benchmarks.
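Roughly what I have in mind is something like this, a minimal sketch of an automated factual-knowledge check against a local OpenAI-compatible endpoint (the kind llama.cpp’s server or Ollama expose). The URL, model name, questions file, and the crude scoring rule are all my own assumptions for illustration, not anything from a real setup:

```python
# Minimal sketch: score a local model on a hand-written factual QA set.
# Assumes an OpenAI-compatible chat endpoint (e.g. llama.cpp server / Ollama)
# and a hypothetical questions.json of [{"question": "...", "answer": "..."}].
import json
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local server
MODEL = "my-local-model"                                 # assumed model name

with open("questions.json") as f:
    questions = json.load(f)

correct = 0
for item in questions:
    resp = requests.post(ENDPOINT, json={
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "Answer in one short phrase, no explanation."},
            {"role": "user", "content": item["question"]},
        ],
        "temperature": 0,
    })
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"].strip()
    # Naive substring match; real scoring would need normalization or a judge.
    if item["answer"].lower() in answer.lower():
        correct += 1

print(f"{correct}/{len(questions)} correct")
```

That kind of loop is easy for facts or code with a clear pass/fail, but it doesn’t capture the style and verbosity stuff at all, which is exactly the part I end up judging by hand.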