Hmm. I tried doing benchmarks early on, around the time Llama 2 was a thing… I followed the Reddit discussions. Then at some point I wanted to replace Mistral-Nemo with something newer, but I disliked how every other model had switched to the ChatGPT / sycophant style of talking… Evaluating that is a massively laborious undertaking, though. The official benchmarks don’t cover any of it, and there’s no good way to automate it either. So I spent half a day reading output manually and rating it in an Excel spreadsheet. With some success, but it’s way too complicated. So I mainly eyeball it these days, and sometimes there are recommendations somewhere on the internet. And I’ve learned to accept that chatbots always go on and on with redundant information unless I tell them: “skip the bullshit, I have an appointment at the hairdresser in 10 minutes and you need to explain it in 3 sentences.” 😄
I suppose for tasks like coding or factual knowledge, it’s way easier to come up with fully automated benchmarks.
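To illustrate what I mean by "fully automated" for factual knowledge, here is a rough sketch, not a real benchmark. It assumes the model can be wrapped in a function that takes a question and returns a string. `fake_model`, `CASES` and the substring-match scoring are all made up for the example; a real setup would call an actual model and would need far more cases and stricter answer checking.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, trim whitespace, and drop trailing punctuation for lenient matching."""
    return text.strip().lower().rstrip(".!?")


def run_benchmark(ask_model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases whose expected answer appears in the model's reply."""
    hits = 0
    for question, expected in cases:
        reply = normalize(ask_model(question))
        if normalize(expected) in reply:
            hits += 1
    return hits / len(cases)


def fake_model(question: str) -> str:
    """Hypothetical stand-in for a real model call; it answers one question wrong on purpose."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "How many legs does a spider have?": "Six.",
    }
    return canned.get(question, "I don't know.")


# Made-up test cases: (question, expected answer substring)
CASES = [
    ("What is the capital of France?", "Paris"),
    ("How many legs does a spider have?", "eight"),
]

score = run_benchmark(fake_model, CASES)
print(f"accuracy: {score:.0%}")
```

Nothing like this works for the "does it talk like a sycophant" question, since there is no expected string to compare against, which is why that part stays manual.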