They said theirs is “comparable with the 8-bit models”. It’s all tradeoffs, and it isn’t clear to me where best to allocate the compute/memory budget. I’ve noticed that full 7B 16-bit models often produce better results for me than some much larger quantized models. It will be interesting to find the sweet spot.
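As a rough back-of-the-envelope sketch of that tradeoff (weight memory only; the parameter counts and bit widths here are illustrative, and real usage also depends on KV cache, activations, and quantization overhead):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A full-precision 7B model vs a much larger quantized one:
print(f"7B  @ 16-bit:  {weight_memory_gb(7, 16):.1f} GB")    # 14.0 GB
print(f"70B @  4-bit:  {weight_memory_gb(70, 4):.1f} GB")    # 35.0 GB
# BitNet-style ternary (~1.58-bit) weights at the larger size:
print(f"70B @ 1.58-bit: {weight_memory_gb(70, 1.58):.1f} GB")
```

So for a fixed memory budget, the question is whether more (lower-precision) parameters beat fewer full-precision ones — which is exactly the sweet spot that isn’t obvious yet.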
Apparently I am an idiot and read the wrong paper. It was the previous paper that claimed to be “comparable with the 8-bit models”:
https://huggingface.co/papers/2310.11453