For coding AI, it could make sense to specialize models by architecture, or to split functional/array styles from loopy solutions, or simply to ask 4 separate small models and then use a judge model to pick the best parts of each answer.
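A minimal sketch of that judge pattern, assuming hypothetical specialist model names and an ask() helper for whatever inference API is actually in use (none of these names come from the original post):

```python
# Sketch: fan one coding task out to several small specialist models,
# then let a judge model pick (or merge) the best answer.
# All model names and the ask() helper are hypothetical placeholders.

SPECIALISTS = ["arch-slm", "functional-slm", "loopy-slm", "general-slm"]

def ask(model: str, prompt: str) -> str:
    """Placeholder for whatever inference API you actually use."""
    raise NotImplementedError

def solve_with_judge(task: str, judge: str = "judge-llm") -> str:
    candidates = [ask(m, task) for m in SPECIALISTS]
    judge_prompt = (
        f"Task:\n{task}\n\n"
        + "\n\n".join(f"Candidate {i}:\n{c}" for i, c in enumerate(candidates))
        + "\n\nPick the best candidate, or combine their best parts, "
          "and return a single final solution."
    )
    return ask(judge, judge_prompt)
```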
This optimization is about smaller matrix multiplications: experts specialize on input token types, and while that split is easier to spread across resources (GPUs), it is not really specialization on "output domain" (type of work). All experts still need to be in memory.
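A rough sketch of that per-token expert routing, assuming a simple top-1 gate over a handful of experts (the sizes and gating scheme here are illustrative, not from the post):

```python
import numpy as np

# Sketch of mixture-of-experts routing: each token is sent to one small
# expert matrix instead of one big one, so each matmul is smaller,
# but every expert's weights must stay resident in memory.
# All sizes here are illustrative.

d_model, d_ff, n_experts = 64, 256, 4
rng = np.random.default_rng(0)

gate_w = rng.standard_normal((d_model, n_experts))   # router weights
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(n_experts)
]  # every expert kept in memory, even if a token never uses it

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    """tokens: (n_tokens, d_model) -> (n_tokens, d_model), top-1 routing."""
    scores = tokens @ gate_w                 # (n_tokens, n_experts)
    choice = scores.argmax(axis=-1)          # which expert each token goes to
    out = np.empty_like(tokens)
    for e, (w_in, w_out) in enumerate(experts):
        mask = choice == e
        if mask.any():                       # smaller matmul per expert
            out[mask] = np.maximum(tokens[mask] @ w_in, 0.0) @ w_out
    return out

print(moe_layer(rng.standard_normal((8, d_model))).shape)   # (8, 64)
```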
DeepSeek made a 7B math-focused LLM that beat far larger models on math benchmarks, even 540B math-specialist LLMs. More than any internal speed/structure "tricks", they achieved this through highly curated training data.
The small models we get now tend to just be pruned from larger generalist models. The paper/video suggests smaller models that are post-trained ("large tuned") to be domain specialists. A large model could then select among the domain-specialist models and only load those into memory, or act as a judge combining the outputs of the "sub models".
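A sketch of that select-then-load idea, assuming a hypothetical specialist registry, a lazy model cache, and placeholder ask()/load_model() helpers (none of this comes from the paper/video):

```python
# Sketch: a large "router" model picks a domain specialist, and only that
# specialist's weights get loaded into memory. All names are hypothetical.

SPECIALIST_REGISTRY = {
    "math": "slm-math-7b",
    "code": "slm-code-7b",
    "prose": "slm-prose-7b",
}
_loaded: dict[str, object] = {}    # lazy cache: load a specialist only on first use

def ask(model: object, prompt: str) -> str:
    raise NotImplementedError      # placeholder for your inference call

def load_model(name: str) -> object:
    raise NotImplementedError      # placeholder for loading weights from disk

def route(task: str, router: object) -> str:
    domains = ", ".join(SPECIALIST_REGISTRY)
    choice = ask(router, f"Which domain fits this task best ({domains})?\n{task}").strip()
    name = SPECIALIST_REGISTRY.get(choice, SPECIALIST_REGISTRY["prose"])
    if name not in _loaded:
        _loaded[name] = load_model(name)   # other specialists stay on disk
    return ask(_loaded[name], task)
```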
Where an LLM is a giant probabilistic classifier, there are much faster, more accurate, and less compute-intensive deterministic classifiers (expert/rule systems). Where SLMs have advantages, using even cheaper classification steps pushes in the same direction. A smaller LLM is automatically a faster classifier, and a cheaper alternative to using one giant hammer to bang on everything.
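A toy version of that cheaper deterministic step, assuming keyword rules in front of a hypothetical small-LLM fallback (the rules, labels, and slm_classify() are illustrative only):

```python
import re

# Sketch: try a cheap deterministic rule system first, and only fall back to
# a (still relatively cheap) small LLM when no rule fires. The rules and the
# slm_classify() fallback are illustrative placeholders.

RULES = [
    (re.compile(r"\b(refund|chargeback|invoice)\b", re.I), "billing"),
    (re.compile(r"\b(stack trace|exception|segfault)\b", re.I), "bug_report"),
    (re.compile(r"\b(password|2fa|login)\b", re.I), "account"),
]

def slm_classify(text: str) -> str:
    raise NotImplementedError      # placeholder for a small-LLM call

def classify(text: str) -> str:
    for pattern, label in RULES:   # deterministic, microseconds, no GPU
        if pattern.search(text):
            return label
    return slm_classify(text)      # probabilistic fallback only when needed

print(classify("My invoice shows a double charge"))   # -> "billing"
```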