Yeah, I was over-enthusiastic based on their cherry-picked examples. SeamlessExpressive still leaves a lot to be desired.
It has a limited range of emotions and can’t change emotion in the middle of the clip. It can’t produce the pitch shifts of someone talking excitedly, making the output sound monotonous. Background noise in the input causes a raspy, distorted output voice. Sighs, inter-sentence breaths, etc. aren’t reproduced. Sometimes the sentence pacing is just completely unnatural, with missing pauses or pauses in bad places (e.g. before the sentence-final verb in German).
IMO their manual dataset creation is holding them back. If I were in this field, I would try to follow the LLM route: start with a next-token predictor trained indiscriminately on large-scale speech+text data (e.g. TV shows, movies, news radio, all with subtitles, even if the subs have to be AI-generated), fine-tune it for specific tasks (mainly predicting and generating based on “style tokens” for speaker, emotion, accent, and pacing), then generate a massive “textbook”-style synthetic dataset. The translation aspect could be almost entirely outsourced to LLMs or multilingual subtitles.
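To make the “style tokens” idea concrete, here’s a minimal sketch of what the conditioning could look like: discrete control tokens for speaker, emotion, accent, and pacing get prepended to the content token stream, so a plain next-token predictor learns to generate conditioned on them. Every token name and the vocabulary here are hypothetical, not from any real system.

```python
# Hypothetical sketch of style-token conditioning for a next-token
# predictor. All token names and the vocabulary are made up.

STYLE_VOCAB = {
    "speaker": ["<spk:0>", "<spk:1>", "<spk:2>"],
    "emotion": ["<emo:neutral>", "<emo:excited>", "<emo:sad>"],
    "accent":  ["<acc:us>", "<acc:de>", "<acc:uk>"],
    "pacing":  ["<pace:slow>", "<pace:normal>", "<pace:fast>"],
}

# Defaults let the model also see "unconditioned" examples during
# pretraining, when style labels are missing or unreliable.
DEFAULTS = {"speaker": "<spk:0>", "emotion": "<emo:neutral>",
            "accent": "<acc:us>", "pacing": "<pace:normal>"}

def build_sequence(style: dict, content_tokens: list) -> list:
    """Prefix content tokens with one style token per attribute."""
    prefix = []
    for attr in ("speaker", "emotion", "accent", "pacing"):
        tok = style.get(attr, DEFAULTS[attr])
        if tok not in STYLE_VOCAB[attr]:
            raise ValueError(f"unknown {attr} token: {tok}")
        prefix.append(tok)
    return prefix + ["<bos>"] + content_tokens + ["<eos>"]

seq = build_sequence(
    {"emotion": "<emo:excited>", "pacing": "<pace:fast>"},
    ["hel", "lo", "_world"],
)
# seq: ["<spk:0>", "<emo:excited>", "<acc:us>", "<pace:fast>",
#       "<bos>", "hel", "lo", "_world", "<eos>"]
```

The point of the prefix scheme is that at inference time you can mix and match attributes freely (or extract them from a reference clip), which is exactly what the manual-dataset approach struggles to cover.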
The funny thing is that YouTube’s code is already so laggy that we all believed this without a second thought.