Yeah, I was over-enthusiastic based on their cherry-picked examples. SeamlessExpressive still leaves a lot to be desired.
It has a limited range of emotions and can’t change emotion in the middle of the clip. It can’t produce the pitch shifts of someone talking excitedly, making the output sound monotonous. Background noise in the input causes a raspy, distorted output voice. Sighs, inter-sentence breaths, etc. aren’t reproduced. Sometimes the sentence pacing is just completely unnatural, with missing pauses or pauses in bad places (e.g. before the sentence-final verb in German).
IMO their manual dataset creation is holding them back. If I were in this field, I would try to follow the LLM route: start with a next-token predictor trained indiscriminately on large-scale speech+text data (e.g. TV shows, movies, news radio, all with subtitles, even if the subs have to be AI-generated), fine-tune it for specific tasks (mainly predicting and generating based on “style tokens” for speaker, emotion, accent, and pacing), then generate a massive “textbook”-style synthetic dataset. The translation aspect could be almost entirely outsourced to LLMs or multilingual subtitles.
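To make the “style tokens” idea concrete, here’s a minimal sketch of what the conditioning could look like: discrete control tokens for speaker, emotion, accent, and pacing get prepended to the content token stream, so a plain next-token predictor learns to generate conditioned on them. Every token name and the vocabulary here are hypothetical, not from any real system.

```python
# Hypothetical sketch of style-token conditioning for a next-token
# predictor. All token names and the vocabulary are made up.

STYLE_VOCAB = {
    "speaker": ["<spk:0>", "<spk:1>", "<spk:2>"],
    "emotion": ["<emo:neutral>", "<emo:excited>", "<emo:sad>"],
    "accent":  ["<acc:us>", "<acc:de>", "<acc:uk>"],
    "pacing":  ["<pace:slow>", "<pace:normal>", "<pace:fast>"],
}

# Defaults let the model also see "unconditioned" examples during
# pretraining, when style labels are missing or unreliable.
DEFAULTS = {"speaker": "<spk:0>", "emotion": "<emo:neutral>",
            "accent": "<acc:us>", "pacing": "<pace:normal>"}

def build_sequence(style: dict, content_tokens: list) -> list:
    """Prefix content tokens with one style token per attribute."""
    prefix = []
    for attr in ("speaker", "emotion", "accent", "pacing"):
        tok = style.get(attr, DEFAULTS[attr])
        if tok not in STYLE_VOCAB[attr]:
            raise ValueError(f"unknown {attr} token: {tok}")
        prefix.append(tok)
    return prefix + ["<bos>"] + content_tokens + ["<eos>"]

seq = build_sequence(
    {"emotion": "<emo:excited>", "pacing": "<pace:fast>"},
    ["hel", "lo", "_world"],
)
# seq: ["<spk:0>", "<emo:excited>", "<acc:us>", "<pace:fast>",
#       "<bos>", "hel", "lo", "_world", "<eos>"]
```

The point of the prefix scheme is that at inference time you can mix and match attributes freely (or extract them from a reference clip), which is exactly what the manual-dataset approach struggles to cover.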
The funny thing is that YouTube’s code is already so laggy that we all believed this without a second thought.