SeamlessM4T


SeamlessM4T is our foundational all-in-one Massively Multilingual and Multimodal Machine Translation model delivering high-quality translation for speech and text in nearly 100 languages.

SeamlessM4T models support the tasks of:

  • Speech-to-speech translation (S2ST)
  • Speech-to-text translation (S2TT)
  • Text-to-speech translation (T2ST)
  • Text-to-text translation (T2TT)
  • Automatic speech recognition (ASR)

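The task acronyms encode an input-to-output modality pair. As a quick mnemonic, here is a minimal pure-Python sketch of that mapping (illustrative only; the `TASKS` table and `modalities` helper are not part of the seamless_communication API):

```python
# Map each SeamlessM4T task code to its (input, output) modalities.
TASKS = {
    "s2st": ("speech", "speech"),
    "s2tt": ("speech", "text"),
    "t2st": ("text", "speech"),
    "t2tt": ("text", "text"),
    "asr":  ("speech", "text"),  # transcription, same language in and out
}

def modalities(task: str) -> tuple[str, str]:
    """Return the (input_modality, output_modality) pair for a task code."""
    return TASKS[task.lower()]
```

For example, `modalities("S2TT")` returns `("speech", "text")`: speech in, translated text out.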
🌟 We are releasing SeamlessM4T v2, an updated version with our novel UnitY2 architecture. This new model improves over SeamlessM4T v1 in quality as well as inference latency in speech generation tasks.

To learn more about the collection of SeamlessM4T models, the approach used in each, their language coverage, and their performance, visit the SeamlessM4T README or 🤗 Model Card.

Code: https://github.com/facebookresearch/seamless_communication

  • Newtra@pawb.social
    1 year ago

    This is so exciting!

    I can’t wait to see how well the Expressive model does on anime and foreign films. I wouldn’t be surprised if this was the end of terrible dubs.

    This is gonna be great for language learning as well. Finally being able to pick any media and watch it in any language. It might even be possible to rig it up to an LLM to tune the vocab to your exact level…

    • Newtra@pawb.social
      1 year ago

      Yeah, I was over-enthusiastic based on their cherry-picked examples. SeamlessExpressive still leaves a lot to be desired.

      It has a limited range of emotions and can’t change emotion in the middle of the clip. It can’t produce the pitch shifts of someone talking excitedly, making the output sound monotonous. Background noise in the input causes a raspy, distorted output voice. Sighs, inter-sentence breaths, etc. aren’t reproduced. Sometimes the sentence pacing is just completely unnatural, with missing pauses or pauses in bad places (e.g. before the sentence-final verb in German).

      IMO their manual dataset creation is holding them back. If I were in this field, I would try to follow the LLM route: start with a next-token predictor trained indiscriminately on large-scale speech+text data (e.g. TV shows, movies, news radio, all with subtitles, even if the subs need to be AI-generated); fine-tune it for specific tasks (mainly learning to predict and generate conditioned on "style tokens" for speaker, emotion, accent, and pacing); then generate a massive "textbook" synthetic dataset. The translation aspect could be almost completely outsourced to LLMs or multilingual subtitles.