Meta announces Voicebox, a generative model for multiple text-to-speech tasks


Last week, the artificial intelligence research branch of Meta Platforms presented Voicebox, a machine learning model capable of generating speech from text. What sets Voicebox apart from other text-to-speech models is its ability to perform many tasks it was not trained to do, including editing, noise removal, and style transfer.

The model was trained using a special method developed by Meta researchers. Although Meta has not released Voicebox due to ethical concerns over misuse, early results are promising and may fuel many applications in the future.

“Flow Matching”

Voicebox is a generative model that can synthesize speech in six languages: English, French, Spanish, German, Polish and Portuguese. Like large language models, it was trained on a very general task that can serve many applications. But whereas LLMs learn the statistical regularities of words and text sequences, Voicebox was trained to learn the patterns that map speech audio samples to their transcripts.

Such a model can then be applied to many downstream tasks with little or no fine-tuning. “The goal is to build a single model that can perform many text-guided speech generation tasks through in-context learning,” the Meta researchers write in their paper (PDF) describing Voicebox’s technical details.



The model was trained with Meta’s “Flow Matching” technique, which is more efficient and generalizable than the diffusion-based methods used in other generative models. The technique allows Voicebox “to learn from varied speech data without those variations needing to be carefully labeled.” Freed from the need for manual labeling, the researchers were able to train Voicebox on 50,000 hours of speech audio and transcripts from audiobooks.
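Flow matching itself is a published training technique: instead of simulating a diffusion process, the model regresses a velocity field along straight paths between noise and data. The sketch below shows a minimal conditional flow-matching objective on toy data; the `velocity_model` and all names are assumptions for illustration only, not Meta’s actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(velocity_model, x1):
    """Conditional flow-matching loss on a batch of data samples x1.

    A point x_t is drawn on the straight path from a noise sample x0
    to the data x1; the model is trained to regress the constant
    target velocity (x1 - x0) at that point.
    """
    x0 = rng.standard_normal(x1.shape)      # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))  # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1             # point along the path
    target = x1 - x0                        # velocity to regress
    pred = velocity_model(x_t, t)
    return float(np.mean((pred - target) ** 2))

# Toy "model" that always predicts zero velocity, just to run the loss.
x1 = rng.standard_normal((64, 8))
loss = flow_matching_loss(lambda x, t: np.zeros_like(x), x1)
```

Because the regression target is a simple expected velocity rather than a score at every noise level, training avoids the costly simulation loops of diffusion-based methods.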

The model was trained with “text-guided speech infilling” as its objective, meaning it must predict a masked segment of speech given the surrounding audio and the complete text transcript. In practice, during training the model receives an audio sample and its corresponding text; part of the audio is masked, and the model tries to generate the masked portion using the surrounding audio and the transcript as context. By doing this over and over, the model learns to generate natural-sounding speech from text in a generalizable way.
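The masking step can be illustrated in a few lines of NumPy. The array here stands in for audio features, and `make_infilling_example` is a hypothetical helper, not part of any released Voicebox code.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_infilling_example(audio, mask_frac=0.3):
    """Mask a contiguous span of an audio feature sequence.

    Returns the masked input, the boolean mask, and the hidden target
    frames. During training, the model sees the masked audio plus the
    full transcript and must reconstruct the masked frames.
    """
    n = len(audio)
    span = int(n * mask_frac)
    start = int(rng.integers(0, n - span + 1))
    mask = np.zeros(n, dtype=bool)
    mask[start:start + span] = True
    masked_audio = audio.copy()
    masked_audio[mask] = 0.0   # hide the span from the model
    target = audio[mask]       # frames the model must predict
    return masked_audio, mask, target

audio = rng.standard_normal(100)  # stand-in for audio features
masked_audio, mask, target = make_infilling_example(audio)
```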

Cloning voices across languages, correcting speech errors, and more

Unlike generative models that are trained for a specific application, Voicebox can perform many tasks it was never explicitly trained for. For example, the model can use a two-second voice sample to generate speech for new text. Meta says this ability could be used to give a voice to people who are unable to speak, or to customize the voices of non-player game characters and virtual assistants.

Voicebox also performs style transfer in several ways. For example, you can provide the model with two audio samples and a text transcript. It will use the first audio sample as a style reference and regenerate the second to match the voice and pitch of the reference. Interestingly, the model can do the same across different languages, which could be used to “help people communicate in a natural and authentic way, even if they don’t speak the same languages.”

The model can also perform various editing tasks. For example, if a dog barks in the background while you are recording your voice, you can provide the audio and transcript to Voicebox and mask the segment with the background noise. The model will use the transcript to regenerate the masked part of the audio without the background noise.

The same technique can be used to edit speech. For example, if you mispronounce a word, you can mask that part of the audio sample and pass it to Voicebox along with a transcript of the corrected text. The model will generate the missing portion with the new text in a way that matches the surrounding voice and tone.
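Both editing workflows reduce to the same interface: mask the unwanted span, then let the model regenerate it conditioned on the corrected transcript. Since Voicebox is unreleased, the `infill` function below is a placeholder that merely silences the masked span, to show the shape of the workflow rather than a real API.

```python
import numpy as np

def infill(audio, mask, transcript):
    """Placeholder for a Voicebox-style infilling model.

    A real model would regenerate the masked frames so they speak the
    transcript in the surrounding voice; here we simply fill the span
    with silence to illustrate the interface.
    """
    edited = audio.copy()
    edited[mask] = 0.0
    return edited

# Recording with a flaw (barking dog, mispronounced word) in frames 40-60.
audio = np.ones(100)
mask = np.zeros(100, dtype=bool)
mask[40:60] = True

# The corrected transcript guides what the regenerated span should say.
edited = infill(audio, mask, transcript="the corrected sentence")
```

Note that everything outside the mask is passed through untouched, which is why the regenerated span blends with the speaker’s original voice.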

One of Voicebox’s interesting applications is speech sampling. The model can generate diverse speech samples from a single text sequence. This capability can be used to produce synthetic data for training other speech-processing models. “Our results show that speech recognition models trained on synthetic speech generated by Voicebox perform almost as well as models trained on real speech, with only a 1 percent error rate degradation as opposed to 45 to 70 percent degradation with synthetic speech from previous text-to-speech models,” Meta writes.

Voicebox also has limitations. Because it was trained on audiobook data, it does not transfer well to conversational speech, which is casual and contains non-verbal sounds. It also does not give full control over every attribute of the generated speech, such as voice style, pitch, emotion, and acoustic conditions. The Meta research team is exploring techniques to overcome these limitations.

An unreleased model

There are growing concerns about the threats posed by AI-generated content. For example, cybercriminals recently tried to scam a woman by calling her with an AI-generated voice that impersonated her grandson. Advanced text-to-speech systems such as Voicebox could be used for similar schemes or other nefarious acts, such as fabricating false evidence or manipulating real recordings.

“As with other powerful new AI innovations, we recognize that this technology has the potential for misuse and unintended harm,” Meta wrote on its AI blog. Due to these concerns, Meta has not released the model, but it has provided technical details of the architecture and training process in the paper. The paper also describes a classifier model that can detect speech generated by Voicebox, to help mitigate the risks of the model being misused.
