Introducing the Latest Technological Advancement in Azure Neural TTS: Uni-TTSv3

Microsoft announced updates to Neural TTS, a Speech capability of Cognitive Services. The updates include a multilingual voice (JennyMultilingualNeural) that can speak 14 languages, and a preview feature in Custom Neural Voice that lets customers create a brand voice that speaks multiple languages. Microsoft's developer post describes Uni-TTSv3, the technology behind these updates, in more detail.

Frequently Asked Questions

What is Uni-TTSv3 and how is it different from earlier Azure Neural TTS models?

Uni-TTSv3 is the latest neural acoustic model behind Azure Neural Text-to-Speech (TTS) and Custom Neural Voice. It changes how voices are trained and deployed, especially multilingual and custom brand voices.

The previous version, Uni-TTSv2, used a teacher–student training pipeline with three stages: training a teacher model, fine-tuning it, and then training student models. That approach worked, but it was complex and costly to scale, particularly for self-service Custom Neural Voice projects.

Uni-TTSv3 changes this in a few important ways:

Simplified training: It’s a non-autoregressive model based on FastSpeech 2 that is trained directly from recordings, so it no longer needs the teacher–student process.
Large, diverse base model: It’s trained on about 3,000 hours of human recordings across multiple speakers and locales, creating a strong multilingual base model that can then be fine-tuned for specific voices.
Fine-grained prosody control: It includes phoneme-level style embeddings to better capture prosody (tone, breaks, emphasis), which improves naturalness and speaker similarity.
Explicit control of voice and accent: It uses speaker IDs and locale IDs to control timbre and accent, which is key for brand consistency across markets.
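
To make the speaker ID and locale ID idea more concrete, here is a minimal sketch of one common way multi-speaker models apply such IDs: each ID selects a learned vector, and that vector is added to every phoneme representation. This is an illustration only, not Uni-TTSv3's actual architecture, which the announcement does not detail. The table names, ID values, vector size, and random vectors are all assumptions made for the example.

```python
import random

EMBED_DIM = 4  # tiny on purpose; real models use far larger vectors


def make_table(keys, seed):
    """Build a toy embedding table: one random vector per key."""
    rng = random.Random(seed)
    return {key: [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for key in keys}


# Hypothetical vocabularies, for illustration only.
PHONEMES = make_table(["HH", "AH", "L", "OW"], seed=0)
SPEAKERS = make_table(["brand_voice", "other_voice"], seed=1)
LOCALES = make_table(["en-US", "de-DE"], seed=2)


def condition(phonemes, speaker_id, locale_id):
    """Add the chosen speaker and locale vectors to every phoneme vector."""
    speaker = SPEAKERS[speaker_id]
    locale = LOCALES[locale_id]
    return [
        [p + s + l for p, s, l in zip(PHONEMES[ph], speaker, locale)]
        for ph in phonemes
    ]


hello = ["HH", "AH", "L", "OW"]
# Same speaker (timbre), two different locales (accent).
english = condition(hello, "brand_voice", "en-US")
german_accent = condition(hello, "brand_voice", "de-DE")
```

Because the speaker vector is the same in both calls, the only difference between the two results comes from the locale vector. That separation is what would let one brand voice keep its timbre while the accent changes per market.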

In practice, this means Uni-TTSv3 delivers comparable or better quality than Uni-TTSv2, while making training faster, more stable, and easier to scale for many different custom voices.

How does Uni-TTSv3 help reduce training time and handle imperfect customer audio?

Uni-TTSv3 was built to make Custom Neural Voice more practical for production projects, where time, budget, and data quality are all constrained.

On training time and cost

By removing the teacher–student pipeline and training the acoustic model directly, Uni-TTSv3 simplifies the training workflow.
For the acoustic training stage, Uni-TTSv3 can cut training time by around 50% compared with the Uni-TTSv2 pipeline.
Faster training translates into lower compute costs and shorter turnaround times for creating or updating a custom voice.
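
Note that the roughly 50% figure applies to the acoustic training portion, so the overall saving depends on how much of a project's total training time that portion represents. The arithmetic can be sketched as follows. The hour values are hypothetical and chosen only to make the calculation easy to follow, and `other_hours` stands in for any work outside acoustic training that the reduction does not affect.

```python
def total_hours(acoustic_hours, other_hours, acoustic_reduction=0.5):
    """Total training hours after shrinking only the acoustic stage."""
    if not 0.0 <= acoustic_reduction <= 1.0:
        raise ValueError("acoustic_reduction must be between 0 and 1")
    return acoustic_hours * (1.0 - acoustic_reduction) + other_hours


# Hypothetical project: 20 h of acoustic training, 10 h of everything else.
before = total_hours(20.0, 10.0, acoustic_reduction=0.0)  # 30.0 hours
after = total_hours(20.0, 10.0)                           # 20.0 hours
overall_saving = 1.0 - after / before                     # about one third
```

In this made-up case, halving the acoustic stage shortens the whole job by about a third rather than a half. The larger the share of acoustic training in your pipeline, the closer the overall saving gets to 50%.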

On handling non-ideal customer recordings

Customer data often comes with background noise, inconsistent recording environments, or varying quality. Uni-TTSv3 introduces a denoising module in the fine-tuning stage to reduce the impact of these issues.
The model is trained with phoneme-based alignments, which makes the training process more stable and helps avoid common problems like skipped or repeated words seen in some attention-based models.
Extensive testing across more than 40 languages showed that Uni-TTSv3 delivers equal or better voice quality than Uni-TTSv2, even while training time is reduced.
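
The stability benefit of phoneme-based alignment can be illustrated with the length regulator used in FastSpeech-style models. Each phoneme is assigned a duration in frames and is expanded exactly that many times, in order, so no phoneme can be skipped or repeated by a misbehaving attention mechanism. The sketch below shows only that expansion step. It is a simplified illustration rather than Uni-TTSv3's implementation, and the phoneme symbols and durations are made up.

```python
def length_regulate(phonemes, durations):
    """Expand each phoneme into `duration` consecutive frames, in order."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for phoneme, duration in zip(phonemes, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phoneme] * duration)
    return frames


# Hypothetical durations (in frames) for the phonemes of "hello".
frames = length_regulate(["HH", "AH", "L", "OW"], [2, 3, 1, 4])
# frames == ["HH", "HH", "AH", "AH", "AH", "L", "OW", "OW", "OW", "OW"]
```

The output always follows the input phoneme order, and its length is the sum of the durations. In a real model the durations would come from a trained predictor rather than being supplied by hand.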

For you, this means you can bring your own recordings, within reasonable quality expectations, and still get a robust, natural-sounding custom voice with less time and effort than before.

Can one custom voice speak multiple languages with Uni-TTSv3?

Yes. Uni-TTSv3 is specifically designed to support cross-lingual and multilingual voices, so a single voice can serve multiple markets.

How multilingual support works

Uni-TTSv3 is trained on multi-speaker, multilingual datasets, with speaker IDs and locale IDs controlling timbre and accent.
This allows a voice to speak multiple languages even if there are no recordings of that same human speaker in all target languages.

What’s available in Azure today

On the Azure TTS platform, Uni-TTSv3 powers voices like JennyMultilingualNeural, which can speak 14 languages with a consistent timbre.
Across supported languages, Jenny’s average Mean Opinion Score (MOS) is above 4.2 out of 5, indicating strong perceived naturalness.
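
As an example of how a multilingual voice is typically addressed, Azure's SSML lets you wrap text in a lang element inside a voice element to switch the spoken language. The sketch below only builds the SSML string and does not call the Speech service. The helper name `build_multilingual_ssml` is hypothetical, and the locales and sentences are illustrative, so check the current list of languages supported by the voice before relying on them.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NAMESPACE = "http://www.w3.org/2001/10/synthesis"


def build_multilingual_ssml(voice_name, segments, default_locale="en-US"):
    """Build an SSML document with one <lang> element per (locale, text) pair."""
    parts = [
        f'<speak version="1.0" xmlns="{SSML_NAMESPACE}" '
        f"xml:lang={quoteattr(default_locale)}>",
        f"<voice name={quoteattr(voice_name)}>",
    ]
    for locale, text in segments:
        # escape() protects characters such as & and < inside the spoken text.
        parts.append(f"<lang xml:lang={quoteattr(locale)}>{escape(text)}</lang>")
    parts.append("</voice></speak>")
    return "".join(parts)


ssml = build_multilingual_ssml(
    "en-US-JennyMultilingualNeural",
    [
        ("en-US", "Welcome back."),
        ("es-ES", "Bienvenido de nuevo."),
        ("fr-FR", "Bon retour."),
    ],
)
print(ssml)
```

The resulting string can be passed to whichever Speech SDK or REST call you use for SSML synthesis. The same voice name appears once, and each lang element selects the language for its own segment.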

For Custom Neural Voice

Uni-TTSv3 is integrated into Custom Neural Voice to enable cross-lingual voice creation.
You can train a custom voice using speech samples in just one language, and then use that voice to speak additional languages supported by the system.
This helps you avoid separate casting and recording sessions for each language, reducing both effort and cost when expanding to new markets.

Typical use cases include multilingual chatbots, IVR systems, read-aloud features, audiobooks, and translation apps where you want one consistent brand voice across different locales.
