Introducing the Latest Technological Advancement in Azure Neural TTS: Uni-TTSv3
Microsoft announced updates to Neural TTS, a Speech capability of Cognitive Services. These updates include a multilingual voice (JennyMultilingualNeural) that can speak 14 languages and a preview feature in Custom Neural Voice that allows customers to create a brand voice that speaks different languages.
Read this Microsoft developer post for details on the technology behind these feature updates with Uni-TTSv3.
What is Uni-TTSv3 and how is it different from earlier Azure Neural TTS models?
Uni-TTSv3 is the latest neural acoustic model behind Azure Neural Text-to-Speech (TTS) and Custom Neural Voice. It’s designed to reimagine how voices are trained and deployed, especially for multilingual and custom brand voices.
The previous version, Uni-TTSv2, used a three-stage teacher–student training pipeline: training a teacher model, fine-tuning it, and then training student models. That approach worked, but it was complex and costly to scale, particularly for self-service Custom Neural Voice projects.
Uni-TTSv3 changes this in a few important ways:
- Simplified training: It’s a non-autoregressive model based on FastSpeech 2 that is trained directly from recordings, so it no longer needs the teacher–student process.
- Large, diverse base model: It’s trained on about 3,000 hours of human recordings across multiple speakers and locales, creating a strong multilingual base model that can then be fine-tuned for specific voices.
- Fine-grained prosody control: It includes phoneme-level style embeddings to better capture prosody (tone, breaks, emphasis), which improves naturalness and speaker similarity.
- Explicit control of voice and accent: It uses speaker IDs and locale IDs to control timbre and accent, which is key for brand consistency across markets.
In practice, this means Uni-TTSv3 delivers comparable or better quality than Uni-TTSv2, while making training faster, more stable, and easier to scale for many different custom voices.
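To make the idea of ID-based control more concrete, here is a minimal sketch of how phoneme-level style vectors, a speaker ID, and a locale ID could be combined into one conditioned input. It is not Microsoft's implementation: the table sizes, the `condition` function, and the choice to simply add the vectors together are all assumptions made for illustration, and the random tables stand in for embeddings a real model would learn.

```python
import random

HIDDEN = 4  # hypothetical embedding width, kept tiny for readability


def make_table(rows, seed):
    """Stand-in for a learned embedding table: rows x HIDDEN random floats."""
    rnd = random.Random(seed)
    return [[rnd.uniform(-1.0, 1.0) for _ in range(HIDDEN)] for _ in range(rows)]


PHONEMES = make_table(50, seed=1)  # hypothetical phoneme inventory
SPEAKERS = make_table(4, seed=2)   # hypothetical speaker IDs 0..3
LOCALES = make_table(3, seed=3)    # hypothetical locale IDs 0..2


def condition(phoneme_ids, styles, speaker_id, locale_id):
    """Add per-phoneme style plus utterance-level speaker and locale vectors.

    Additive conditioning is an assumption made for this sketch.
    """
    if len(phoneme_ids) != len(styles):
        raise ValueError("need one style vector per phoneme")
    speaker = SPEAKERS[speaker_id]  # same vector for every phoneme (timbre)
    locale = LOCALES[locale_id]     # same vector for every phoneme (accent)
    return [
        [p + s + spk + loc
         for p, s, spk, loc in zip(PHONEMES[pid], style, speaker, locale)]
        for pid, style in zip(phoneme_ids, styles)
    ]


# Same phonemes and speaker, two different locales.
ids = [1, 5, 7]
styles = [[0.0] * HIDDEN for _ in ids]
same_voice_locale_0 = condition(ids, styles, speaker_id=0, locale_id=0)
same_voice_locale_1 = condition(ids, styles, speaker_id=0, locale_id=1)
```

In this sketch, switching `locale_id` while holding `speaker_id` fixed changes only the locale term, which mirrors the goal of keeping a brand voice's timbre while changing its accent.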
How does Uni-TTSv3 help reduce training time and handle imperfect customer audio?
Uni-TTSv3 was built to make Custom Neural Voice more practical for production projects, where time, budget, and data quality are all constraints.
On training time and cost
- By removing the teacher–student pipeline and training the acoustic model directly, Uni-TTSv3 simplifies the training workflow.
- On the acoustic training portion, Uni-TTSv3 can cut training time by around 50% compared to the previous approach.
- Faster training translates into lower compute costs and shorter turnaround times for creating or updating a custom voice.
On handling non-ideal customer recordings
- Customer data often comes with background noise, inconsistent recording environments, or varying quality. Uni-TTSv3 introduces a denoising module in the fine-tuning stage to reduce the impact of these issues.
- The model is trained with phoneme-based alignments, which makes the training process more stable and helps avoid common problems like skipped or repeated words seen in some attention-based models.
- Extensive testing across more than 40 languages showed that Uni-TTSv3 delivers equal or better voice quality than Uni-TTSv2, even while training time is reduced.
For you, this means you can bring your own recordings, within reasonable quality expectations, and still get a robust, natural-sounding custom voice with less time and effort than before.
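The stability benefit of phoneme-based alignments mentioned above can be illustrated with the length regulator used in FastSpeech-style models. The source does not describe Uni-TTSv3's internals at this level, so treat the following as a generic sketch under that assumption; the function name and the toy inputs are hypothetical.

```python
def length_regulate(phoneme_states, durations):
    """Repeat each phoneme's state for its number of output frames.

    Every phoneme appears exactly once in the input, in order, so the
    expanded sequence cannot skip or revisit a phoneme the way a drifting
    attention alignment can.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([state] * n_frames)
    return frames


# Toy example: strings stand in for per-phoneme hidden vectors.
expanded = length_regulate(["h", "e", "l", "o"], [2, 3, 1, 2])
print(expanded)  # ['h', 'h', 'e', 'e', 'e', 'l', 'o', 'o']
```

Because the output length is fixed by the durations rather than discovered step by step, the whole sequence can also be generated in parallel, which is what "non-autoregressive" refers to.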
Can one custom voice speak multiple languages with Uni-TTSv3?
Yes. Uni-TTSv3 is specifically designed to support cross-lingual and multilingual voices, so a single voice can serve multiple markets.
How multilingual support works
- Uni-TTSv3 is trained on multi-speaker, multilingual datasets, with speaker IDs and locale IDs controlling timbre and accent.
- This allows a voice to speak multiple languages even if there are no recordings of that same human speaker in all target languages.
What’s available in Azure today
- On the Azure TTS platform, Uni-TTSv3 powers voices like JennyMultilingualNeural, which can speak 14 languages with a consistent timbre.
- Across supported languages, Jenny’s average Mean Opinion Score (MOS) is above 4.2 out of 5, indicating strong perceived naturalness.
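As a rough illustration of how one multilingual voice can be asked to switch languages, the sketch below builds an SSML document that wraps each sentence in a `lang` element inside a single `voice` element. The helper `build_multilingual_ssml` is hypothetical, and the locale codes and sample sentences are illustrative only; check the current Azure documentation for the locales this voice supports before relying on them.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_multilingual_ssml(segments, voice="en-US-JennyMultilingualNeural"):
    """Build SSML from (locale, text) pairs spoken by one voice.

    Text and attribute values are XML-escaped so user input cannot
    break the markup.
    """
    body = "".join(
        f"<lang xml:lang={quoteattr(locale)}>{escape(text)}</lang>"
        for locale, text in segments
    )
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


ssml = build_multilingual_ssml([
    ("en-US", "Welcome back."),
    ("de-DE", "Willkommen zurück."),
    ("fr-FR", "Bon retour."),
])
print(ssml)
```

The resulting string is what you would hand to an SSML-based synthesis call in the Speech SDK or REST API; sending it requires a Speech resource key and region, which are left out here.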
For Custom Neural Voice
- Uni-TTSv3 is integrated into Custom Neural Voice to enable cross-lingual voice creation.
- You can train a custom voice using speech samples in just one language, and then use that voice to speak additional languages supported by the system.
- This helps you avoid separate casting and recording sessions for each language, reducing both effort and cost when expanding to new markets.
Typical use cases include multilingual chatbots, IVR systems, read-aloud features, audiobooks, and translation apps where you want one consistent brand voice across different locales.

published by NOVA IT Solutions
Nova IT Solutions is a viable option for small and medium-sized businesses. Our goal is to become your number one source for IT Maintenance, Cloud Computing, Help Desk Support, System Upgrades, Data Migration, and IT Consulting. We are dedicated to providing you the very best of our services.
Nova IT Solutions has come a long way from its beginnings as a part-time IT company providing solutions to local businesses in Northern and Central Virginia. We now serve customers across the entire DMV region, and we are thrilled to offer affordable, tech-savvy solutions that can take your business to new heights.