Every voice is unique, yet speakers tend to adjust their delivery, or speaking style, according to their context and audience. Before Amazon Polly used Neural Text-to-Speech (NTTS) technology to build voices, standard text-to-speech (TTS) voices couldn’t change their speech patterns to match a particular speaking style. When Amazon Polly introduced NTTS, Newscaster was launched as the first speaking style.
Matthew and Joanna, two of the US English voices in the Amazon Polly portfolio, are now also available in a Conversational speaking style, which simulates the speech patterns of a friendly conversation. Similar to how people learn to talk as children, TTS voices acquire intonation patterns from natural speech data, then try to reproduce them in synthesized utterances. Amazon Polly’s NTTS technology, a neural network-based machine learning model, makes this learning possible. It can pick up nuances in various speaking styles and apply them when synthesizing text into speech.
Pillo Health is a startup that uses Amazon Polly to voice their in-home devices. Paige Baeder, Pillo Health’s product manager, says, “Pillo Health serves individuals who manage chronic conditions in the comfort of their home. Maintaining our community’s trust starts with each daily interaction. The Conversational version of Amazon Polly’s Joanna voice provides clarity and expression that inspires trust and is easy to understand, allowing us to connect with our users through a voice that brings Pillo (our in-home companion device) persona to life. Making the decision to switch to Joanna in Amazon Polly was easy—it was the top pick amongst all of our voice testers.”
Unlike traditional synthesis approaches that rely heavily on constructed rules, NTTS builds its own model based on the training data it is given. Dynamic intonation and expressiveness used to be obstacles because linguistic rules could not cover them, but now they are the key to voices sounding natural in NTTS. The system needs to recognize the diversity in speech in order to mimic it when generating speech. In the studio, Amazon Polly’s voice talents record in an engaging tone, as they would in normal day-to-day conversation. Characteristics of natural speech include reduced syllables, pitch changes, pauses, and contractions. The recording script for the training data is carefully designed around common utterances, which helps the talents deliver natural speech data.
The Conversational speaking style generally makes neural voices sound friendlier and more expressive. For example, listen to the following audio samples from Matthew, first in the neutral neural style (with no speaking style applied) and then in the Conversational speaking style:
Neutral sample (Matthew)
Conversational sample (Matthew)
In the Conversational speech sample, the word “sorry” is emphasized with a slight pause and added stress, which sounds more empathetic in this situation. The question also sounds friendlier in the Conversational version, providing a better user experience.
Here’s Joanna introducing the Conversational style:
Neutral sample (Joanna)
Conversational sample (Joanna)
To synthesize the Conversational style, enclose the input text in the SSML tag <amazon:domain name="conversational">, placed inside the usual <speak> element, and set the text type to ssml in the command line.
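As a rough sketch of what this looks like from the SDK side, the following Python example wraps plain text in the Conversational SSML tag and passes it to Amazon Polly through boto3. The helper names (conversational_ssml, synthesize_conversational), the default Region, and the output file name are illustrative assumptions, not part of the original post. The synthesis call also assumes that boto3 is installed and AWS credentials are configured.

```python
from xml.sax.saxutils import escape


def conversational_ssml(text: str) -> str:
    """Wrap plain text in the SSML tag that requests the Conversational style.

    Hypothetical helper for illustration only.
    """
    # escape() protects characters such as & and < that are reserved in SSML.
    return (
        "<speak>"
        '<amazon:domain name="conversational">'
        f"{escape(text)}"
        "</amazon:domain>"
        "</speak>"
    )


def synthesize_conversational(
    text: str,
    voice_id: str = "Matthew",               # Matthew or Joanna
    region_name: str = "us-east-1",          # assumed default Region
    out_path: str = "conversational.mp3",    # assumed output file name
) -> str:
    """Synthesize text in the Conversational style and save it as an MP3 file."""
    # Imported here so the SSML helper above works without the AWS SDK installed.
    import boto3

    polly = boto3.client("polly", region_name=region_name)
    response = polly.synthesize_speech(
        Engine="neural",      # speaking styles apply to neural voices
        VoiceId=voice_id,
        TextType="ssml",      # the input is SSML, not plain text
        Text=conversational_ssml(text),
        OutputFormat="mp3",
    )
    with open(out_path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
    return out_path
```

Calling synthesize_conversational("I'm sorry, could you repeat that?", voice_id="Joanna") would write the audio to conversational.mp3 in the current directory, provided the chosen Region supports the feature.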
You can use the Conversational speaking style with the US English voices Matthew and Joanna in the Amazon Polly console, the AWS CLI, or the SDK. The feature is currently available in the US East (N. Virginia), US West (Oregon), and EU (Ireland) Regions. For more information, see What Is Amazon Polly?
About the author
Chiao-ting Fang is a TTS language engineer for Amazon text-to-speech. She enjoys applying her linguistic knowledge at work to build better, more natural-sounding voices. She loves languages, traveling, and star-gazing.