Dutch ChatGPT speaks with a German accent (sometimes). If it’s on purpose, it’s mean. If it’s not, then it’s fascinating.
Either way, it’s safe to say that AI voice assistants have come a long way from Microsoft Sam. In fact, they’ve come a pretty long way since I studied speech technology a few years ago.
And I’m here to tell you about where we’ve landed.
We’ve been mythologizing synthesized speech since at least 1968, when HAL, the sentient computer in 2001: A Space Odyssey, first appeared on screen.

Far from a futuristic novelty, voice has since become standard: 89% of consumers factor voice support into their choice of device.
In other words, “Don’t just help me; talk to me”.
In this article I’ll discuss text-to-speech (TTS): the conversion of text into spoken audio. I’ll talk about what goes on under the hood, and the different ways this technology is used across industries.
What is Text-to-Speech?
TTS is the process of converting text into synthesized spoken audio. Early versions ranged from mechanical approximations of the human vocal tract to stitched-together audio recordings. Nowadays, TTS systems use deep neural networks to deliver dynamic, human-like utterances.
Different models exist for different use cases: real-time generation for conversational agents, controllable expression, and the ability to replicate a specific voice.
How does Text-to-Speech work?
TTS has three key steps: first, the input text is processed to spell out symbols, expressions, and abbreviations. The processed text is then passed through neural networks that convert it into an acoustic representation (a spectrogram). Finally, that representation is turned into speech.
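Strung together, those three steps look roughly like the sketch below. This is only a skeleton with placeholder functions (not any real library’s API), but it shows how the stages hand off to each other.

```python
import numpy as np

# A bare-bones skeleton of the three-step pipeline described above.
# These are placeholder functions, not a real library's API.

def normalize_text(raw_text: str) -> str:
    """Spell out symbols, numbers, and abbreviations."""
    return raw_text.replace("&", " and ")  # stand-in for real normalization rules

def acoustic_model(text: str) -> np.ndarray:
    """Map text to a mel spectrogram; here, a dummy array of a typical shape."""
    return np.zeros((80, 100))  # 80 frequency bands x 100 time steps

def vocoder(spectrogram: np.ndarray) -> np.ndarray:
    """Turn the spectrogram into a waveform; here, silence of the right length."""
    return np.zeros(spectrogram.shape[1] * 256)  # ~256 audio samples per frame

def text_to_speech(raw_text: str) -> np.ndarray:
    return vocoder(acoustic_model(normalize_text(raw_text)))

waveform = text_to_speech("Whoa, were you expecting that?")
print(waveform.shape)  # (25600,) samples of (silent) audio
```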
Like I mentioned, researchers have cycled through a number of approaches to TTS. The one we’ve landed on (and where I reckon we’ll stay for some time) uses neural network-based speech synthesis.
Modelling the layers of linguistic phenomena that influence an utterance (pronunciation, speed, intonation) is an involved task.

Even with the quasi-magical black-box capabilities of neural networks, a TTS system relies on a bunch of components to approximate speech.
It’s tough to pin down one exact pipeline; new technologies are popping up left and right, threatening to make their predecessors obsolete.
Still, a few general components exist in most TTS systems in one form or another.
1. Text Processing
Text processing, sometimes called text normalization, is the step where the TTS system determines which words will actually be uttered. Abbreviations, dates, and currency symbols are spelled out, and punctuation is eliminated.
This isn’t always trivial. Does “Dr.” mean doctor or drive? How about CAD? Canadian dollar or computer-aided design?
Natural language processing (NLP) can be employed in text processing to help predict the correct interpretation based on the surrounding context. It evaluates how the ambiguous term (for example, “Dr.”) fits into the sentence as a whole, so in the phrase “Dr. Perron advised against it”, NLP would resolve “Dr.” to “doctor”.
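As a toy illustration, a rule-based normalizer might disambiguate “Dr.” with a simple context check. This is a deliberately naive sketch; real systems rely on trained models or much larger rule sets:

```python
import re

# A toy, hand-rolled normalizer. Production systems use trained models or
# far larger rule sets; this just illustrates using context to expand an
# ambiguous abbreviation.

def normalize(text: str) -> str:
    def expand_dr(match: re.Match) -> str:
        following = match.group(1)
        # Heuristic: "Dr." before a capitalized word is probably a title.
        if following[0].isupper():
            return f"doctor {following}"
        return f"drive {following}"

    text = re.sub(r"\bDr\.\s+(\w+)", expand_dr, text)
    text = re.sub(r"\$(\d+(?:\.\d+)?)", lambda m: f"{m.group(1)} dollars", text)
    return text

print(normalize("Dr. Perron advised against it."))
# -> doctor Perron advised against it.
print(normalize("It costs $20 to park on Ocean Dr. today."))
# -> It costs 20 dollars to park on Ocean drive today.
```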
2. Linguistic Analysis
Once text is processed, the model shifts from “What should I say?” to “How should I say it?”
Linguistic analysis is the part of TTS responsible for interpreting how a sentence should be delivered in terms of pitch, tone, and duration. In other words:
- How long should each sound, syllable, or word be?
- Should the intonation rise? Fall?
- Which word is being emphasized?
- How can the change in volume reflect the intended emotion?
Why Prosody Matters
Story time: I had a brief gig consulting for a team building TTS models. It became apparent how much prosody makes or breaks the intelligibility of a sentence. I’ll show you what I mean.
The following are three deliveries of the sentence “Whoa, were you expecting that?”
The first is great. The pause after “Whoa”, the upwards inflection on the second syllable of “expecting” (ex-PEC-ting). 10/10.
The second just barely captures the question quality by inflecting up on the last word (“... expecting THAT”). Other than that, the syllables are more or less the same length, with no variation in volume or pitch. I’d send this one back to the drawing board.
The last one is an interesting case: the “Whoa” is great (loud, long, and with a falling contour). The rising inflection of the question happens over “were you”, and the rest basically holds a steady pitch throughout.
This is where many middle-of-the-road TTS systems stop: simple, but with a plausible delivery. The thing is, it’s not how you would say it, at least not in most contexts.
In older systems, these qualities were predicted by separate components: one model would calculate how long each sound should last, another would map out how the pitch should rise and fall.
Nowadays, things are blurrier.
Neural networks tend to learn these patterns on their own by internalizing the fine subtleties of massive training datasets.
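To make that older, modular picture concrete, here’s a hypothetical sketch of the kind of per-sound targets a duration model and a pitch model might output for the start of our example sentence. Every number is invented for illustration; modern end-to-end networks predict equivalents internally.

```python
from dataclasses import dataclass

# Hypothetical prosody targets an older, modular system might predict
# before synthesis. Every value below is invented for illustration.

@dataclass
class ProsodyTarget:
    phoneme: str
    duration_ms: float  # how long the sound is held (the duration model's job)
    pitch_hz: float     # fundamental frequency target (the pitch model's job)
    energy: float       # relative loudness, 0-1

# "Whoa," - a long, loud syllable with a falling pitch, then a pause.
whoa = [
    ProsodyTarget("W", 60, 180, 0.9),
    ProsodyTarget("OW", 220, 150, 0.9),       # long vowel, pitch falling
    ProsodyTarget("<pause>", 150, 0.0, 0.0),  # the comma becomes silence
]

total_ms = sum(p.duration_ms for p in whoa)
print(f"'Whoa,' lasts about {total_ms:.0f} ms")
```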
3. Acoustic Modelling
Acoustic modelling is where the normalized text (and predicted linguistic features, if any) is passed through a neural network that outputs an intermediate representation.
Spectrograms and Speech Representations
The intermediate representation is usually a spectrogram – the frequency-over-time representation of an audio signal – although that’s changing.
Here’s the representation generated by a TTS model from our input text “Whoa, were you expecting that?”:

This 2-dimensional image is actually 146 vertical slices, each containing 80 frequencies. The stronger frequencies are brighter, and the weaker ones are dark.
Here’s what the 10th time step (or column) looks like, rotated 90 degrees to the right:

You can see the individual frequencies and their energies.
At first glance the spectrogram doesn’t look like much, but some clear linguistic phenomena are present here:
- Those wavy, clearly defined lines are vowels or vowel-like sounds like /w/, /r/, and /l/.
- Dark spots represent silence. Those could be pauses for punctuation.
- Clumps of energy higher up represent noise, like the noise you hear in /s/, /sh/, and /f/.
In fact, you can even line the words up in the spectrogram if you look carefully.

Spectrograms, in their various forms, are widely used representations in speech technology because they’re a very good intermediate between raw speech and text.
Two recordings of the same sentence spoken by different speakers will have very different waveforms, but very similar spectrograms.
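If you want to poke at a spectrogram yourself, librosa can compute an 80-band mel spectrogram from any recording. The file name below is a placeholder; your clip will give a different number of columns than the 146 shown above.

```python
import librosa
import numpy as np

# Compute an 80-band mel spectrogram, similar in shape to the one above.
# "whoa.wav" is a placeholder path; use any short recording you have.
y, sr = librosa.load("whoa.wav", sr=22050)

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, n_fft=1024, hop_length=256)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, closer to what gets plotted

print(mel_db.shape)   # (80, num_frames): 80 frequency bands per time step
print(mel_db[:, 10])  # the 10th column, like the one pictured earlier
```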
4. Synthesizing Audio (Vocoding)
The synthesis stage is where the spectrogram is converted into audio.
The technology that does this conversion is called a vocoder. Modern vocoders are neural networks trained to reconstruct speech signals from their spectrogram representations.
The reason for splitting representation and speech-signal modelling into separate modules comes down to control: the first is about accurately modelling the pronunciation and delivery of words, and the second is about the style and realism of the delivery.
With a spectrogram we can distinguish /s/ from /sh/, or /ee/ (as in heat) from /ih/ (as in hit), but the style and personality come from the fine details produced by the vocoder.
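Production systems use neural vocoders (WaveGlow, HiFi-GAN, and the like), but as a rough stand-in you can invert a mel spectrogram back to audio with librosa’s classical Griffin-Lim implementation. The result sounds noticeably robotic, which is exactly the gap neural vocoders close. The file name is the same placeholder as before.

```python
import librosa
import soundfile as sf  # separate dependency for writing audio files

# Classical, non-neural inversion of a mel spectrogram back to a waveform
# using Griffin-Lim. A neural vocoder does this job far more convincingly.
# "whoa.wav" is a placeholder path; use any short recording you have.
y, sr = librosa.load("whoa.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, n_fft=1024, hop_length=256)

audio = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
sf.write("reconstructed.wav", audio, sr)
```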
Here’s a comparison of different combinations of acoustic models and vocoders. It illustrates how researchers mix and match the two and optimize for the best overall result.
But again, as with all the other components, we’re seeing spectrograms phased out in favor of all-in-one models.
What are the Use Cases of TTS?
The ability to generate dynamic spoken language is an essential tool across industries.
It’s not only about sophisticated robot servants – it helps us achieve efficiency, accessibility, and safety.
Chatbots and Voice Assistants
You knew I was going to say it 😉
Between understanding your commands, updating your grocery lists, and setting appointments, it’s easy to take for granted the sophistication– and importance– of the synthesized speech in AI agents.
A good agent (i.e., a usable one) has to have a voice that fits the bill: welcoming enough to solicit commands, and human enough to make the user believe it can fulfill them.
Lots of research and engineering goes into winning over users in the split second it takes to decide whether or not an AI assistant sounds “right”.
On the business side of things: your chatbot represents your brand. Improvements in TTS technology mean better options for voice branding and more effective customer service.
Navigation and Transport
Nothing will make you realize the importance of good TTS like having your GPS unintelligibly mispronounce a street name while you’re driving.
GPS navigation is a great example of where TTS shines: our eyes are occupied, and delivering audible information is not only about convenience, but about safety.
This is also true in airports and public transport systems. For intricately designed, high-volume systems like train stations and airport terminals, synthesized speech is crucial.
Without TTS, we’re relying on live announcements, which are often hasty and unintelligible, or stitched-together recordings of names, terminals, times, etc., which are frankly hard to listen to.
With studies showing a strong link between naturalness and intelligibility, high-quality TTS is a must for a robust transport industry.
Entertainment and Media
Narration and multilingual media have become more available with improvements to synthetic speech technology.
Rather than replacing talent, speech technology helps augment dramatic performances.
Val Kilmer, having lost his voice to throat cancer, delivered a heartfelt performance in Top Gun: Maverick (2022) with an AI-recreated version of his original voice.
TTS also lets game developers give diverse, expressive utterances to non-playable characters (NPCs), an otherwise infeasible feat.
Healthcare
Improvements in TTS mean improvements to accessibility across the board.
Elder care technologies tackle the matter of companionship and assistance simultaneously. This technology relies on the customizability that TTS offers: compassionate tones, variable speeds, and careful intonation are all part of offering effective and dignified assistance.
TTS is also being used to improve accessibility amongst younger folk.
Acapela Group develops, among other things, technologies for children with speech production disorders. Synthetic speech augments their expressive capabilities and independence, while preserving their vocal characteristics.
Education and Inclusive Learning
We’ve come across synthetic speech in language learning apps. But that’s just the tip of the iceberg.
For example, one barrier to independent learning is the ability to read. For young children, people with visual impairments, and people with certain learning disabilities, that isn’t a given. This puts a lot of onus on overworked teachers in overcrowded classrooms.
A school district in California has implemented TTS to create a more inclusive learning environment for students with special needs.
Much as with elder care, educational technology relies on compassionate voices delivering with pristine clarity and emphasis. Modifiable parameters make it possible for teachers to integrate these technologies into their lessons, helping students feel more included.
Get the Best TTS for Your Needs
No matter your industry, it’s safe to say that voice AI is relevant. And the TTS you implement quite literally speaks for your business, so it needs to be reliable and customizable.
Botpress lets you build powerful, highly-customizable bots with a suite of integrations and deployment across all common communication channels. Your voice agent won’t only impress, it’ll work.
Start building today. It’s free.