Dutch ChatGPT speaks with a German accent (sometimes). If it’s on purpose, it’s mean. If it’s not, then it’s fascinating.
Either way, it’s safe to say that AI voice assistants have come a long way from Microsoft Sam. In fact, they’ve come a pretty long way since I studied speech technology a few years ago.
And I’m here to tell you about where we’ve landed.
We’ve been mythologizing synthesized speech since at least 1968, when HAL, the sentient computer in 2001: A Space Odyssey, first appeared on screen.

Far from a futuristic novelty, voice has since become standard: 89% of consumers factor voice support into their choice of device.
In other words, “Don’t just help me; talk to me”.
In this article I’ll discuss text-to-speech (TTS): the conversion of text into spoken audio. I’ll talk about what goes on under the hood, and the different ways this technology is used across industries.
What is Text-to-Speech?
TTS is the process of converting text into synthesized spoken audio. Early versions ranged from mechanical approximations of the human vocal tract to stitched-together audio recordings. Nowadays, TTS systems use deep neural networks to deliver dynamic, human-like utterances.
Different models exist for different use cases: real-time generation for conversational agents, controllable expression, and the ability to replicate a specific voice.
How does Text-to-Speech work?
TTS has three key steps: first, the input text is processed to spell out symbols, expressions, and abbreviations. The processed text is then passed through neural networks that convert it into an acoustic representation (a spectrogram). Finally, that representation is turned into speech.
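Strung together, those three steps look roughly like the sketch below. This is only a skeleton with placeholder functions (not any real library’s API), but it shows how the stages hand off to each other.

```python
import numpy as np

# A bare-bones skeleton of the three-step pipeline described above.
# These are placeholder functions, not a real library's API.

def normalize_text(raw_text: str) -> str:
    """Spell out symbols, numbers, and abbreviations."""
    return raw_text.replace("&", " and ")  # stand-in for real normalization rules

def acoustic_model(text: str) -> np.ndarray:
    """Map text to a mel spectrogram; here, a dummy array of a typical shape."""
    return np.zeros((80, 100))  # 80 frequency bands x 100 time steps

def vocoder(spectrogram: np.ndarray) -> np.ndarray:
    """Turn the spectrogram into a waveform; here, silence of the right length."""
    return np.zeros(spectrogram.shape[1] * 256)  # ~256 audio samples per frame

def text_to_speech(raw_text: str) -> np.ndarray:
    return vocoder(acoustic_model(normalize_text(raw_text)))

waveform = text_to_speech("Whoa, were you expecting that?")
print(waveform.shape)  # (25600,) samples of (silent) audio
```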
Like I mentioned, researchers have cycled through a number of approaches to TTS. The one we’ve landed on (and where I reckon we’ll stay for some time) uses neural network-based speech synthesis.
Modelling the layers of linguistic phenomena that influence an utterance (pronunciation, speed, intonation) is an involved task.

Even with the quasi-magical black-box capabilities of neural networks, a TTS system relies on a bunch of components to approximate speech.
It’s tough to pin down one exact pipeline; new technologies are popping up left and right, threatening to make their predecessors obsolete.
Still, a few general components exist in most TTS systems in one form or another.
1. Text Processing
Text processing, sometimes called text normalization, is the step where the TTS system determines which words will actually be uttered. Abbreviations, dates, and currency symbols are spelled out, and punctuation is eliminated.
This isn’t always trivial. Does “Dr.” mean doctor or drive? How about CAD? Canadian dollar or computer-aided design?
Natural language processing (NLP) can be employed in text processing to help predict the correct interpretation based on the surrounding context. It evaluates how the ambiguous term (for example, “Dr.”) fits into the sentence as a whole, so in the phrase “Dr. Perron advised against it”, NLP would resolve “Dr.” to “doctor”.
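As a toy illustration, a rule-based normalizer might disambiguate “Dr.” with a simple context check. This is a deliberately naive sketch; real systems rely on trained models or much larger rule sets:

```python
import re

# A toy, hand-rolled normalizer. Production systems use trained models or
# far larger rule sets; this just illustrates using context to expand an
# ambiguous abbreviation.

def normalize(text: str) -> str:
    def expand_dr(match: re.Match) -> str:
        following = match.group(1)
        # Heuristic: "Dr." before a capitalized word is probably a title.
        if following[0].isupper():
            return f"doctor {following}"
        return f"drive {following}"

    text = re.sub(r"\bDr\.\s+(\w+)", expand_dr, text)
    text = re.sub(r"\$(\d+(?:\.\d+)?)", lambda m: f"{m.group(1)} dollars", text)
    return text

print(normalize("Dr. Perron advised against it."))
# -> doctor Perron advised against it.
print(normalize("It costs $20 to park on Ocean Dr. today."))
# -> It costs 20 dollars to park on Ocean drive today.
```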
2. Linguistic Analysis
Once text is processed, the model shifts from “What should I say?” to “How should I say it?”
Linguistic analysis is the part of TTS responsible for interpreting how a sentence should be delivered in terms of pitch, tone, and duration. In other words:
- How long should each sound, syllable, or word be?
- Should the intonation rise? Fall?
- Which word is being emphasized?
- How can the change in volume reflect the intended emotion?
Why Prosody Matters
Story time: I had a brief gig consulting for a team building TTS models. It became apparent how much prosody makes or breaks the intelligibility of a sentence. I’ll show you what I mean.
The following are three deliveries of the sentence “Whoa, were you expecting that?”
The first is great. The pause after “Whoa”, the upwards inflection on the second syllable of “expecting” (ex-PEC-ting). 10/10.
The second just barely captures the question quality by inflecting up on the last word (“... expecting THAT”). Other than that, the syllables are more or less the same length, with no variation in volume or pitch. I’d send this one back to the drawing board.
The last one is an interesting case: the “Whoa” is great (loud, long, and with a falling contour). The rising inflection of the question happens over “were you”, and the rest basically holds a steady pitch throughout.
This is where many middle-of-the-road TTS systems stop: simple, but with a plausible delivery. The thing is, it’s not how you would say it, at least not in most contexts.
In older systems, these qualities were predicted by separate components: one model would calculate how long each sound should last, another would map out how the pitch should rise and fall.
Nowadays, things are blurrier.
Neural networks tend to learn these patterns on their own by internalizing the fine subtleties of massive training datasets.
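To make that older, modular picture concrete, here’s a hypothetical sketch of the kind of per-sound targets a duration model and a pitch model might output for the start of our example sentence. Every number is invented for illustration; modern end-to-end networks predict equivalents internally.

```python
from dataclasses import dataclass

# Hypothetical prosody targets an older, modular system might predict
# before synthesis. Every value below is invented for illustration.

@dataclass
class ProsodyTarget:
    phoneme: str
    duration_ms: float  # how long the sound is held (the duration model's job)
    pitch_hz: float     # fundamental frequency target (the pitch model's job)
    energy: float       # relative loudness, 0-1

# "Whoa," - a long, loud syllable with a falling pitch, then a pause.
whoa = [
    ProsodyTarget("W", 60, 180, 0.9),
    ProsodyTarget("OW", 220, 150, 0.9),       # long vowel, pitch falling
    ProsodyTarget("<pause>", 150, 0.0, 0.0),  # the comma becomes silence
]

total_ms = sum(p.duration_ms for p in whoa)
print(f"'Whoa,' lasts about {total_ms:.0f} ms")
```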
3. Acoustic Modelling
Acoustic modelling is where the normalized text (and predicted linguistic features, if any) is passed through a neural network that outputs an intermediate representation.
Spectrograms and Speech Representations
The intermediate representation is usually a spectrogram – the frequency-over-time representation of an audio signal – although that’s changing.
Here’s the representation generated by a TTS model from our input text “Whoa, were you expecting that?”:

This 2-dimensional image is actually 146 vertical slices, each containing 80 frequencies. The stronger frequencies are brighter, and the weaker ones are dark.
Here’s what the 10th time step (or column) looks like, rotated 90 degrees to the right:

You can see the individual frequencies and their energies.
At first glance the spectrogram doesn’t look like much, but some clear linguistic phenomena are present here:
- Those wavy, clearly defined lines are vowels or vowel-like sounds like /w/, /r/, and /l/.
- Dark spots represent silence. Those could be pauses for punctuation.
- Clumps of energy higher up represent noise, like the noise you hear in /s/, /sh/, and /f/.
In fact, you can even line the words up in the spectrogram if you look carefully.

Spectrograms, in their various forms, are widely used representations in speech technology because they’re a very good intermediate between raw speech and text.
Two recordings of the same sentence spoken by different speakers will have very different waveforms, but very similar spectrograms.
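If you want to poke at a spectrogram yourself, librosa can compute an 80-band mel spectrogram from any recording. The file name below is a placeholder; your clip will give a different number of columns than the 146 shown above.

```python
import librosa
import numpy as np

# Compute an 80-band mel spectrogram, similar in shape to the one above.
# "whoa.wav" is a placeholder path; use any short recording you have.
y, sr = librosa.load("whoa.wav", sr=22050)

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, n_fft=1024, hop_length=256)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, closer to what gets plotted

print(mel_db.shape)   # (80, num_frames): 80 frequency bands per time step
print(mel_db[:, 10])  # the 10th column, like the one pictured earlier
```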
4. Synthesizing Audio (Vocoding)
The synthesis stage is where the spectrogram is converted into audio.
The technology that does this conversion is called a vocoder. Modern vocoders are neural networks trained to reconstruct speech signals from their spectrogram representations.
The reason for splitting representation and speech-signal modelling into separate modules comes down to control: the first is about accurately modelling the pronunciation and delivery of words, and the second is about the style and realism of the delivery.
With a spectrogram we can distinguish /s/ from /sh/, or /ee/ (as in heat) from /ih/ (as in hit), but the style and personality come from the fine details produced by the vocoder.
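Production systems use neural vocoders (WaveGlow, HiFi-GAN, and the like), but as a rough stand-in you can invert a mel spectrogram back to audio with librosa’s classical Griffin-Lim implementation. The result sounds noticeably robotic, which is exactly the gap neural vocoders close. The file name is the same placeholder as before.

```python
import librosa
import soundfile as sf  # separate dependency for writing audio files

# Classical, non-neural inversion of a mel spectrogram back to a waveform
# using Griffin-Lim. A neural vocoder does this job far more convincingly.
# "whoa.wav" is a placeholder path; use any short recording you have.
y, sr = librosa.load("whoa.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, n_fft=1024, hop_length=256)

audio = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
sf.write("reconstructed.wav", audio, sr)
```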
Here’s a comparison of different combinations of acoustic models and vocoders. It illustrates how researchers mix and match the two and optimize for the best overall result.
But again, as with all the other components, we’re seeing spectrograms phased out in favor of all-in-one models.
What are the Use Cases of TTS?
The ability to generate dynamic spoken language is an essential tool across industries.
It’s not only about sophisticated robot servants – it helps us achieve efficiency, accessibility, and safety.
Chatbots and Voice Assistants
You knew I was going to say it 😉
Between understanding your commands, updating your grocery lists, and setting appointments, it’s easy to take for granted the sophistication– and importance– of the synthesized speech in AI agents.
A good agent (i.e., a usable one) has to have a voice that fits the bill: welcoming enough to solicit commands, and human enough to make the user believe it can fulfill them.
Lots of research and engineering goes into winning over users in the split second it takes to decide whether or not an AI assistant sounds “right”.
On the business side of things: your chatbot represents your brand. Improvements in TTS technology mean better options for voice branding and more effective customer service.
Navigation and Transport
Nothing will make you realize the importance of good TTS like having your GPS unintelligibly mispronounce a street name while you’re driving.
GPS navigation is a great example of where TTS shines: our eyes are occupied, and delivering audible information is not only about convenience, but about safety.
This is also true in airports and public transport systems. For intricately designed, high-volume systems like train stations and airport terminals, synthesized speech is crucial.
Without TTS, we’re relying on live announcements, which are often hasty and unintelligible, or stitched-together recordings of names, terminals, times, etc., which are frankly hard to listen to.
With studies showing a strong link between naturalness and intelligibility, high-quality TTS is a must for a robust transport industry.
Entertainment and Media
Narration and multilingual media have become more available with improvements to synthetic speech technology.
Rather than replacing talent, speech technology helps augment dramatic performances.
Val Kilmer, having lost his voice to throat cancer, delivered a heartfelt performance in Top Gun: Maverick (2022) with an AI-recreated version of his original voice.
TTS also lets game developers give diverse, expressive utterances to non-playable characters (NPCs), an otherwise infeasible feat.
Healthcare
Improvements in TTS mean improvements to accessibility across the board.
Elder care technologies tackle the matter of companionship and assistance simultaneously. This technology relies on the customizability that TTS offers: compassionate tones, variable speeds, and careful intonation are all part of offering effective and dignified assistance.
TTS is also being used to improve accessibility amongst younger folk.
Acapela Group develops, among other things, technologies for children with speech production disorders. Synthetic speech augments their expressive capabilities and independence, while preserving their vocal characteristics.
Education and Inclusive Learning
We’ve come across synthetic speech in language learning apps. But that’s just the tip of the iceberg.
For example, one barrier to independent learning is the ability to read. For young children, people with visual impairments, and people with certain learning disabilities, that isn’t a given. This puts a lot of onus on overworked teachers in overcrowded classrooms.
A school district in California has implemented TTS to create a more inclusive learning environment for students with special needs.
Much as with elder care, educational technology relies on compassionate voices delivering with pristine clarity and emphasis. Modifiable parameters make it possible for teachers to integrate these technologies into their lessons, helping students feel more included.
Get the Best TTS for Your Needs
No matter your industry, it’s safe to say that voice AI is relevant. And the TTS you implement quite literally speaks for your business, so it needs to be reliable and customizable.
Botpress lets you build powerful, highly-customizable bots with a suite of integrations and deployment across all common communication channels. Your voice agent won’t only impress, it’ll work.
Start building today. It’s free.