Voxabot Blog

Common TTS Use Cases

February 16, 2021

Text-to-Speech is particularly well suited to creating MP3 audio files that need to be broadcast immediately. With Voxabot you can render consistent voices in many languages that never get tired and are available 24/7.

In this article we have compiled some use case samples which you might find inspiring:


Ski resort snow voice report

In the following sample we have recorded a snow report sample for skiers visiting Lake Louise ski resort.

Text

Good Morning! Today is 17th December 2020. 
Today's conditions are looking great with continued warm temps and some additional snowfall expected. Snow coverage is at a season high, and with over 150 open trails (the most ever in our history), the options to explore are nearly endless.
*Please remember to plan ahead. There is minimal indoor space available, and dining options are takeout only.
IMPORTANT UPDATES & REMINDERS
On December 8, the Province of Alberta announced an increased level of restrictions due to the current state of COVID-19. With this in mind, please check important updates and reminders at www.skilouise.com
Have a great day!

TTS options used

TTS Engine: Amazon
Neural voice
Custom SSML: <amazon:domain name="news">
Language: English US
Voice: Matthew
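
As a rough illustration of how the options above could translate into markup, the following Python sketch wraps a report in Amazon's news-domain SSML tag. The helper name `build_news_ssml` is hypothetical and not part of Voxabot; it only builds a string and does not call any TTS service.

```python
from xml.sax.saxutils import escape


def build_news_ssml(text: str) -> str:
    """Wrap plain text in Amazon's newscaster-style SSML tag.

    Hypothetical helper for illustration; it only builds a string.
    """
    # Escape &, < and > so that characters such as the "&" in
    # "UPDATES & REMINDERS" do not break the SSML markup.
    safe_text = escape(text)
    return (
        "<speak>"
        '<amazon:domain name="news">'
        f"{safe_text}"
        "</amazon:domain>"
        "</speak>"
    )


report = "Good Morning! IMPORTANT UPDATES & REMINDERS. Have a great day!"
ssml = build_news_ssml(report)
print(ssml)
```

With boto3, a string like this could be passed to Amazon Polly's `synthesize_speech` call with `TextType="ssml"`, `Engine="neural"` and a neural voice ID; whether Voxabot does this internally is an assumption, not something stated in this post.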

Train Audio Announcement

In the following sample we have recorded a train audio announcement sample in Catalan, Spanish and English.

Text

En uns minuts arribarem a l'estació de Barcelona Sants. Aquesta és l'última parada. Recordi recollir l'equipatge i gràcies per viatjar amb nosaltres.
En unos minutos llegaremos a la estación de Barcelona Sants. Esta es la última parada. No se olvide de recoger su equipaje y gracias por viajar con nosotros.
In a few minutes we will reach Barcelona Sants station. The train stops there. Please don't forget to collect your belongings and thank you for travelling with us.

TTS options used

TTS Engine: Microsoft
Languages: Catalan, Spanish Spain, English UK

Voices: Helena RUS, Helena Neural, Mia Neural
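
A multilingual announcement like this one can be expressed as a single SSML document with one `<voice>` element per language. The sketch below is only an assumption about how such a document might be assembled: the function name and the voice identifiers are hypothetical placeholders, not real Microsoft voice names, and the texts are shortened from the sample above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_multivoice_ssml(segments, default_lang="en-GB"):
    """Build one SSML document with a <voice> element per segment.

    segments: list of (voice_name, text) pairs.
    Hypothetical helper for illustration only.
    """
    voices = "".join(
        f"<voice name={quoteattr(name)}>{escape(text)}</voice>"
        for name, text in segments
    )
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(default_lang)}>"
        f"{voices}</speak>"
    )


# Placeholder voice identifiers: replace with the names your TTS engine lists.
segments = [
    ("placeholder-catalan-voice", "Gràcies per viatjar amb nosaltres."),
    ("placeholder-spanish-voice", "Gracias por viajar con nosotros."),
    ("placeholder-english-voice", "Thank you for travelling with us."),
]
ssml = build_multivoice_ssml(segments)

# Parsing raises ParseError if the generated markup is not well-formed XML.
root = ET.fromstring(ssml)
print(len(root), "voice segments")
```

Keeping the three languages in one document means the announcement is rendered as a single audio file, which is convenient when it has to be played back as one message.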

Inflight passenger announcement

In the following sample we have created a standard inflight passenger announcement which the airline could easily customize.


Text


Ladies and gentlemen, the Captain has turned on the Fasten Seat Belt sign. If you haven’t already done so, please stow your carry-on luggage underneath the seat in front of you or in an overhead bin. Please take your seat and fasten your seat belt. And also make sure your seat back and folding trays are in their full upright position.
If you are seated next to an emergency exit, please read carefully the special instructions card located by your seat. If you do not wish to perform the functions described in the event of an emergency, please ask a flight attendant to reseat you.
We remind you that this is a non-smoking flight. Smoking is prohibited on the entire aircraft, including the lavatories. Tampering with, disabling or destroying the lavatory smoke detectors is prohibited by law.
If you have any questions about our flight today, please don’t hesitate to ask one of our flight attendants. Thank you.

TTS options used

TTS engine: Microsoft

Language: EN-US

Voice: Jessa Neural

Department Store Public Announcement

In this sample, a department store announces that a lost child named Gilbert has been found in its computer section.


Text

Dear customers, 

We have a lost boy named Gilbert who was found in the computer section of our store and he’s looking for his mother. He’s about five years old and he’s wearing a blue t-shirt and blue pants. You can find him at the check-out counter at the main entrance.
Thank you.

TTS options used

TTS engine: Google

Language: English US

Voice: Wavenet C

Today we Launch!

February 16, 2021

Today we have launched our updated site with our editor at app.voxabot.com! It has taken us over a year to get here because of COVID-19, but we feel it was worth the wait. Today we showcase our new features and functionality, including a universal TTS editor that uses SSML across Google, Amazon, and Microsoft Azure. Feel free to step in and give it a whirl!

The quest to create human-like speech synthesis

June 8, 2020
Cover photo by Mick Haupt @mickhaupt on Unsplash

An overview of 30 years of Text to Speech technologies

A long time ago..., at the end of the 20th Century

"While advances in natural language understanding and speech recognition have been impressive in the last decade, neither approaches the maturity of speech synthesis technology... This is not to say that there is no room for improvement. No one would mistake the best synthetic speech for a human being."  — Handbook of Human-Computer Interaction, 1988, Lynn A. Streeter, Bell Communication Research

A long time ago, at the end of the 20th century, speech synthesis was based on sets of rules that defined the formants of speech. The technology worked reasonably well and produced intelligible speech, but with a distinctive flat, robotic sound that no one would mistake for a human being.

Not so long ago..., the first 15 years of the 21st Century

Around the year 2000, with the increase in computing power, TTS technology evolved towards a data-driven approach in which the synthesized voice was built by extracting patterns from databases of recorded speech. There were two main approaches to extracting the speech patterns:

  • Concatenative synthesis (CTTS): In this model, a database of recorded speech is divided into very small units. The system then chooses the best sequence of units and splices them together to create the new utterance.
  • Statistical parametric synthesis (STTS): This model uses a generative algorithm to learn the distributions of the acoustic and prosodic parameters, and then generates a sequence of parameters to reproduce the speech.
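
The "choose the best sequence of units" step of concatenative synthesis is commonly framed as a shortest-path search that balances how well each unit matches the desired sound (target cost) against how smoothly neighbouring units join (join cost). The following Python sketch is a toy version under strong assumptions: each unit is described by a single pitch value, both costs are plain absolute differences, and all names (`Unit`, `select_units`) are hypothetical. Real systems use much richer features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    unit_id: str   # identifier of the recorded snippet
    label: str     # phonetic label of the snippet
    pitch: float   # toy acoustic feature (Hz)


def select_units(targets, database):
    """Pick the cheapest unit sequence with dynamic programming.

    targets: list of (label, desired_pitch) pairs.
    database: list of Unit objects cut from recorded speech.
    """
    candidates = [[u for u in database if u.label == label]
                  for label, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("no recorded unit for at least one target label")

    # best[i][j] = (cost of the best path ending in candidates[i][j], backpointer)
    best = [[(abs(u.pitch - targets[0][1]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            target_cost = abs(u.pitch - targets[i][1])
            # Join cost: pitch jump between the previous unit and this one.
            cost, back = min(
                (best[i - 1][k][0] + abs(prev.pitch - u.pitch), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost, back))
        best.append(row)

    # Backtrack from the cheapest final unit to recover the sequence.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))


database = [
    Unit("h_1", "h", 120), Unit("h_2", "h", 180),
    Unit("e_1", "e", 125), Unit("e_2", "e", 200),
    Unit("l_1", "l", 130), Unit("l_2", "l", 90),
]
targets = [("h", 120), ("e", 130), ("l", 130)]
chosen = [u.unit_id for u in select_units(targets, database)]
print(chosen)  # ['h_1', 'e_1', 'l_1']
```

The selected snippets would then be spliced together to form the output audio; that signal-processing step is outside the scope of this sketch.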

In general, the CTTS approach produced better speech quality than the STTS approach, but the latter had advantages in scalability and cost and became the most widely used approach. These technologies are still used today in many variants, e.g. SAPI, RUS, and hybrid TTS. But although the speech generated with the best of these models might sound human, no one would mistake it for a human being because of its monotonous prosody.

2017: the Neural TTS revolution begins

The use of neural network systems that "learn" the right speech patterns from recorded audio, and reproduce them in the speech output, is changing the text-to-speech paradigm. NTTS technology has made remarkable progress in the last couple of years, to the point that it can produce synthesized voices nearly indistinguishable from real human voices.

These are some of the buzzwords behind the advanced NTTS technologies being implemented today: the WaveNet neural vocoder, and neural architectures such as Tacotron/Tacotron 2, VoiceLoop, Deep Voice, and FastSpeech, among others. The different NTTS implementations and models each try to address known issues of neural TTS, for example the tendency of some systems to skip words, the time needed to generate the mel-spectrogram, and overall computational efficiency.

[Table: MOS]

Mean Opinion Score (MOS) of different TTS technologies, as reported in the original paper presenting Tacotron 2. The MOS is expressed as a number, where 1 is the lowest perceived quality and 5 is the highest. In this table, created for US English and with very short utterances, the MOS shows very little difference in perceived quality between Ground Truth and NTTS-generated speech samples.
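
For readers unfamiliar with the metric, a MOS is simply the average of listener ratings on the 1 to 5 scale, usually reported with a confidence interval. The sketch below shows one simple way to compute it, assuming a normal approximation for the 95% interval. The ratings are made up for illustration and are not taken from the Tacotron 2 paper, and the function name is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of 1-5 listener ratings.

    half_width is an approximate 95% confidence interval half-width.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


ratings = [5, 4, 4, 5, 3, 4, 5, 4]  # made-up scores, not from the paper
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

When two systems have overlapping intervals, as a small listening test like this one would likely produce, the difference in MOS should not be read as a clear difference in quality.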

The present and future of Neural TTS

We are at the beginning of the NTTS revolution. Given the wide range of applications for high-quality synthesized voices, from chatbots and virtual assistants to enhanced in-car navigation systems, it is not surprising that all the major IT players, including Google, Microsoft, IBM, Amazon, and Apple, are racing to create better neural TTS voices and to create them faster.

Image from the Microsoft research blog: https://www.microsoft.com/en-us/research/blog/toward-emotionally-intelligent-artificial-intelligence/

References

  • Z. Yan, Y. Qian and F. K. Soong, "Rich-context Unit Selection (RUS) approach to high quality TTS," 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, 2010, pp. 4798-4801, doi: 10.1109/ICASSP.2010.5495150.
  • Stas Tiomkin, David Malah, Slava Shechtman and Zvi Kons, "A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units," IEEE Transactions on Audio, Speech & Language Processing, vol. 19, 2011, pp. 1278-1288, doi: 10.1109/TASL.2010.2089679.
  • Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior and Koray Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," 2016.
  • Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark and Rif A. Saurous, "Tacotron: Towards End-to-End Speech Synthesis," 2017.
  • Yaniv Taigman, Lior Wolf, Adam Polyak and Eliya Nachmani, "VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop," 2017.
  • Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta and Mohammad Shoeybi, "Deep Voice: Real-time Neural Text-to-Speech," 2017.
  • Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis and Yonghui Wu, "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," 2017.

© Copyright Voxabot 2020

A small homage to Stephen Hawking's synthesized voice

June 8, 2020
Cover photo by Nick Fewings @jannerboy62 on Unsplash

In 1993, the British Telecom Corporation made the following much-acclaimed advertisement. The advertisement conveys an inspirational message from Stephen Hawking to mankind, delivered in his distinctive voice generated by his computer and voice synthesizer.

"For millions of years, mankind lived just like the animals. Then something happened which unleashed the power of our imagination. We learned to talk, and we learned to listen. Speech has allowed the communication of ideas, enabling human beings to work together to build the impossible. Mankind's greatest achievements have come about by talking, and its greatest failures by not talking. It doesn't have to be like this. Our greatest hopes could become reality in the future. With the technology at our disposal, the possibilities are unbounded. All we need to do is make sure we keep talking." - Stephen Hawking
Video of Stephen Hawking in the 1993 BT advertisement "Keep Talking"

At the time when this advertisement was broadcast, Hawking was a worldwide celebrity and his iconic synthesized voice was already a part of his identity.

Regarding his synthesized voice, Hawking's official website states: "I use a separate hardware synthesizer, made by Speech Plus. It is the best I have heard, although it gives me an accent that has been described variously as Scandinavian, American or Scottish."

The speech synthesizer voice used by Hawking was created around 1980 by Dennis Klatt, who had worked on the development of DECtalk, one of the first technologies to convert text into speech. However, many years before the time of Hawking's death in 2018, Text to Speech technology had improved considerably and could produce a much less robotic sound. But as the story goes, Hawking refused to change it and preferred to use the original voice until the end of his life.

Today, TTS technologies allow us to replicate a person's voice from a few hours of recordings. Hawking lost his ability to speak in 1985 after a tracheotomy performed to save his life from an infection. Had this voice replication technology existed at the time, we can only speculate what Hawking would have chosen to do. What we know for certain, however, is that Hawking had a great sense of humor, as we can see in the video "Stephen Hawking's New Voice", created for the nonprofit organization Comic Relief.

Hawking's New Voice video created for the nonprofit organization Comic Relief
