Baidu’s Deep Voice System Is Capable of Synthesizing Human Speech in Real Time

11 Mar

By Anmol Sachdeva, The Tech Portal

Baidu, often referred to as the Google of China, recently opened a research center in Silicon Valley focused on building AI and self-driving technologies. There’s a chance that you might’ve already heard about the company’s autonomous-driving efforts, but today it shared an update on another project: its new text-to-speech technology.

The company has shared some surprising details about this project via MIT Technology Review. Baidu says that its team of researchers has managed to develop a text-to-speech system that is faster and more efficient than Google DeepMind’s WaveNet technology. They achieved this by cutting down on the behind-the-scenes work required to fine-tune the model.

Instead, they developed a deep-learning model that enables the text-to-speech system to learn to talk within a few hours, with little or no human intervention. Baidu’s system is called Deep Voice and is presented as an improvement on Google’s WaveNet technology. It may require some initial human fine-tuning during the training period, but after that it can synthesize a human voice all by itself.

Google’s WaveNet was also capable of synthesizing human speech and was first shown off in September last year. However, it is said to be quite computationally demanding, and it is unclear whether the system can generate speech natural enough for use in real-world applications. Baidu says that it has solved this problem by building a simple deep-learning technique that converts text into the smallest units of sound, called phonemes. It can then use Deep Voice to understand and reproduce the speech pattern.

Take, for example, the word “hello.” Baidu’s system first has to work out the phoneme boundaries in the following way: “(silence, HH), (HH, EH), (EH, L), (L, OW), (OW, silence).” It then feeds these into a speech synthesis system, which utters the word.
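To make the “hello” example concrete, here is a minimal Python sketch of how such boundary pairs could be formed from a list of phonemes. This is an illustration only, not Baidu’s code: the function name `phoneme_boundaries` and the `"silence"` padding label are assumptions made for this example, and the phoneme list for “hello” is simply typed in by hand rather than produced by a real text-to-phoneme model.

```python
def phoneme_boundaries(phonemes, pad="silence"):
    """Pair each phoneme with its neighbor, padding both ends with silence.

    Hypothetical helper for illustration; not part of Deep Voice.
    """
    padded = [pad, *phonemes, pad]
    # zip the sequence with itself shifted by one to get adjacent pairs
    return list(zip(padded, padded[1:]))


# Phonemes for "hello", entered by hand as in the article's example
hello = ["HH", "EH", "L", "OW"]
pairs = phoneme_boundaries(hello)

for left, right in pairs:
    print(f"({left}, {right})")
```

Running it prints the same five pairs listed above, from (silence, HH) through (OW, silence).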

Though the deep-learning model has taken control over most functions, Baidu still has control over some variables, such as the stress on the phonemes, their duration, and the natural frequency of the sound. This means the system is capable of synthesizing natural and realistic human speech, even conveying different emotions. It all depends on how Baidu adjusts these variables to produce particular sounds and voices (can it make me sound like Pierce Brosnan?).

The researchers at the Chinese giant’s AI Lab in Silicon Valley add that although they have improved on an existing system, it still requires a great deal of computational power. For realistic human speech synthesis, the system needs to maintain a sampling rate in the region of 48 kHz, which means generating each audio sample in about 20 microseconds. The company has already tested the model and says it produced a ‘high quality’ result according to crowdsourced ratings. Talking about the same, a Baidu researcher said,

To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units.

We optimize inference to faster-than-real-time speeds, showing that these techniques can be applied to generate audio in real-time in a streaming fashion.
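The roughly 20-microsecond figure mentioned earlier follows directly from the sampling rate: at 48 kHz there are 48,000 audio samples per second, so a real-time system gets about one forty-eight-thousandth of a second per sample. The short Python sketch below just does that arithmetic; the function name `sample_budget_us` is made up for this example and has nothing to do with Baidu’s implementation.

```python
def sample_budget_us(sample_rate_hz):
    """Time available per audio sample, in microseconds, for real-time output.

    Hypothetical helper for illustration only.
    """
    return 1_000_000 / sample_rate_hz


budget = sample_budget_us(48_000)
print(f"{budget:.1f} microseconds per sample")
```

At 48 kHz this works out to a little under 21 microseconds per sample, which is why the researchers stress keeping the model in the processor cache and never recomputing results.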

This is a significant development on Baidu’s part and suggests that realistic, human-like speech synthesis is possible in the long run. It adds weight to our fantasies of living alongside cybernetic beings with human-like voices, emotions, and characteristics.