Google has made its voice AI almost indistinguishable from a human voice. In a research paper published this month, the search giant detailed its second-generation text-to-speech system, called Tacotron 2, claiming the system achieves near-human accuracy when generating speech from text.
Tacotron 2 reads directly from text, and Google claims it can use context to correctly pronounce identically spelled words like ‘read’, respond to punctuation, and place stress on words.
The system relies on two deep neural networks. The first network translates the written text into a spectrogram, a visual representation of audio frequencies over time. The spectrogram is then fed to WaveNet, a product of Alphabet’s AI research lab DeepMind, which reads the chart and generates the corresponding audio.
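To make the intermediate representation concrete: a spectrogram is built by slicing audio into short overlapping frames and taking a Fourier transform of each one, giving frequency content over time. The sketch below is a generic short-time spectrogram in NumPy, not Google’s implementation (Tacotron 2 actually predicts mel-scaled spectrograms, and the frame length, hop size, and test tone here are illustrative choices):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Short-time magnitude spectrogram: rows are time frames, columns are frequency bins."""
    frames = [signal[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # One FFT per windowed frame; magnitudes show how much of each frequency is present
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

# One second of a 440 Hz tone at an 8 kHz sample rate stands in for recorded speech
t = np.linspace(0, 1, 8000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (time frames, frequency bins)
```

In Tacotron 2 the first network predicts such a time-frequency chart directly from text, and WaveNet then inverts it back into a waveform.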
Google demonstrated the system using two audio samples for each of two sentences: one spoken by a human hired by Google, the other generated by the AI. Google doesn’t say which is which.
However, if you view the page source of the Google Research website and look at the filenames of the audio samples, one of the filenames is labeled “gen”, indicating that it is the AI-generated sample.
There are other audio samples as well, demonstrating Tacotron 2’s ability to pronounce difficult words and names and to nuance its speech with punctuation and pauses. Capitalised words are stressed, just as a person would stress them when reading aloud.
The research paper Google published will have an immediate practical impact. WaveNet, first announced in 2016, is already used by Google Assistant to generate its voice, and Tacotron 2 will only add to the quality of the service.
The system, for now, can only mimic a female voice. Google would have to retrain the system to mimic a male voice or a different female voice, but that shouldn’t be much of a problem for Google.