Google has made having a natural conversation with a machine possible. At the I/O 2018 Developers Conference, Google showed the Google Assistant holding a remarkably convincing conversation over the phone to book a haircut appointment and to make a restaurant reservation. The ability for machines to speak naturally, indistinguishably from a human, has long been desired, and by the looks of it Google has made it possible. Needless to say, the ability depends heavily on machine learning and artificial intelligence, but those are buzzwords thrown around everywhere these days. Thanks to a detailed post on the new Google AI blog, we have some insight into how Google made it all possible.
The underlying technology that Google is using is called Google Duplex. It's named after a common communication system that lets two parties hold a two-way conversation, both listening and talking at the same time. It's like a two-lane highway where cars travel in both directions without interruption. In contrast, half duplex is like a radio conversation where you push a button to talk and release it to listen to the person on the other end. AI-based communication, so far, has relied on the half-duplex model.
To go full duplex, Google made advances in understanding, interacting, timing and speaking. Talking to robotic, computerised voices that don't understand the nuances of natural speech is quite frustrating. To change that, Google had to address several challenges computers face when trying to emulate natural speech: keeping latency within expected bounds, processing quickly, and generating natural-sounding speech with the right emphasis and intonation. Furthermore, when people talk to one another, they use complex sentences, correcting themselves mid-way, saying more than they need to, omitting certain words, relying on context, and sprinkling in plenty of 'umm's and 'aah's.
Doing all of that requires real advances, and based on the demo conversations we heard at the opening keynote of I/O 2018, Google has accomplished them.
At the heart of Google Duplex is a recurrent neural network (RNN) built using TensorFlow Extended (TFX). An RNN is a kind of neural network commonly used for tasks like handwriting recognition and speech recognition. TFX is Google's end-to-end machine learning platform, which leverages the company's TensorFlow software library. There is a lot at work here, and I'll try to break it down into bite-sized pieces.
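To make the "recurrent" part concrete, here is a minimal sketch of a recurrent sequence model in TensorFlow, the library TFX builds on. The vocabulary size, layer sizes and inputs are placeholders of my choosing; Google has not published Duplex's actual architecture.

```python
# A minimal recurrent sequence model in TensorFlow/Keras.
# All sizes below are illustrative, not Duplex's real design.
import tensorflow as tf

VOCAB_SIZE = 10_000   # hypothetical token vocabulary
EMBED_DIM = 128
HIDDEN_UNITS = 256

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),  # a variable-length token sequence
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # The recurrent layer carries a hidden state across time steps,
    # letting each prediction depend on everything said so far.
    tf.keras.layers.LSTM(HIDDEN_UNITS, return_sequences=True),
    # Predict the next token at every step of the conversation.
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

The key property is that hidden state: unlike a plain feed-forward network, the model's output at each step is conditioned on the whole sequence so far, which is exactly what a conversation demands.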
How can a machine speak like a human?
Remember Google's automatic speech recognition, the feature that engages every time you ask Google Assistant to do something? The neural network in Duplex starts with that. When listening, it takes into account the history of the conversation, the parameters of the conversation (such as the details of the booking) and other features of the audio. The model is trained separately for each task, but the learnings are shared across tasks so that each part of the model can learn from the others. On top of that, Google uses TFX to further optimise the model and improve the outcome.
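As a rough illustration of how those signals might feed one model with per-task heads, here is a hedged sketch in TensorFlow's functional API. Every input name and dimension is an assumption made for illustration, not Duplex's real design.

```python
# Hypothetical sketch: combine audio features, conversation history and
# task parameters in one network, with a shared trunk and per-task heads.
import tensorflow as tf

audio_feats = tf.keras.Input(shape=(None, 80), name="audio_features")   # e.g. spectrogram frames
history = tf.keras.Input(shape=(None,), dtype="int32", name="history")  # token ids of prior turns
task_params = tf.keras.Input(shape=(16,), name="task_parameters")       # encoded booking details

# Encode each stream, then merge them into a single representation.
audio_enc = tf.keras.layers.LSTM(128)(audio_feats)
hist_enc = tf.keras.layers.LSTM(128)(tf.keras.layers.Embedding(10_000, 64)(history))
merged = tf.keras.layers.Concatenate()([audio_enc, hist_enc, task_params])

# A shared trunk lets the separately trained task heads learn from one another.
trunk = tf.keras.layers.Dense(256, activation="relu")(merged)
restaurant_head = tf.keras.layers.Dense(10_000, activation="softmax", name="restaurant")(trunk)
haircut_head = tf.keras.layers.Dense(10_000, activation="softmax", name="haircut")(trunk)

model = tf.keras.Model(
    inputs=[audio_feats, history, task_params],
    outputs=[restaurant_head, haircut_head],
)
```

The shared trunk is the point: whatever the restaurant model learns about, say, confirming a time also benefits the haircut model, which matches the "learnings are shared across tasks" idea described above.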
Once the AI is able to understand the nuances of natural speech, what's left is for it to speak naturally. For that, Google uses a combination of its standard text-to-speech (TTS) engine and its newer Tacotron and WaveNet technologies. Late last year, Google's AI labs introduced the second generation of this TTS pipeline: Tacotron 2 predicts a visual representation of audio frequencies (a spectrogram), which WaveNet, a product of Google's DeepMind, turns into speech that sounds like a normal human being.
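For the curious, that "visual representation of audio frequencies" is a mel spectrogram. Here is a short sketch that computes and plots one with the librosa library; the audio file name and parameter values are placeholders.

```python
# Compute and display a mel spectrogram, the intermediate representation
# that Tacotron-style TTS models predict. File name is a placeholder.
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("utterance.wav", sr=22050)           # hypothetical audio clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=mel.max())          # log scale, as TTS models use

librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.title("Mel spectrogram")
plt.colorbar(format="%+2.0f dB")
plt.show()
```

Predicting this image-like representation first, and then turning it into a waveform, is what lets the system produce speech with human-sounding emphasis and intonation.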
Now, taking pauses while speaking is a very normal human thing to do. To imitate that, Google fills the time the AI needs to process speech with the usual disfluencies ('hmm's and 'aah's), which gives the AI the latencies a listener expects.
But there are times when latency is not wanted. When someone on the other end says 'hello', an immediate response is expected, and pausing before answering is counterproductive. In such cases, the AI relies on faster models with a lower degree of confidence. Under certain circumstances it doesn't even use the RNN, and instead uses faster approximations that are more error-prone, coupled with more hesitant responses, much as a person would respond if they didn't fully understand the other party. That allows Google to keep latency under 100 ms in those situations.
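Here is a hypothetical sketch of that trade-off: prompts that demand an instant reply go down a fast, lower-confidence path, while the slower full model is masked with a filler. All the functions and heuristics below are invented for illustration.

```python
# Toy illustration of latency-aware response selection.
import random
import time

FILLERS = ["hmm...", "uh...", "let me see..."]

def fast_model(prompt: str) -> str:
    # Stand-in for a low-latency approximation (higher error rate).
    return "Hello!"

def full_model(prompt: str) -> str:
    # Stand-in for the full RNN pass (slower, more accurate).
    time.sleep(0.3)  # simulate model latency
    return "We'd like a table for four at 7pm."

def needs_immediate_reply(prompt: str) -> bool:
    # Toy heuristic: greetings expect a response in well under 100 ms.
    return prompt.strip().lower() in {"hello", "hello?", "hi"}

def respond(prompt: str) -> str:
    if needs_immediate_reply(prompt):
        return fast_model(prompt)      # skip the slow path entirely
    print(random.choice(FILLERS))      # mask the wait with a natural filler
    return full_model(prompt)

print(respond("hello?"))
print(respond("What time would you like the reservation?"))
```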
The system is no doubt ingenious, but it isn't foolproof. Really complex conversations are still a challenge, so Google has built a self-monitoring ability into the AI. When it faces a conversation it can't complete, it hands the call off to a human operator who finishes the task. But that's a stopgap. The system still would not pass the Turing Test, devised by computer scientist Alan Turing to judge whether a machine is indistinguishable from a human being.
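In code, that self-monitoring fallback might look something like the sketch below. The confidence threshold and stub functions are my assumptions; Google hasn't published how the handoff is actually triggered.

```python
# Hypothetical self-monitoring handoff to a human operator.
CONFIDENCE_THRESHOLD = 0.75  # invented cut-off, purely illustrative

def model(utterance: str):
    # Stand-in returning (reply, confidence); a real system would score
    # its own understanding of the conversation here.
    if "loyalty card" in utterance:
        return ("", 0.2)   # a complex request it can't handle
    return ("Sure, 7pm works.", 0.95)

def human_operator(utterance: str) -> str:
    return "[escalated to a human operator]"

def handle_turn(utterance: str) -> str:
    reply, confidence = model(utterance)
    if confidence < CONFIDENCE_THRESHOLD:
        # Too risky to continue autonomously; hand the call off.
        return human_operator(utterance)
    return reply

print(handle_turn("Can we do 7pm?"))
print(handle_turn("Do you take the loyalty card discount for groups?"))
```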
Will Skynet be real?
The Google Duplex technology, as announced by Sundar Pichai on stage, is brand new. It's more of an experiment at this point: Google will first conduct extensive tests within the Google Assistant starting this summer. Lucky users in select areas will be able to make restaurant reservations, schedule appointments and check whether a business is open.
The potential of Google Duplex extends far beyond making reservations and booking appointments, though. The technology is being compared to Skynet, the sentient AI from the Terminator movies that could hold a natural conversation. It could easily be misused to impersonate a real person for malicious ends. It could also give way to machine-to-machine communication in natural language, which would set off a whole new digital revolution. The prospects of a machine learning to speak naturally like a human being are far-reaching; there are useful scenarios as well as ways it can be misused. But one thing is undeniable: it's one of the most exciting announcements we have heard in a long, long time.
Fiction has long been a point of inspiration for the technology we use today. 2001: A Space Odyssey showed us the tablet, and for the past few years we have been using the iPad. Science fiction films like I, Robot, The Matrix and Ex Machina imagine machines taking over the world. With what Google announced at I/O 2018, have we started moving in that direction?