“Send e-mails and instant messages, surf the web, chat and create documents, all by speaking. With (name of software), you’ll be faster than ever before, and have more fun with your PC.”
Sure it’s fun. Speak, and watch the words appear in, say, MS Word. It’s fun at first, but then you realise you need to train the software if you want to get any work done. OK, so you train the software a little, by reading out the pages of text it presents to you (so it can get used to the nuances of your voice). Then you try it again, and it’s a little better, but it’s already less fun. The software urges you to train it some more, and you go ahead. After a lot of training, you think, “I can finally speak to my computer!”, and happily sit down to compose an e-mail by speaking.
Only to find you need to reach for the keyboard roughly once every 10 seconds to correct its mistakes. Mic to keyboard and back to mic. It’s no fun any more. It doesn’t work. You repackage the software (without making a copy of it, of course) and sell it on eBay. End of story.
Why?
Think about it: the sheer convenience of being able to yack away and have it all appear in an application. If these products were that good, why doesn’t everybody use them? Have you personally seen anyone using dictation software on a regular basis? Probably not. So why doesn’t it work? If that’s too harsh a question, let’s put it this way: why hasn’t voice (or speech) recognition software become as popular as it should have, given its potential?
The answer, quite simply, is that speech recognition is very difficult to get right.
Unnaturally Speaking
While speaking into your mic using dictation software, you must not cough; it’ll show up as a word. You can’t say “umm…” or “errr…” because it’ll either show up as a word, or modify the word just before it. You can’t speak too fast. You can’t speak too slowly. You need to first find a sweet spot for your pace, which is difficult, and then stick to it, which is even more difficult. You can’t use the software if your spoken English varies too much from what is considered standard English. Then there’s ambiguity: trivial examples like “I scream” and “ice-cream” apart, did you say “eye” or “I”? “Right” or “write”? And you can’t speak out people’s names or other proper nouns: they won’t be recognised.
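How would software even begin to decide between “right” and “write”? Broadly, by looking at the neighbouring words and asking which candidate is more likely in that company. Here’s a toy sketch of the idea in Python; the word-pair counts are invented purely for illustration and don’t come from any real product.

# Toy homophone disambiguation: pick whichever candidate word is more
# likely to follow the previous word, using made-up word-pair counts.
BIGRAM_COUNTS = {
    ("turn", "right"): 120, ("turn", "write"): 1,
    ("please", "write"): 90, ("please", "right"): 4,
}

def pick_homophone(previous_word, candidates):
    # Return the candidate seen most often after previous_word.
    return max(candidates, key=lambda word: BIGRAM_COUNTS.get((previous_word, word), 0))

print(pick_homophone("turn", ["right", "write"]))    # right
print(pick_homophone("please", ["right", "write"]))  # write

Real dictation engines do something along these lines with enormous statistical language models, which is why they can cope with “turn right” yet still stumble over less predictable words, such as people’s names.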
Also, there’s the huge issue of pauses: if you pause for too long, even without an “umm” or “errr”, the software gets thrown off. And what about sentence structure? When does one sentence end and the next begin? The software needs to figure this out from your pace, and this isn’t taken care of during the training! Then, punctuation! Try saying “then, punctuation!” The software simply cannot place that comma and that exclamation mark where they should be!
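To see why pauses matter so much, imagine the software receiving a stream of recognised words along with the silence that preceded each one, and having nothing better than a timing threshold to decide where a sentence ends. The sketch below takes exactly that naive approach; the words, timings and the 0.6-second threshold are all made up for illustration.

# Naive sentence splitting based only on the pause before each word.
# Each item is (word, seconds_of_silence_before_it); the 0.6 s
# threshold is an arbitrary assumption for this example.
PAUSE_THRESHOLD = 0.6

def split_into_sentences(words_with_pauses):
    sentences, current = [], []
    for word, pause in words_with_pauses:
        if current and pause > PAUSE_THRESHOLD:
            sentences.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        sentences.append(" ".join(current))
    return sentences

stream = [("it", 0.0), ("works", 0.1), ("sometimes", 0.9), ("it", 0.2), ("fails", 0.1)]
print(split_into_sentences(stream))  # ['it works', 'sometimes it fails']

Pause a moment later and the very same words split as “it works sometimes” and “it fails”: an unplanned hesitation is all it takes to change the meaning of the output.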
One would probably also need to change one’s accent (a drastic measure, to say the least) for commercial speech recognition software to get what you’re saying. True, localised versions could be made available, but how many? The costs would be prohibitive.
We could go on, but we’ve made our point: as of today, you just can’t speak naturally to your software. And if you must, it’ll be so slow that you’ll be better off with a keyboard.
I’m Unique, So Are You
If you’ve read this far, you’ll have realised that the entire problem lies in the fact that no two people speak or pronounce words the same way. Designing a system for a single, known speaker would be comparatively easy; catering to everyone is the hard part.
How does speech recognition work? First, the phonemes need to be recognised (phonemes are the basic sounds, such as the “ou” sound in “sound”). People don’t even say these basic sounds the same way! Then, the phonemes need to be strung together to form words. There’s a problem at this stage as well: think about what happens when you say “meet to” quickly. The two “t” sounds merge into one, so it can come out sounding like “me too.” After this, the words need to be strung together into sentences. Here’s where the gap problem comes in: in earlier systems, where you spoke one word at a time, this wasn’t an issue, but in continuous speech, different people pause between sentences in different ways.
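To make those stages a little more concrete, here’s a heavily simplified sketch in Python of the middle step: turning phonemes into words with a tiny pronouncing dictionary. The phoneme symbols and the dictionary are invented for illustration; a real recogniser works with probabilities and vastly larger models at every stage.

# Toy phoneme-to-word decoder: list every way a phoneme stream can be
# carved up into dictionary words. The phoneme symbols and the tiny
# dictionary are illustrative assumptions, not a real lexicon.
PRONOUNCING_DICT = {
    ("m", "ii", "t"): "meet",
    ("m", "ii"): "me",
    ("t", "uu"): "to",    # "to", "too" and "two" all sound alike
    ("uu",): "ooh",
}

def decodings(phonemes):
    # Return all word sequences that exactly cover the phoneme list.
    if not phonemes:
        return [[]]
    results = []
    for length in range(1, len(phonemes) + 1):
        key = tuple(phonemes[:length])
        if key in PRONOUNCING_DICT:
            for rest in decodings(phonemes[length:]):
                results.append([PRONOUNCING_DICT[key]] + rest)
    return results

print(decodings(["m", "ii", "t", "t", "uu"]))  # [['meet', 'to']]
print(decodings(["m", "ii", "t", "uu"]))       # [['me', 'to'], ['meet', 'ooh']]

Spoken carefully, “meet to” keeps both “t” sounds and decodes cleanly; spoken quickly, the merged stream is genuinely ambiguous, and the software has to guess.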
Again, we could go into the technicalities and explain why it’s so difficult, but suffice it to say that it is!
The only way out is true personalisation, and to really personalise the software, one could conceivably employ sophisticated neural networks, which can learn from experience. These networks would have to work similarly to the way the human brain understands speech, which is a very, very complex affair. And now, we’re again talking about the day when we’ll have true AI (Artificial Intelligence). But that’s the direction some researchers are moving in. As we said, it’s in the future, and it looks like it always will be, at least until there’s a breakthrough in AI!
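To give a flavour of what “learning from experience” means, here is about the simplest possible version of the idea: a single artificial neuron that nudges its weights every time the user corrects it. The two “acoustic features” and the labels below are invented for this sketch; a real speech engine would use far larger networks trained on real audio.

# A minimal "learning from experience" sketch: one artificial neuron
# (a perceptron) adjusting its weights from corrections. The features
# and labels are made up; this is not a real speech model.
def train(samples, epochs=20, learning_rate=0.1):
    # samples: list of (feature_vector, label) pairs, with label 0 or 1
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            prediction = 1 if activation > 0 else 0
            error = label - prediction  # the user's "correction"
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias

def predict(weights, bias, features):
    return 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else 0

# Pretend two invented features (say, pitch and duration) distinguish
# this particular speaker's "right" (label 1) from "write" (label 0).
speaker_samples = [([0.9, 0.2], 1), ([0.8, 0.3], 1), ([0.2, 0.9], 0), ([0.1, 0.8], 0)]
weights, bias = train(speaker_samples)
print(predict(weights, bias, [0.85, 0.25]))  # 1, i.e. "right" for this speaker

The point isn’t the arithmetic; it’s that every correction personalises the model a little more, which is the direction today’s training sessions are groping towards.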