The last couple of years have seen a huge emphasis put on voice control interfaces. From Apple’s Siri to Google Now and the upcoming Google Glass project, it seems that the future is definitely going to be a louder one. There’s no doubt that as these technologies mature they will become a central part of our interaction with devices, but they still have a fair way to go in terms of accuracy. Siri can be frustrating to use, especially if you have a heavy accent or even a cold. The anguish that the system can induce is wonderfully highlighted in the NSFW YouTube video ‘Apple Scotland – iPhone commercial for Siri’, which features a Scotsman trying in vain to ask Siri for eating advice. The main challenge for the voice deciphering code is that it has to contend with many different factors while interpreting a user’s input. One company that has been working on overcoming these challenges on the desktop is Nuance, whose Dragon Dictate software is one of the most advanced in the industry. I spoke with them recently to discover just what it takes to write a voice interface that we can actually use.
‘Speech recognition is an extraordinarily hard computational problem,’ explains Nuance’s Neil Grant. ‘Effectively you’ve got an astronomical search space. An example would be if you had a seventeen word phrase – which is an average length sentence – within a fifty thousand word vocabulary. It’s the equivalent of finding the correct phrase out of seven point six times ten to the seventy nine possibilities, roughly the number of atoms in the observable universe. Now, to put that into context, when Google does a search to find a webpage for you, it’s searching somewhere around one times ten to the twelve web pages, so significantly less.
‘If you’re typing something on a keyboard it’s very simple, it’s binary – you either hit the keystroke or you don’t. With speech there’s far more variability in terms of accents, tonality, environmental conditions, background noise, and microphone quality. One of the ways we tighten that with the desktop speech recognition is that a user has a profile attached to them so the computer understands the nuances of the way they speak. The software can apply this data to achieve higher levels of accuracy, and the more you use it and make corrections, the more it learns and then applies those learnings to your profile.’
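The seventeen word figure Neil quotes is easy to check for yourself. The short Python sketch below assumes the simplest possible model, in which any of the fifty thousand words can fill any of the seventeen positions, so the number of candidate phrases is just the vocabulary size raised to the phrase length. The function names are purely illustrative and have nothing to do with how Nuance's software actually works.

```python
# Back-of-the-envelope check of the search space quoted above.
# Assumption: every position in the phrase can hold any vocabulary word,
# so the count is vocabulary_size ** phrase_length.

VOCABULARY_SIZE = 50_000
PHRASE_LENGTH = 17
WEB_PAGES = 10 ** 12  # the rough web search figure quoted above


def search_space(vocabulary_size: int, phrase_length: int) -> int:
    """Number of distinct word sequences of the given length."""
    return vocabulary_size ** phrase_length


def scientific(n: int) -> tuple[float, int]:
    """Split a positive integer into (mantissa, exponent) in base ten."""
    exponent = len(str(n)) - 1
    mantissa = n / 10 ** exponent
    return mantissa, exponent


phrases = search_space(VOCABULARY_SIZE, PHRASE_LENGTH)
mantissa, exponent = scientific(phrases)
print(f"Candidate phrases: {mantissa:.1f} x 10^{exponent}")

# How many times larger the phrase space is than the web page count
ratio_mantissa, ratio_exponent = scientific(phrases // WEB_PAGES)
print(f"Times larger than the web: {ratio_mantissa:.1f} x 10^{ratio_exponent}")
```

The first line printed is 7.6 x 10^79, which matches the figure in the quote. A real recogniser never enumerates this space, of course; the point is only to show where the number comes from.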
This dedicated usage is a significant factor that gives Nuance software its famed levels of accuracy. It also highlights one of the challenges ahead for the mobile software that many of us currently use.
‘Something like Siri is effectively speaker-independent speech recognition,’ says Neil. ‘Now that means it’s not training a profile for you, certainly not in any great depth. You might use it on your phone then another family member might use it, so it’s dealing with potentially multiple speakers from the same device. It’s a much harder process and means it can’t set itself up in advance for a particular accent.’
Advances in noise cancelling microphones and the continued refinement of voice control software are driving rapid improvements in all areas of the technology. Nuance itself now offers iPad and iPhone versions of its software, and the continued updates to Siri and Google Voice Search will no doubt push the software even further in the years ahead. Manufacturers are also beginning to incorporate the technology into newer versions of laptops in response to the ever-encroaching influence of tablets.
‘One of the key specifications set by Intel on the new ultrabooks is embedded speech recognition’ says Neil. ‘So this is something that is absolutely coming through and what we will see is speech on these devices becoming more and more ubiquitous.’
One of the eye-catching elements of Siri that Apple aggressively markets is the system-wide integration of commands. Rather than being a stand-alone app, Siri is able to control calendar entries, send emails and tweets, update Facebook, and play specific music to you, all from the same interface. For voice control to really make an impact on everyday computers it needs to offer a similar level of depth.
‘We can get very, very deep,’ Neil continues. ‘Not only dictation capabilities but real command and control of applications like MS Office. For example a chap called Stuart Mangan, a rugby player, was involved in a tackle and broke his neck, leaving him paralysed from the neck down. We effectively voice enabled his entire PC, to give him not only his email and documents but, through Nokia PC Suite, he was able to send text messages and make phone calls. He came back to us saying that we’d given him his independence and privacy back.’
The concept of voice control has been a staple of science fiction for decades, and the representation of talking computers such as HAL in 2001: A Space Odyssey, or even Holly from Red Dwarf, has been a constant reminder of the convenience and ease with which the interface could work – so long as the computer in question will acquiesce to opening the pod bay doors when you ask. There’s no doubt that this kind of interface is now more of a possibility, but as the way we interact with our technology changes, what impact will this have on the systems of the future?
‘A mouse and a keyboard are not a natural way of interfacing with something,’ states Neil. ‘They’re a solution to a problem, and they’ve been a very successful solution, but the keyboard layout was designed to slow us down. Stephen Fry came out with a very good quote a couple of years ago where he stated it took less time to get your private pilot’s licence than it did to learn to type at sixty words per minute. So we’ve got these interfaces we’re stuck with at the moment – the keyboard and the mouse – which are fine for certain things but for others there are certainly improvements that can be made. You’re starting to see prototypes coming through, the Google Glass project for one, looking at ultra-mobility – wearable computing – and there is a necessity to change the interface. As your devices become more and more mobile you’re not going to be able to carry a keyboard around. Obviously voice is the natural step for that.’