It’s been exciting to watch the explosion of voice interface technology and natural language processing in recent years. The trend is largely driven by the latest voice assistants: Alexa (Amazon), Google Assistant, Siri (Apple), Tmall Genie (Alibaba), and Samsung’s Bixby, with more new devices being released every year. These devices have a surprising amount of processing power built into them, and they’re also backed by cloud software that processes speech and understands intent remarkably well. They’re enabling human-machine interactions that more closely mimic human-to-human communication than we’ve ever seen in the mainstream before. Intuitive interfaces like these are crucial building blocks of the future smart home (not to mention other spaces like factories and hospitals) where technology blends into our surroundings, improving our safety and productivity with minimal disruption. But the state of the art today doesn’t get us all the way there. We’ll focus on one of the major impediments, the wake word, and how recent advances in technology are allowing us to improve user experiences beyond it.
Why Do We Need Wake Words?
Cloud-based platforms have sufficient computational power to perform sophisticated natural language processing (NLP)—that is, figuring out what we mean even though we could have said it in many different ways. The embedded processors on the device, meanwhile, detect speech that is intended for the device; otherwise, the device would have to continuously stream audio data to the cloud, which is problematic for both bandwidth and privacy reasons.
While most devices support a manual indication that the user is about to speak a command, such as a physical button press, today’s embedded processors are powerful enough to detect a “wake word” amidst other conversation or background noise: “hey, Siri,” “Alexa,” “hey, Google,” or “天猫精灵” (“tiān māo jīnglíng”). This has worked quite well for major voice assistant platforms, as using pre-chosen names to address the device makes it clear to us as users that our commands and requests are being handled by our brand of choice.
However, addressing a device by a name associated with the institution behind it doesn’t particularly resonate with the ways humans are accustomed to addressing each other.
What’s Wrong With Using a Wake Word?
In human-to-human communication, there are myriad other ways we might signify whom (or what) we’re targeting with our speech. To figure out who (or what) is the most likely intended recipient, we use both the content of the speech, and contextual information. For example, we ask ourselves: who’s the speaker facing, gesturing towards, and/or looking at? What current ongoing conversations, and memory of past conversations, have the speaker and possible recipient engaged in? What is the speaker’s tone of voice? We might also make judgements about how we interpret somebody’s speech based on an enormous possible amount of background information such as general knowledge about the speaker (gender, age, personality, preferences, schedule, appearance, etc…) and outside information (location, immediate surroundings, time of day, weather, current news, …).
Of course, addressing others by their name is high on this list, and giving names to individual devices will continue to be a major interaction enabler—but in the future, it won’t be required. Our human brains are well adapted to figuring out the intended recipient of speech given a vast possible array of contextual priors. It will be a long while before technology will be able to emulate this capability, especially on our devices’ embedded processors. However, that doesn’t mean recent developments in technology can’t move us forward.
For product developers, the major voice assistant players are starting to provide easy-to-integrate hardware and software development kits which allow control of other devices via their platforms (e.g. “Alexa, turn on the living room light”). It certainly feels natural to address a whole-home assistant in this manner.
However, these off-the-shelf solutions have drawbacks for bespoke products. For one, name-like wake words tend to “anthropomorphize” the device, leading the user to expect that it has the ability to provide human-like interaction, which may not be the case for a simpler user interface. That said, we’re excited by new technologies that are enabling the possibility of a more custom user experience.
Customizing Wake From Speech
Today’s available voice assistants require us to use a single predefined wake word, or to select from a small number of choices. Building a device with a custom wake word, maybe one that is more appropriate for its function or brand, generally requires a difficult process of training a computational model on hundreds of thousands of samples of that wake word. How can we make it much easier to generate new wake words, whether pre-defined for a new device or assigned by users as their own “names”?
There are several groups doing research into this problem, and some companies have sprung up to offer custom wake word detection platforms in recent years, like Snips.ai, Mycroft.ai, Nowspeak, and PicoVoice, in addition to more established players like Sensory and Nuance Communications. Any one of these might be a good fit for a particular project’s requirements, so we are working to evaluate their latest offerings and track how they incorporate the latest research. We are also closely following the development of systems that can train new wake words from a very limited number of samples from a user, some of which are even capable of training their models without communicating with the cloud.
The second challenge of custom wake words is efficiently implementing them on a device. This can be relatively straightforward using pre-specified hardware reference designs for devices that don’t have constraints of power consumption or physical size. For projects that do have such constraints, there are hardware system designs and algorithm architectures that make the most efficient use of small, low-power processors. When feasible, we can also explore recently-released specialized coprocessors for running deep neural networks, or even custom ASIC designs for voice processing.
Wake From Visual Cues
Last year, in an exploration of gesture tracking for visual context, we developed a concept robot, Gerard, which brings artificial intelligence closer to real human interaction. Since then, we have been working on proofs of concept for “visual attention tracking,” that is, understanding where in space people are looking at any given time. While this has many potential applications, replacing wake words is one of the most interesting. Imagine, instead of saying “hey, Alexa, what is the weather,” you could simply look at your Echo and say “what’s the weather?” Such a system can also provide the location of the person looking at the device, so that a directional microphone could filter out all but the speech being produced by that person (as opposed to others who might be speaking nearby).
In conjunction with low-power architectures for custom wake word detection (as discussed above), we’re working on a proof of concept of a system in our Natural UI Lab which can recognize faces, find eyeballs, and determine gaze direction relative to a camera.
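As a rough illustration of the last step, here is a minimal sketch of how a gaze estimate might be turned into a "wake" decision. It assumes an upstream vision pipeline (not shown) already supplies an eye position and a gaze direction in the camera's coordinate frame. The function name, the 10-degree tolerance, and the choice to place the device at the origin are all assumptions made for the example, not details of our proof of concept.

```python
import math

def is_looking_at_device(eye_pos, gaze_dir, device_pos=(0.0, 0.0, 0.0),
                         max_angle_deg=10.0):
    """Return True if the gaze ray points at the device within a tolerance.

    eye_pos and gaze_dir are 3D tuples in the camera frame (hypothetical
    outputs of a face and eye tracker). max_angle_deg is an assumed tolerance.
    """
    # Vector from the eye to the device.
    to_device = [d - e for d, e in zip(device_pos, eye_pos)]
    dist = math.sqrt(sum(c * c for c in to_device))
    gaze_len = math.sqrt(sum(c * c for c in gaze_dir))
    if dist == 0.0 or gaze_len == 0.0:
        return False  # degenerate input: no meaningful direction

    # Angle between the gaze direction and the eye-to-device direction.
    cos_angle = sum(a * b for a, b in zip(to_device, gaze_dir)) / (dist * gaze_len)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg
```

A device could open its microphone only while this check holds, and the same eye position could be handed to a directional microphone as the direction to listen in.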
That said, what if your voice assistant understood when there is only one person in the room? In this case, it is cumbersome to require attention in the first place, since it can be assumed that you are addressing your device with any speech—especially if we employ current technologies which can easily distinguish commands via sentence structure, or from tone of voice.
Wake From Conversation
Many years from today, voice interfaces will be able to understand when they are being addressed using true natural language understanding and a full complement of contextual cues, but this level of AI is still in its infancy and will require a level of processing power not currently available in an embedded system. In the meantime, there are some simpler techniques that a custom device might employ.
One possibility arises if a device’s interface is simplified such that it can only accept a very simple set of voice commands, and not provide a full conversational interface. This is the case for voice interfaces that are embedded inside a custom device with a particular function—for example, the set of commands to control a coffee pot is relatively limited, whereas Siri is intended to control any number of apps and home devices. In this situation, it is feasible for the device to simply listen for valid commands (and possibly some appropriate cadence of non-speech surrounding them) without needing to send data to the cloud for processing.
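To make the idea concrete, here is a minimal sketch of command-only listening. It assumes an on-device recognizer (not shown) produces a transcript for each utterance along with the silence measured before and after it. The command phrases, the command IDs, and the 0.4-second silence threshold are hypothetical values chosen for illustration.

```python
import re

# Hypothetical command set for a coffee maker; phrases and IDs are illustrative.
COMMANDS = {
    "start brewing": "BREW_START",
    "stop brewing": "BREW_STOP",
    "keep warm": "KEEP_WARM",
}

def normalize(text):
    """Lowercase the text and collapse punctuation and extra whitespace."""
    return " ".join(re.sub(r"[^a-z' ]+", " ", text.lower()).split())

def match_command(transcript, silence_before_s, silence_after_s,
                  min_silence_s=0.4):
    """Return a command ID, or None if the utterance should be ignored.

    The utterance must be exactly one known command and must be bracketed
    by pauses, so that a command phrase embedded in ordinary conversation
    does not trigger the device.
    """
    if silence_before_s < min_silence_s or silence_after_s < min_silence_s:
        return None
    return COMMANDS.get(normalize(transcript))
```

Requiring an exact match plus surrounding pauses is a deliberately conservative choice: it trades some flexibility in phrasing for fewer accidental activations, which matters when there is no wake word to act as a gate.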
Another example: it shouldn’t be necessary to use a wake word to follow up on a previous interaction. A user should be able to say “Hey Google, what time is it?” and upon receiving the answer, say “ok, what’s the weather today?” In fact, Google has recently rolled out this capability under the name “continued conversation.” Similarly, by relying on speaker identification (is it the same user speaking?) and assumptions about the cadence and timing of a back-and-forth conversation, a custom device could easily determine that the follow-up utterance was intended for it.
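The follow-up logic for a custom device can be sketched as a small gate. This is not how Google's feature is implemented; it is a simplified illustration that assumes a speaker-identification component (not shown) supplies a speaker ID for each utterance, and it uses an arbitrary 8-second follow-up window. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FollowUpGate:
    """Decides whether an utterance is addressed to the device."""
    window_s: float = 8.0                      # assumed follow-up window
    last_speaker: Optional[str] = None         # who the device last answered
    last_response_end: Optional[float] = None  # when that answer finished

    def record_response(self, speaker_id, now):
        """Call when the device finishes answering a given speaker."""
        self.last_speaker = speaker_id
        self.last_response_end = now

    def accepts(self, speaker_id, now, has_wake_word):
        """Accept wake-word utterances always, and follow-ups conditionally."""
        if has_wake_word:
            return True
        if self.last_response_end is None:
            return False  # no prior interaction to follow up on
        same_speaker = speaker_id == self.last_speaker
        in_window = 0.0 <= now - self.last_response_end <= self.window_s
        return same_speaker and in_window
```

For example, after `gate.record_response("alice", 10.0)`, a wake-word-free utterance from "alice" at time 14.0 would be accepted, while the same utterance from another speaker, or from "alice" well after the window closes, would be ignored.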
The Future of Voice UI
It’s an exciting time to be thinking about how voice will change user experience for smart home devices, medical devices and hospital equipment, and industrial machinery. Custom device makers are just starting to tap into the possibilities of enabling features that are driven by voice interfaces. These new experiences begin with the ability of devices to understand when they are being addressed, and although there’s a handful of alternatives to the currently-standard wake words, there are plenty more possibilities to explore!