July 29, 2019
A voice assistant is software that can understand and respond to commands spoken in natural language. Voice assistants are also called smart assistants, which may be a more accurate description because in many cases they can also be used through text chat. They are also known as bots.
In recent years, adoption of voice assistants has taken off, especially in the form of voice-activated home assistants such as Alexa and Google Home.
These products allow users to command software to do things just with their voices. For example, a user can play music on Spotify or play a video on YouTube just by commanding the smart voice assistant to do so.
The personal assistant device was made possible by breakthroughs in AI, specifically in an area called natural language processing.
Natural language processing (NLP) is a technology that enables computers to understand the intention behind a phrase. This is different from speech recognition, which transcribes spoken words to text. Voice-controlled digital assistants need both: speech recognition transcribes the spoken words to text, and natural language processing determines the user's intention behind the text.
Natural language processing is important and useful because humans instruct voice assistants using different phrases that have the same meaning. For example, they could say “Play X on YouTube”, “Please find X on YouTube and play it”, or “On YouTube please play song X”.
NLP can detect that all these phrases have the same meaning. Beyond allowing voice-only interaction, this is useful because people don’t need to remember an exact command or syntax to operate the device. NLP is also surprisingly easy for developers to set up, which is why it is an important part of any bot framework.
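To make the idea concrete, here is a minimal sketch of how several phrasings can be mapped to a single intent. It is a toy, rule-based stand-in for illustration only: real NLP engines typically use trained models rather than hand-written patterns, and the names used here (`PLAY_PATTERNS`, `normalize`, `detect_intent`, the `play_media` intent) are hypothetical and do not come from any particular bot framework.

```python
import re

# Hypothetical hand-written patterns for one "play_media" intent.
# A real NLP engine would learn these variations from example phrases.
PLAY_PATTERNS = [
    re.compile(r"^(?:please\s+)?play\s+(?:song\s+)?(?P<title>.+?)\s+on\s+youtube$"),
    re.compile(r"^(?:please\s+)?find\s+(?P<title>.+?)\s+on\s+youtube\s+and\s+play\s+it$"),
    re.compile(r"^on\s+youtube\s+(?:please\s+)?play\s+(?:song\s+)?(?P<title>.+)$"),
]


def normalize(text):
    """Lowercase the transcript, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def detect_intent(transcript):
    """Return the intent and extracted title for a transcribed phrase."""
    cleaned = normalize(transcript)
    for pattern in PLAY_PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return {
                "intent": "play_media",
                "service": "youtube",
                "title": match.group("title"),
            }
    return {"intent": "unknown"}


if __name__ == "__main__":
    for phrase in [
        "Play Yesterday on Youtube",
        "Please find Yesterday on Youtube and play it",
        "On Youtube please play song Yesterday",
    ]:
        print(phrase, "->", detect_intent(phrase))
```

All three phrasings resolve to the same intent and the same title, which is the property that lets users speak naturally instead of memorizing a command syntax. The input to `detect_intent` is assumed to be text already produced by speech recognition.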
As anyone who has actually tried to use a voice assistant will tell you, they are good for some things but are not perfect. You cannot have a human-like conversation with them, for example; the conversation will quickly break down if you try.
It’s also difficult to find out what they can or can’t do just by interacting with them. Voice, it turns out, is a poor interface for quickly retrieving a lot of information. Scanning a web page, for example, is a much better way of getting information quickly.
What they are very good at is one-off commands or questions. They work especially well when the user knows exactly the outcome they want (for example, playing a specific video on YouTube whose name they know), and when the answer to a question is a simple phrase, such as the answer to “What is the temperature in my city?”.
We often forget that voice assistants are simply another software interface. We call them assistants because you can speak to them, and it is therefore easy to think of them as having some sort of human-like quality. This idea is further reinforced by the fact that we have to call them by name with a hot word (“Hey Google”, “Alexa”, “Siri”) to activate them. Without a hot word they wouldn’t know when they are being spoken to, and therefore when to respond. The hot word does brainwash us into thinking of the voice assistant as a kind of thinking, almost human assistant rather than as a software interface. And it brainwashes young kids into believing that Google or Alexa is some kind of deity, which might do them some lasting damage when they discover that these are world-dominating corporations.
In reality, voice assistants are just another software interface, equivalent, for example, to a graphical interface. A graphical interface performs a similar role to a voice interface, but it cannot be humanized in the same way.
Voice interfaces are, of course, used differently from graphical interfaces. In practice, voice interfaces are normally added on top of graphical interfaces, but not the other way round.
This is partly because graphical interfaces have already been built for most applications, so adding a voice interface gives users another way of interacting with the software, such as asking a voice assistant to play a YouTube video. You could play the video using the graphical interface, but it would be slower.
It is also arguable that the graphical interface is more complete than a voice interface as it would be very difficult to do some tasks using voice that can be easily done on a graphical interface. To understand this point imagine trying to get your colleague to build a spreadsheet for you by giving them instructions over the phone versus building the spreadsheet yourself using the graphical interface.
While voice interfaces are usually not indispensable, they do provide a new level of convenience in certain situations. This is usually convenience you can live without if necessary except in the rare circumstances where hands-free interaction is essential.
Given their limitations, the question is whether voice assistants are going to become more important in the future or whether they will remain a fringe product.
It is clear to us that voice assistants are going to become much more popular and widely used in the future for one reason: they are going to be fully integrated with graphical user interfaces.
While it’s hard to replace graphical user interfaces with voice, it is very feasible to combine a voice and graphical interface. This is being done to a very limited extent right now with Google Assistant (which allows a web page to provide context) and Bixby.
The next generation of interfaces that we shall call “combination” interfaces will integrate graphics, text and voice into the best experience for the user. Not only will this allow users to accomplish tasks faster and with less of a learning curve (because voice allows users to interact with software without knowing exact commands) but AI monitoring the interactions will allow the interfaces to evolve and get better on their own.
A voice instruction that works one way when the app is first launched will work differently once the app has learned from thousands of interactions what the best course of action is.
It is also interesting to consider how for voice to be fully adopted there will need to be a change in user behavior. Right now people type text and use graphical interfaces on their smartphones far more than they speak into their phones.
This is because voice recognition technology is not perfect. For decades there have been voice shortcuts on phones and computers, but these shortcuts have not been widely used because the error rates were so high that, once the novelty had worn off, the pain of correcting errors outweighed the benefit of the convenience.
Imagine if voice recognition were perfect and made no errors.
In this case, it would be far quicker for people to “type” an email, for example, using voice than by typing on their smartphone. Once this critical point is reached, voice will be ubiquitous for these types of tasks.
For bots to take off, both the NLP and the voice recognition technology need to operate at a high level. While voice recognition works very well already, NLP, as we have discussed, only works well for narrow domains.
The interesting point here is that voice recognition also works much better in narrow domains, for an obvious reason: there are far fewer possible words that the user could be saying.
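A rough way to see why a narrow domain helps is to treat the domain's vocabulary as a constraint on the transcript. The sketch below is only an illustration of that idea, not how production speech recognizers work (they apply such constraints inside the recognition model itself). It uses Python's standard `difflib.get_close_matches` to snap slightly misrecognized words to the closest word in a small, hypothetical media-control vocabulary; `DOMAIN_VOCABULARY` and `snap_to_vocabulary` are names invented for this example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary for a narrow "media control" domain.
DOMAIN_VOCABULARY = [
    "play", "pause", "stop", "next", "previous", "volume", "up", "down",
]


def snap_to_vocabulary(transcript, vocabulary=DOMAIN_VOCABULARY, cutoff=0.6):
    """Replace each word with its closest in-domain word, if one is similar enough.

    Words with no sufficiently close match are left unchanged.
    """
    corrected = []
    for word in transcript.lower().split():
        matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


if __name__ == "__main__":
    # Simulated recognition errors in a noisy transcript.
    print(snap_to_vocabulary("volum up"))
    print(snap_to_vocabulary("paus"))
```

With only a handful of candidate words, a near miss is easy to resolve. With an open vocabulary of many thousands of words, the same near miss would have many plausible corrections, which is the intuition behind the narrow-domain advantage.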
This means that we are already at the point of being able to create chatbots that are almost perfect in a narrow domain. Just listen to the Google Duplex demos.
This will lead to extremely quick adoption of voice once discovery and related issues are solved.
The idea is that voice will be the first port of call when someone needs to do something.
In a voice-first world, devices will become more invisible, as people will only need to look at them for tasks they cannot do by voice.
People won’t just have one device in their living room; they will have a cheap voice device in every room. These devices will be connected to each other, to IoT devices and to smartphones and computers. Some of these devices may be able to project images on the walls.
People will be able to ask questions or give commands while they are in the shower or brushing their teeth. They won’t have to remember things to tell the voice bot downstairs.
There will be much better ways of discovering functionality and “training” humans on how to efficiently use the bots.
While there are many problems with voice devices right now, most of these problems are to do with how they are being used rather than with the underlying technology. We believe that, in a short period of time, the killer apps for voice are going to emerge, and this will be a game-changing event for the way that software is used. This will also require some standardization of voice technologies and protocols, but these are obstacles that won’t impede progress for long.
We look forward to a world of ultimate convenience where voice devices are ready to help at almost any place or time.
Disclaimer: We encourage our blog authors to give their personal opinions. The opinions expressed in this blog are therefore those of the authors. They do not necessarily reflect the opinions or views of Botpress as a company.