September 18, 2019
While many in the industry might argue that the “next big thing” in tech is blockchain, AI replacing human workers or augmented reality, there’s one crucial technology that’s being underestimated: the voice user interface.
Research suggests that 50% of search queries will be done by voice by 2020. What such research understates, however, is that even small improvements in the voice UI could shift the entire human-computer interaction paradigm. This goes far beyond the search use case, toward a voice UI that replaces or deeply integrates with graphical user interfaces and apps.
The voice UI allows people to communicate with devices in natural spoken language, currently through smart speakers and assistants such as Alexa or Google Home. Speaking is fundamental to the way we get things done with other humans, and it will be fundamental to the way we get things done with computers in the future.
This is currently a fringe opinion, however.
While most tech experts would agree that voice will continue to evolve in its current niche role in the technology ecosystem, or at the very least grow incrementally as the technology improves, my prediction is that voice is the main event itself. It will come to dominate our interactions with software and devices, becoming at least as important as the graphical user interface.
As mentioned, this is not a mainstream opinion. Many industry experts see voice as still a novelty that hasn't yet achieved product-market fit. Some prominent VCs, for example, argue that until generalized artificial intelligence is achieved, voice technology will always be very niche.
Because of the many current limitations of voice assistants, it’s hard for people to imagine voice as the next wave of technology. In my view, voice today is similar to the dial-up web in the early ‘90s. Back then, the online experience was so bad that it was hard to envisage what would be possible once bandwidth improved. Leading thinkers made all sorts of predictions for the internet that look absurdly conservative in hindsight — some experts even predicted that it would have no more impact on the economy than fax machines.
People's expectations for voice are similarly conservative today, in part because of how rough the voice experience still is. The assumption is that until generalized artificial intelligence is achieved, bots will perform poorly in conversations, and that the technology will never be great until chatbots are capable of close-to-human conversation with the user. This assumption about the need for generalized artificial intelligence is flawed, however: there are ways of getting chatbots to near-human-level performance with current technology.
For general smart speaker assistants, the topic coverage is so broad that they need to be almost totally self-learning. Unfortunately, current technology isn't good enough to automatically create self-learning bots that can handle multi-turn conversations with humans. If that technology did exist, we would be able to ask follow-up questions on Google. Having smart bots build themselves is like trying to make a smartphone app build itself without any human involvement — it's simply not possible at the moment.
There is another way to achieve nearly-human level conversation with bots: drastically narrow their scope. Just like for apps, developers can create sophisticated bots for specific tasks, manually programming them to engage in meaningful conversation. It is with these sorts of bots that the breakthrough for voice will come: smart speakers, phones and other devices will host these sorts of bots, creating big opportunities for the first movers who get things right.
To intuitively grasp the difference between the experience of current voice bots and what this technology will look like in the future, we need to start by understanding why a voice device is currently the equivalent of surfing the web on a dial-up modem.
Firstly, the basic interactions with a voice bot are still very poor. You have to specifically address the device with a hot word, after which you must wait to see whether the bot was successfully activated. If it was, you need to speak after the beep at a slow but consistent speed and formulate your sentence to include all the necessary parameters — almost as if you were speaking a SQL statement. If you pause to think at any moment, the interaction fails and you have to return to the start.
Let’s look at a real-life example:
You say, “Hey, Google.”
There is a pause as you wait for acknowledgment that the device has been activated.
If it has been activated, you continue with your request:
“Play ‘Dark Horse’ by Katy Perry on YouTube, on the living room TV.”
There is another delay while the device processes what you have said.
If your request is successful, something will start to happen on your TV and the video will play.
If it’s not successful, you have to go back to the beginning and try again, perhaps with a different sentence structure, different words, or just trying to speak more clearly.
This experience is full of delays, potential errors and can take many restarts to accomplish tasks. In addition, the voice bot is not yet smart and won’t respond to related commands or queries regarding what you’re doing.
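To make that rigidity concrete, the single-shot interaction described above can be sketched as a parser that demands every parameter in one utterance and fails outright otherwise. The regular expression and slot names here are purely illustrative, not any real assistant's grammar:

```python
import re
from typing import Optional

# Illustrative sketch of today's rigid, single-shot voice command handling:
# every slot must appear in one utterance, in roughly the expected order,
# or the whole interaction fails and the user must start over.
COMMAND = re.compile(
    r"play '(?P<song>.+)' by (?P<artist>.+) on (?P<app>\w+), on the (?P<device>.+)",
    re.IGNORECASE,
)

def handle_utterance(text: str) -> Optional[dict]:
    """Return the parsed slots, or None — no partial understanding, no follow-up."""
    match = COMMAND.match(text.strip())
    if match is None:
        return None
    return match.groupdict()

# Matches only when every parameter is present:
print(handle_utterance("Play 'Dark Horse' by Katy Perry on YouTube, on the living room TV"))
# Fails entirely when slots are missing:
print(handle_utterance("Play something by Katy Perry"))
```

Anything that deviates from the expected sentence structure returns nothing at all, which is why users end up retrying the whole request with different wordings.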
The easiest way to imagine interactions with the smart bots of the future is to picture a human operator who controls the device and handles YouTube (and nothing else).
The first difference is in the speed of interaction. You could speak to the “human” operator at a normal speed, with no pauses or delays in response, and no problems if you paused while speaking. You could also reference the human operator in the middle of a sentence — for example, “I want to watch TV — you know what, Alexa, please put something on YouTube.” In fact, you may not have to say their name (the hot word) at all to get them to respond.
This human-like bot would also be flexible in terms of how they interacted with you:
You: “Alexa, I want to watch YouTube.”
Alexa: “Sure, on which TV?”
You: “On the kitchen TV — maybe something by Katy Perry.”
Alexa: “Do you have a particular song in mind?”
You: “No, what can you suggest?”
Alexa: “‘Roar,’ ‘Dark Horse’? I’ve put more suggestions on the screen.”
You: “Great, thanks. Play ‘Hot and Cold.’”
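The flexibility in the dialogue above is essentially slot filling: the bot keeps whatever the user has already said and asks only for what is still missing. Here is a minimal sketch of that behavior; the slot names, prompts, and `hear` interface are hypothetical, not a real Alexa API:

```python
# Prompts asked, in order, for any slot the user hasn't filled yet.
PROMPTS = {
    "device": "Sure, on which TV?",
    "artist": "Which artist would you like?",
    "song": "Do you have a particular song in mind?",
}

class YouTubeBot:
    """Toy multi-turn bot: accumulates slots across turns instead of failing."""

    def __init__(self):
        self.slots = {"device": None, "artist": None, "song": None}

    def hear(self, **updates):
        """Merge whatever slots the user mentioned this turn, then respond."""
        self.slots.update({k: v for k, v in updates.items() if v})
        return self.next_prompt()

    def next_prompt(self):
        # Ask for the first missing slot; act once everything is filled.
        for slot, prompt in PROMPTS.items():
            if self.slots[slot] is None:
                return prompt
        return (f"Playing '{self.slots['song']}' by {self.slots['artist']} "
                f"on the {self.slots['device']}.")

bot = YouTubeBot()
print(bot.hear())                                        # asks for the TV
print(bot.hear(device="kitchen TV", artist="Katy Perry"))  # asks for a song
print(bot.hear(song="Hot and Cold"))                     # all slots filled: plays
```

Unlike the single-shot parser, nothing is lost between turns: the user can supply parameters in any order, across as many utterances as they like.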
This is the future of bot interactions: Seamless, smooth and easy to talk to about the task or topic at hand. Imagine a vast universe of these bots with an equally vast universe of cheap, commoditized voice devices. It will be like having a human operator standing in every room and beside every device. There will still be plenty of graphical UIs, but they will be much easier to use through the bot.
Today, it’s common to see employees in places such as metro stations, airports and supermarkets providing assistance to those using self-service touchscreens — as an example, the person who helps you use the check-in machines to get your boarding pass at the airport. Imagine, however, that this person could actually directly interface with the check-in application — meaning that halfway through the check-in process, you could tell the machine that you want to change your seat from the position you originally chose, and the application would bring up the relevant screen for you — all without the help of a human assistant.
This is the future: a voice bot will be embedded in or accessible to every device or service you want to engage with, and will instantly do what you command. You’ll no longer need to whip out your phone or laptop to get something done — instead, all you’ll need to do is say aloud what you need, and everything will fall into place from there.
The move to voice will ultimately be about something as simple as convenience. In our modern world, people want to get things done fast with the least amount of hassle, and speed matters more than ever. Though most of those connected to the chatbot industry aren't yet anticipating it, those of us researching and developing the technology foresee massive implications for business operations, marketing, sales, branding, product distribution, and more. Voice is the future of technology, and we're already halfway there.
Disclaimer: We encourage our blog authors to give their personal opinions. The opinions expressed in this blog are therefore those of the authors. They do not necessarily reflect the opinions or views of Botpress as a company.