What Chatbot Authors Need to Know About the Natural Language Processing Algorithm
One common step when building a bot is defining “intents”.
An intent might be “reset my password”, “book a flight” or “contact support”. The bot developer needs to enter multiple phrases into the software that all have the same meaning as the intent. For example, “I want to fly to Paris” would be one such phrase for the “book a flight” intent.
Generally, bot development platforms ask users to enter many phrases for a given intent. These phrases are training data for the Natural Language Processing (NLP) algorithm.
The NLP algorithm is a machine learning algorithm that trains on this data so that it can recognize phrases that have the same meaning as the training phrases but use different words.
The NLP algorithm combines the intent data provided by the bot developer with a huge corpus of language data that it was trained on previously, and uses both to calibrate its internal model so that it can recognize new phrases.
The more examples given to the NLP algorithm by the chatbot developer, the more accurately it will be able to recognize the same meaning in other phrases that have different wording. At least that is the message given to bot developers.
The problem is that not all training data is created equal. The quality of the data matters as much as the quantity.
For example, imagine I want to create an intent called “reset my password”.
A bot author might start creating the following phrases:
Reset my password
I have forgotten my password
My password is not working
New password please
The problem with the above is that every phrase uses the same word, “password”. When the algorithm trains on this data, it can effectively learn the rule that if the word “password” is in the phrase, then the intent is “reset my password”. This, of course, is wrong. People can say many other phrases without the word “password” in them that have the same meaning as “reset my password”. There are also many phrases with the word “password” in them that don’t mean “reset my password”.
In the world of algorithms, this problem is called overfitting to the training data. The algorithm has overfitted to the word “password” and therefore “believes” that every phrase with the word “password” in it means “reset my password”.
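The effect can be illustrated with a deliberately simple sketch. The code below is not how any real bot platform's NLP works. It is a toy word-count scorer, and the intent names, the stop-word list, the second intent's phrases and the helper functions are all assumptions made up for this illustration. It only shows how training phrases that share one word can push a model toward keying on that word.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list, so that filler words do not decide the result.
STOP_WORDS = {"i", "my", "me", "a", "an", "the", "to", "is", "for",
              "are", "what", "in", "with"}

def tokenize(text):
    """Lowercase a phrase, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def train(examples):
    """Count how often each word appears in the phrases of each intent."""
    counts = defaultdict(Counter)
    for intent, phrases in examples.items():
        for phrase in phrases:
            counts[intent].update(tokenize(phrase))
    return counts

def classify(counts, phrase):
    """Return the intent whose training words best match, or None if nothing matches."""
    tokens = tokenize(phrase)
    scores = {intent: sum(c[t] for t in tokens) for intent, c in counts.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

examples = {
    "reset_password": [
        "Reset my password",
        "I have forgotten my password",
        "My password is not working",
        "New password please",
    ],
    # Illustrative second intent; these phrases are not from the article.
    "contact_support": [
        "I want to talk to a person",
        "Connect me with an agent",
        "Can someone call me back",
        "Get me a human please",
    ],
}

model = train(examples)

# Contains "password" but is not a reset request: still matched to reset_password.
print(classify(model, "What are the rules for choosing a password?"))

# Means roughly "reset my password" but shares no words with the training data.
print(classify(model, "I am locked out of my account"))
```

In this toy setup, the first query is matched to the reset intent purely because of the word “password”, and the second query is not recognized at all. Real NLP engines are far more sophisticated, but narrow training data can push them in the same direction.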
There are other examples of this for the same intent. For example, the bot developer could enter the following phrases:
My credentials aren’t working
My login isn’t working
My password isn’t working
My username isn’t working
This is, of course, a more extreme example of the problem, but the pattern is common when creating training data. This again will cause the algorithm to overfit to the data, but this time to the phrase “isn’t working”.
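A quick way to spot this pattern before training is to count how many phrases each word appears in. The following is a minimal sketch. The function names and the 75% threshold are assumptions chosen for illustration, not a standard from any bot platform.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a phrase and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def overused_words(phrases, threshold=0.75):
    """Return words that occur in at least `threshold` of the phrases."""
    total = len(phrases)
    phrase_counts = Counter()
    for phrase in phrases:
        # Each word is counted once per phrase, however often it repeats.
        phrase_counts.update(tokenize(phrase))
    return sorted(w for w, c in phrase_counts.items() if c / total >= threshold)

phrases = [
    "My credentials aren't working",
    "My login isn't working",
    "My password isn't working",
    "My username isn't working",
]

print(overused_words(phrases))  # → ["isn't", 'my', 'working']
```

Words that show up in nearly every phrase of an intent are candidates for the kind of shortcut the algorithm might latch onto. A common word such as “my” is usually harmless, but a content phrase such as “isn’t working” is worth varying.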
The solution is hopefully obvious by now. Each phrase in the training data needs to be as different as possible from the other phrases in the data set. For example:
My credentials aren’t working.
I need a password reset.
How do I fix my login problem?
Who can help me with signing into the system?
Of course, creating a data set such as the above takes more effort. It may even help to have a thesaurus open, so that synonyms can spark ideas for new phrases.
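One rough way to check whether a set of phrases is varied enough is to measure how many words each pair of phrases shares. The sketch below uses Jaccard similarity (shared words divided by total distinct words) and averages it over all pairs. This is a crude proxy chosen for illustration: it measures word overlap, not meaning, and the function names are hypothetical.

```python
import re
from itertools import combinations

def tokenize(text):
    """Lowercase a phrase and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Word overlap between two phrases: 0.0 means no shared words, 1.0 means the same word set."""
    words_a, words_b = tokenize(a), tokenize(b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def mean_pairwise_similarity(phrases):
    """Average Jaccard similarity over every pair of phrases."""
    pairs = list(combinations(phrases, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

repetitive = [
    "My credentials aren't working",
    "My login isn't working",
    "My password isn't working",
    "My username isn't working",
]

varied = [
    "My credentials aren't working",
    "I need a password reset",
    "How do I fix my login problem?",
    "Who can help me with signing into the system?",
]

print(round(mean_pairwise_similarity(repetitive), 2))
print(round(mean_pairwise_similarity(varied), 2))
```

The varied set scores much lower than the repetitive one, which is the direction to aim for. A low score does not guarantee good training data, since phrases also need to share the intent's meaning, but a high score is a useful warning sign.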
Another way that bot developers overcome this problem is by using customer service chat data, which provides many examples of the ways a real customer would ask the same question. This data can be extremely valuable.
There is an open question about when NLP algorithms will be able to perform well on just a small training data set. It would clearly be better if a bot could work well using only a small dataset. Researchers are working on this, because it would not only reduce the time and effort needed to create chatbots but also greatly improve their quality.
The NLP algorithm is a black box to most bot developers. It is important, however, that they understand the basics of how the algorithm works, so that they know what kind of training data (intent data) to provide to yield the best results.