When starting the development of a chatbot, one of the critical decisions is which conversational AI platform to use, and to make that decision we have to evaluate how each of them behaves. One of the most relevant papers on this topic is SIGDIAL22, which proposes three corpora (datasets) for exactly this kind of testing. The problem is that the amount of data used in the study is far smaller than what a real bot handles. A description of the three corpora:
- Chatbot: trained with 100 sentences classified into 2 intents.
- Ask Ubuntu: trained with 53 sentences classified into 5 intents.
- Web Applications: trained with 30 sentences classified into 8 intents.
But how do these conversational AIs perform in a real situation?
We evaluated them using a real chatbot with 854 sentences (in English) classified into 126 intents, or 127 if we count the intent “None”, which means the input sentence does not match any of the training intents. We also used a test dataset of 82 sentences asked by real users, none of which are part of the training set.
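The “None” intent is typically produced by a confidence threshold: if the classifier is not confident enough in its best guess, the bot falls back to “None” instead of forcing a match. A minimal sketch of that idea, where the threshold value and the result shape are illustrative assumptions rather than any specific provider's API:

```javascript
// Sketch of "None" intent handling via a confidence threshold.
// The threshold (0.5) and the result shape { intent, score } are
// assumptions for illustration; each provider exposes its own scores.
function resolveIntent(result, threshold = 0.5) {
  // If the classifier is not confident enough, fall back to "None".
  return result.score >= threshold ? result.intent : 'None';
}

// Stub results standing in for a provider's classification output.
console.log(resolveIntent({ intent: 'greeting', score: 0.92 })); // 'greeting'
console.log(resolveIntent({ intent: 'greeting', score: 0.21 })); // 'None'
```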
These are the evaluated systems: Microsoft LUIS, Google Dialogflow, IBM Watson Assistant, SAP Conversational AI, and NLP.js.
We also wanted to evaluate Amazon Lex, but its limit of 100 intents per bot means the results would not be comparable with the other providers: we had to remove 26 intents and test it with only 723 of the 854 sentences.
The first evaluation, after training with those 854 sentences classified into 126 intents, is to check every sentence of the training data, expecting the system to return the same intent it was trained with.
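This check can be sketched as a small evaluation loop. Here `classify` is a hypothetical placeholder for whichever provider API is under test, and the two-sentence dataset and stub classifier exist only to make the sketch runnable:

```javascript
// Sketch of the evaluation: send every labeled sentence to the trained
// model and count how many come back with a different intent than the
// one they were labeled with.
function countErrors(dataset, classify) {
  let errors = 0;
  for (const { sentence, intent } of dataset) {
    if (classify(sentence) !== intent) {
      errors += 1;
    }
  }
  return errors;
}

// Tiny illustrative dataset and a stub classifier.
const training = [
  { sentence: 'hello there', intent: 'greeting' },
  { sentence: 'bye for now', intent: 'farewell' },
];
// Always answers 'greeting', so it misclassifies the second sentence.
const stubClassify = (sentence) => 'greeting';
console.log(countErrors(training, stubClassify)); // 1
```

The same loop works for both evaluations below: run it over the training sentences first, then over the held-out test sentences.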
These are the error counts (sentences not classified to the correct intent) on the training sentences:
- Microsoft LUIS: fails to predict the intent of 20 sentences
- Google Dialogflow: fails to predict the intent of 27 sentences
- IBM Watson Assistant: fails to predict the intent of 27 sentences
- SAP Conversational AI: fails to predict the intent of 10 sentences
- NLP.js: fails to predict the intent of 1 sentence
The test dataset contains 82 sentences, said by real users, that are not in the training data, so the conversational AIs have never seen them but must still match them to the correct intent. These are the error counts for each one:
- Microsoft LUIS: fails to predict the intent of 18 sentences
- Google Dialogflow: fails to predict the intent of 19 sentences
- IBM Watson Assistant: fails to predict the intent of 6 sentences
- SAP Conversational AI: fails to predict the intent of 20 sentences
- NLP.js: fails to predict the intent of 3 sentences
These are the combined results for each one:
- Microsoft LUIS: fails to predict the intent of 20 training sentences and 18 test sentences
- Google Dialogflow: fails to predict the intent of 27 training sentences and 19 test sentences
- IBM Watson Assistant: fails to predict the intent of 27 training sentences and 6 test sentences
- SAP Conversational AI: fails to predict the intent of 10 training sentences and 20 test sentences
- NLP.js: fails to predict the intent of 1 training sentence and 3 test sentences
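Since the training and test sets have very different sizes (854 versus 82 sentences), the raw counts are easier to compare as error rates. A short script that converts the numbers above:

```javascript
// Error counts from the evaluation above (854 training sentences,
// 82 test sentences), converted into percentage error rates.
const results = {
  'Microsoft LUIS': { train: 20, test: 18 },
  'Google Dialogflow': { train: 27, test: 19 },
  'IBM Watson Assistant': { train: 27, test: 6 },
  'SAP Conversational AI': { train: 10, test: 20 },
  'NLP.js': { train: 1, test: 3 },
};

const rate = (errors, total) => `${((errors / total) * 100).toFixed(2)}%`;

for (const [name, { train, test }] of Object.entries(results)) {
  console.log(`${name}: train ${rate(train, 854)}, test ${rate(test, 82)}`);
}
```

For example, SAP Conversational AI's 10 training errors are only about 1.17% of the training set, while its 20 test errors are about 24.39% of the test set; that gap is what the next paragraph describes.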
Looking at the last chart, we can see that SAP Conversational AI is very good at guessing intents from the training data, but not so good at generalizing to sentences it has never seen. This is called “overfitting”: the model fits the training data too closely and does not work as well on other data.
On the other hand, Watson Assistant is in the opposite situation: it is very good at generalizing to other data, but not so good with the data used in training.
And then we have Microsoft LUIS and Google Dialogflow, which are very balanced but on average worse than the other competitors.
NLP.js is an open-source project that is slowly but steadily gaining traction in the development community. The latest updates have improved its performance quite significantly. If you would like to help improve NLP.js, it is open for contributions; and if you use it in your own projects, I would love to hear how it performs.