Usually we use conversational AIs for building chatbots as black boxes, without knowing how they work inside. While benchmarking different NLP providers with a big corpus, we made some technical findings that can help us better understand what is inside those black boxes, and also better understand the results of the benchmark.
You can see the results of the benchmark here.
One of the most remarkable things is the false positives. There are sentences that return positives for intents whose training data does not contain any word from the tested sentence. The best example from the benchmark is that the sentence “de” gets scores for intents other than “None”.
When you say “de” to Microsoft LUIS, it returns the intent “smalltalk.hello” with score 0.154514238. The funny thing is that the training data for this intent does not contain the word “de”, nor even this combination of characters. These are the sentences used to train “smalltalk.hello”: “Good morning”, “hey”, “hi”, “hello”, “howdy”.
IBM Watson Assistant returns that “de” is the intent “smalltalk.myhelptopics” with score 0.3767250418663025, and as in the previous case, none of the utterances used to train this intent contains the word “de”, nor that combination of characters as part of any word.
The same happens with other providers: Dialogflow returns the intent “building.transportation” with score 0.33309156, but in this case it is because one of the utterances used to train this intent contains the word “de”.
What can we learn from this? It is very probable that Microsoft LUIS and IBM Watson Assistant do not use a Bayesian approach (the probability of “de” should be 0), nor an LSTM or RNN approach (the characters “de” do not appear in the training utterances). What can explain that “de” causes a false positive for these intents? It can be explained by a residual weight for “de” in a neural network, probably a densely connected one. For Dialogflow, as “de” is part of the utterances of the matched intent, we cannot discard any of the methods mentioned before.
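To illustrate the Bayesian point, here is a minimal sketch (my own illustration, not any provider’s code) of a bag-of-words Naive Bayes score: without smoothing, an utterance made only of unseen tokens gets probability 0 for every intent, so “de” could never reach a score like 0.15 under such a model.

```typescript
// Minimal bag-of-words Naive Bayes score (illustrative sketch only).
type WordCounts = Map<string, number>;

function naiveBayesScore(tokens: string[], counts: WordCounts, totalWords: number): number {
  // P(intent | tokens) ∝ Π P(token | intent); deliberately no smoothing.
  let probability = 1;
  for (const token of tokens) {
    probability *= (counts.get(token) ?? 0) / totalWords;
  }
  return probability; // any unseen token forces the product to 0
}

// Training words for "smalltalk.hello" from the corpus above.
const helloCounts: WordCounts = new Map([
  ['good', 1], ['morning', 1], ['hey', 1], ['hi', 1], ['hello', 1], ['howdy', 1],
]);

console.log(naiveBayesScore(['de'], helloCounts, 6)); // 0
```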
In NLP.js we had the same false positive situation in the past, and the way we solved it was: for each incoming utterance we calculate the stems, and we set to 0 the score of each intent whose training data does not contain any of the stems of the utterance, just before the normalization that calculates the final score. With this feature activated, the intent returned for “de” is “None” with score 1, which is correct. With this feature deactivated, we obtain the intent “None” with score 0.38187459505470855, but the second intent is “smalltalk.hello” with score 0.3324619432848479.
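A minimal sketch of that filtering step (the names are mine, not the actual NLP.js internals):

```typescript
// Sketch of the stem-overlap filter described above.
interface Classification { intent: string; score: number; }

function filterByStemOverlap(
  utteranceStems: Set<string>,
  intentStems: Map<string, Set<string>>, // intent -> stems seen in its training data
  classifications: Classification[],
): Classification[] {
  const filtered = classifications.map(({ intent, score }) => {
    const known = intentStems.get(intent) ?? new Set<string>();
    const overlaps = [...utteranceStems].some((stem) => known.has(stem));
    // Zero the score when the intent's training data shares no stem
    // with the utterance, before the final normalization.
    return { intent, score: overlaps ? score : 0 };
  });
  const total = filtered.reduce((sum, c) => sum + c.score, 0);
  return total === 0
    ? filtered // nothing survives: the caller resolves this to "None"
    : filtered.map((c) => ({ ...c, score: c.score / total }));
}
```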
The main difference between Microsoft LUIS and IBM Watson Assistant appears when checking LUIS with utterances never seen by the AI (remember that “de” was part of a training utterance), like “ax” or “ze”: LUIS returns exactly the same results as for “de”, while Watson does not return any intent for utterances whose stems were never seen by the network. That is also a good approach to avoid false positives, but as it sits at a higher level than the classification itself, it does not filter the fact that an utterance can trigger an intent whose training data does not contain any word from that utterance.
One of the techniques used in NLP is to remove the stopwords, that is, words that do not add much meaning to the sentence. When testing Microsoft LUIS with words like “at”, which are part of the training data, we obtain the same results as with words that are not part of the training data, like “ax” or “ze”. That probably means that the word “at” is removed when tokenizing and stemming the sentences.
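As a rough illustration of what such a preprocessing step might look like (the stopword list here is a tiny sample, not any provider’s real list):

```typescript
// Illustrative stopword removal during tokenization.
const STOPWORDS = new Set(['a', 'an', 'the', 'at', 'of', 'to', 'is']);

function tokenize(utterance: string): string[] {
  return utterance
    .toLowerCase()
    .split(/\W+/)
    .filter((token) => token.length > 0 && !STOPWORDS.has(token));
}

console.log(tokenize('Where is the canteen at?')); // [ 'where', 'canteen' ]
```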
It may be controversial, but removing the stopwords makes the neural network fail much more than using all the words from the sentences. Using the benchmark example, NLP.js fails only 3 utterances from the test set when it does not remove stopwords, but if we activate the option to remove stopwords, it fails 13 utterances from the test set.
So removing stopwords is not a good idea if you want confidence in the results of your NLP.
When you have a corpus to train a conversational AI, you expect that if you use exactly the same corpus on the same platform, you will obtain exactly the same answers for your utterances. That is, if you export a chatbot from Microsoft LUIS and create a new chatbot by importing that data, you expect both chatbots to give the same answer to the same sentence.
But this does not happen with Microsoft LUIS: the same training data generates different scores for the same utterance.
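A simple way to test this with any engine is to train twice on identical data and compare the scores for the same utterance. A sketch using NLP.js (assuming the node-nlp package and the tiny “smalltalk.hello” corpus from above):

```typescript
import { NlpManager } from 'node-nlp';

// Train a fresh model on a tiny corpus and return the score for "de".
async function trainAndScore(): Promise<number> {
  const manager = new NlpManager({ languages: ['en'] });
  for (const utterance of ['Good morning', 'hey', 'hi', 'hello', 'howdy']) {
    manager.addDocument('en', utterance, 'smalltalk.hello');
  }
  await manager.train();
  const result = await manager.process('en', 'de');
  return result.score;
}

// Two independent trainings should produce identical scores when the
// weight initialization is deterministic, as it is in NLP.js.
(async () => {
  const first = await trainAndScore();
  const second = await trainAndScore();
  console.log(first === second); // true for NLP.js; not guaranteed elsewhere
})();
```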
Why does this happen? With neural networks, one of the decisions is how to initialize the weights before starting the training. In NLP.js the same training data generates exactly the same answers because the weights are always initialized to 0, so there is no “random” data in the initialization nor during the training. But one common technique is to initialize the weights with random noise, which means that two different trainings will not start from exactly the same numeric state.
So we can think that this is what is happening in Microsoft LUIS: they are initializing the network with some noise.
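The two initialization strategies, side by side (a sketch, not the actual LUIS or NLP.js code):

```typescript
// Two common weight initialization strategies (illustrative sketch).
function zeroWeights(size: number): number[] {
  // Deterministic: every training run starts from the same state,
  // so identical data always yields identical scores (the NLP.js approach).
  return new Array(size).fill(0);
}

function noisyWeights(size: number, scale = 0.01): number[] {
  // Random noise: two trainings on the same data start from different
  // states and can converge to slightly different scores.
  return Array.from({ length: size }, () => (Math.random() * 2 - 1) * scale);
}
```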
One of the usual steps in an NLP pipeline is the stemmer or lemmatizer: from one word, obtain the stem or the lemma. For example, “developer”, “developed” and “development” all share the stem “develop”.
To do that in English we have the Porter Stemmer, the Lancaster Stemmer and the WordNet Lemmatizer.
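For example, using the Porter implementation from the natural package (an assumption for illustration; NLP.js bundles its own stemmers):

```typescript
import natural from 'natural';

// Porter stemming collapses inflected forms onto a shared stem.
for (const word of ['developer', 'developed', 'development', 'privatize']) {
  console.log(word, '->', natural.PorterStemmer.stem(word));
}
// developer -> develop, developed -> develop,
// development -> develop, privatize -> privat
```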
In the training data we have the utterance “privatize canteen”, so let’s try to analyze the stemmer used:
Well, we have something strange here. “priva”, “priv”, “privatiz” and “privati” have exactly the same effect as using a non-existing word like “flergh” or “blurghghghg”. But the weird thing is that “privat”, which is the stem of “privatize” according to the Porter Stemmer, makes the neural network return the right intent but with a very low confidence.
If we repeat the process with other possible variations of “privatize”, we see that the variations do not produce a positive impact on the score; they behave exactly like a non-existing word.
So we can form a hypothesis: Microsoft LUIS does not use a stemmer. The only weird thing here is: why does “privat” have such a huge negative impact?
In NLP.js we use stemmers for each language; we only skip stemming for languages that are not fully supported, like fantasy languages. The use of stemmers is very important to achieve a better score, and it matters even more in languages with many conjugations, like Spanish, where the same verb can be written in many ways depending on person, number and tense. In English we use the Porter Stemmer because it is less aggressive than the Lancaster Stemmer, and it shows better results in all the English corpora that we tested.
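As a quick illustration of why this matters in Spanish (again using the natural package; the exact stems depend on the implementation):

```typescript
import natural from 'natural';

// Several conjugations of "hablar" (to speak) collapse onto one stem,
// so the classifier sees them all as the same feature.
for (const word of ['hablo', 'hablas', 'hablamos', 'hablaremos']) {
  console.log(word, '->', natural.PorterStemmerEs.stem(word));
}
// All four are expected to collapse to a shared stem such as "habl".
```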
When an NLP returns the scores for the intents, one measure to know if it is performing well and is sure of the answer is the clarity: the gap between intents. Let’s explain it with an example: suppose that you pass an utterance and the NLP returns the intent “hello” with score 0.88, but the second scored intent is “goodbye” with a score of 0.85. Then the gap between both scores is too low, and we cannot be as confident in the result as if the intent “goodbye” had a score of 0.4.
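A minimal sketch of such a clarity check, using the same 0.1 threshold gap applied below:

```typescript
// Clarity check: flag results where the top two intents are too close.
interface Classification { intent: string; score: number; }

function hasClarityIssue(classifications: Classification[], thresholdGap = 0.1): boolean {
  const sorted = [...classifications].sort((a, b) => b.score - a.score);
  if (sorted.length < 2) return false;
  return sorted[0].score - sorted[1].score < thresholdGap;
}

console.log(hasClarityIssue([
  { intent: 'hello', score: 0.88 },
  { intent: 'goodbye', score: 0.85 },
])); // true — a 0.03 gap is below the 0.1 threshold
```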
Using the corpus and tests of the benchmark, and a threshold gap of 0.1, the results for Microsoft LUIS are that 48 utterances from the corpus return a bad clarity (a distance between the scores of the first and second intents of less than 0.1) and 19 utterances from the test data have bad clarity. Here are some examples:
The problem is that this low gap happens even when Microsoft LUIS is telling us that it is very sure of the answer (a score over 0.8).
With NLP.js, the same corpus and tests and the same clarity threshold (0.1) give us that none of the training utterances has a clarity issue, and only 7 of the test utterances have a clarity issue. Also, this small gap happens only with scores of less than 0.53, when the NLP is not sure of the answer, and in 4 of these 7 cases the score is less than 0.5, so the NLP will return the intent “None” because it is not sure of the answer.
A chart showing the clarity issue figures:
Another measure to take into account is the confidence: when the NLP returns the correct intent, it also returns a score representing how confident the NLP is about the answer. So if we check all the correct answers from the NLP, we can build a distribution of the confidence: how many answers were in the 0–10% confidence range, how many in the 10–20% range… up to the 90–100% range.
What we want from the NLP is that, given the correct answer for a corpus utterance, the most frequent range is 90–100%, because the NLP should be very sure about an answer it already knows from training. For the test data we are not so optimistic, but we expect a growing curve: more answers in the 90–100% range than in the 80–90% range, more in the 80–90% than in the 70–80%…
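A sketch of how this distribution can be computed from the scores of the correct answers:

```typescript
// Build a 10-bin confidence distribution from the correct answers.
function confidenceHistogram(scores: number[]): number[] {
  const bins = new Array(10).fill(0); // bins[0] = 0–10%, ..., bins[9] = 90–100%
  for (const score of scores) {
    const bin = Math.min(Math.floor(score * 10), 9); // a score of 1.0 lands in the last bin
    bins[bin] += 1;
  }
  return bins;
}

console.log(confidenceHistogram([0.95, 0.91, 0.87, 0.42]));
// [0, 0, 0, 0, 1, 0, 0, 0, 1, 2]
```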
These are the Microsoft LUIS frequencies of the scores for correct answers:
As we expected, the corpus confidence is very good, with almost all the answers in the 80–90% and 90–100% ranges. Even so, I expected more answers in the 90–100% range, but the results for the corpus are good.
The problem is in the test confidences: we can see that it is not the expected growing curve; it is more irregular, with too many utterances where the NLP returns the correct intent but with a very low confidence in the result. Also, we see that the 90–100% range is not the biggest.
This is the confidence chart for NLP.js for the same corpus:
We can see that NLP.js is very confident with the data used during the training, and for the test data it is not a perfect growing curve, but it is more like what we expected, with most of the correct answers in the 90–100% range.