In an earlier blog post on Conversational Assistants and Quality that can be found here (and here), I provided a link to a Python notebook that does an automated k-fold test and some intent and entity analysis of your Watson Assistant workspace. I have shared it with a few different Watson Assistant users, and they all seem to like the data that it provides. What they often have issues with is understanding what that data is telling them.
The common set of measures and scores returned when doing statistical testing of a chatbot includes Accuracy, Precision, Recall and F1 Score. We’ll go through these measures one by one.
Something to keep in mind as you read this: don’t get focused on delivering the “perfect” chatbot. You want to test and measure your chatbot, but focus on the trends and the relative values of your various intents.
What are TP, TN, FP and FN?
These represent all possible outcomes for any given intent, for our testing data. The first result is the True Positive (TP). We like TP — it means that for a given phrase, we wanted it to resolve to intent X, and our chatbot DID CORRECTLY resolve it to intent X.
Our next result is the True Negative (TN). We also like TN — it means that for a given phrase, we wanted it to resolve to some intent other than X, and our chatbot DID CORRECTLY resolve it to that other intent.
Now we get to the areas where our chatbot isn’t doing so well. The first of these measures is the False Positive (FP). False positives occur when a phrase resolves to our intent X (it’s a positive match), even though it should have resolved to some other intent (it’s false).
The final type of result is the False Negative (FN). False negatives occur when a phrase resolves to some other intent (it’s a negative match), even though it should have resolved to intent X (it’s false).
When we do our k-fold testing, we are able to test what our chatbot model returns as an intent, and compare that with what the training data says should have been the returned intent. That will yield one of the situations above.
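To make the four outcomes concrete, here is a minimal sketch that counts them for one intent. The function name `outcome_counts`, the intent names, and the sample results are all made up for illustration; they are not part of the notebook. It also assumes the usual one-vs-rest convention, where any phrase that was neither expected nor predicted to be intent X counts as a true negative for X.

```python
def outcome_counts(pairs, intent):
    """Count TP, TN, FP and FN for one intent.

    pairs is a list of (expected_intent, predicted_intent) tuples,
    one per test phrase.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for expected, predicted in pairs:
        if expected == intent and predicted == intent:
            counts["TP"] += 1  # wanted X, got X
        elif predicted == intent:
            counts["FP"] += 1  # got X, but wanted something else
        elif expected == intent:
            counts["FN"] += 1  # wanted X, but got something else
        else:
            counts["TN"] += 1  # X was neither wanted nor returned
    return counts


# Hypothetical test results: (expected, predicted)
pairs = [
    ("greeting", "greeting"),
    ("greeting", "goodbye"),
    ("goodbye", "goodbye"),
    ("order_status", "greeting"),
    ("order_status", "order_status"),
]

print(outcome_counts(pairs, "greeting"))
# {'TP': 1, 'TN': 2, 'FP': 1, 'FN': 1}
```

Note that the same test result lands in a different bucket depending on which intent you are looking at: the phrase that was expected to be a greeting but resolved to goodbye is a false negative for greeting and a false positive for goodbye.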
Accuracy is a measure of how well the chatbot is determining proper intents. It is a measure of how many times we are getting things correct, divided by the total number of measurements. Accuracy can be looked at as a percentage or a ratio — a perfect model would have an accuracy of 1.
You should think of accuracy as being how well you are predicting the right user intents. An accuracy of greater than 80% is pretty good. If you need to improve accuracy, it usually means that you need to add more training data for your intent.
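As a rough illustration of "correct divided by total", the sketch below scores a small set of test results. The function name `accuracy` and the sample data are hypothetical, not taken from the notebook.

```python
def accuracy(pairs):
    """Fraction of test phrases whose predicted intent matches the expected one."""
    if not pairs:
        raise ValueError("no test results to score")
    correct = sum(1 for expected, predicted in pairs if expected == predicted)
    return correct / len(pairs)


# Hypothetical test results: (expected, predicted)
results = [
    ("greeting", "greeting"),
    ("greeting", "greeting"),
    ("goodbye", "goodbye"),
    ("order_status", "greeting"),
    ("order_status", "order_status"),
]

print(accuracy(results))  # 0.8
```

Four of the five phrases resolved to the expected intent, so this made-up chatbot sits right at the 80% mark.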
Precision is a measure of how well the chatbot is determining a particular intent: how confident can I be in a prediction of some intent? It is the number of times we correctly predicted an intent, divided by the total number of times (both right and wrong) we predicted that intent. Precision can be looked at as a percentage or a ratio; a perfect model would have a precision of 1. If my chatbot correctly predicted intent X 9 times, and once predicted intent X incorrectly, then we would have a precision of 0.9 (9 correct / (9 correct + 1 incorrect)).
You should think of precision as how much you can trust a prediction of a particular intent X. Using the above example, if my chatbot predicted intent X, I could be about 90% confident that it did so correctly. A precision of greater than 80% is pretty good. Improving precision means getting your intent better defined: your intent is “reaching”, and data that should not get classified to your intent is getting assigned to it. Better examples can help this measure, as can checking for training data that is similar between different intents (ambiguous training data) and eliminating it.
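The precision arithmetic from the example can be sketched in a couple of lines. The function name `precision` is my own, and returning 0.0 when the intent was never predicted is an assumption made here to avoid dividing by zero; other tools may handle that case differently.

```python
def precision(tp, fp):
    """Correct predictions of an intent divided by all predictions of that intent."""
    predicted = tp + fp
    # Assumption: report 0.0 if the intent was never predicted at all.
    return tp / predicted if predicted else 0.0


# 9 correct predictions of intent X, 1 incorrect prediction of intent X
print(precision(tp=9, fp=1))  # 0.9
```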
Recall is a measure of how well the chatbot surfaces a particular intent: how confident can I be that my chatbot will recognize some intent? It is the number of times we correctly predicted an intent, divided by the total number of times we SHOULD HAVE predicted that intent. Recall can be looked at as a percentage or a ratio; a perfect model would have a recall of 1. If my chatbot correctly predicted intent X 8 times, and twice incorrectly predicted intent Y instead of intent X, then we would have a recall of 0.8 (8 correct / (8 correct + 2 incorrect)).
You should think of recall as how well you are identifying a particular intent X. Using the above example, if I am worried about catching intent X, I can expect to catch about 80% of the phrases that really are intent X. A recall of greater than 80% is pretty good. Better training examples and more examples can help recall. Balancing your training data, with similar numbers of examples for similar intents, can also help improve recall scores.
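Here is the same kind of sketch for recall, using the numbers from the example. Again, the function name `recall` is hypothetical, and returning 0.0 when the test data contains no phrases for the intent is an assumption of this sketch.

```python
def recall(tp, fn):
    """Correct predictions of an intent divided by all phrases that belong to it."""
    actual = tp + fn
    # Assumption: report 0.0 if no test phrase belongs to the intent.
    return tp / actual if actual else 0.0


# 8 phrases correctly resolved to intent X, 2 phrases that should have
# been intent X but resolved to intent Y
print(recall(tp=8, fn=2))  # 0.8
```

The only difference from precision is the denominator: precision divides by everything that was predicted as X, while recall divides by everything that should have been X.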
The F1 score is what you share with stakeholders. It combines precision and recall into a single score, one which you should be tracking over time. It is the harmonic mean of the Precision and Recall scores. The F1 score always falls between the precision and recall values, and it is pulled toward the lower of the two, so a weak precision or a weak recall will drag it down.
Think of an intent where we have a precision of 0.95 (which is really good) and a recall of 0.90 (which is pretty good). Our resulting F1 score would be 0.92. An F1 score of 0.8 or better is pretty good (since a “perfect” chatbot would have an F1 score of 1.0).
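The worked example can be checked with a short sketch. The function name `f1_score` is my own choice, and returning 0.0 when both inputs are zero is an assumption of this sketch. The second call is a made-up case showing how the harmonic mean behaves when precision and recall are far apart.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    # Assumption: report 0.0 when both precision and recall are zero.
    return 2 * precision * recall / total if total else 0.0


# The example from the text: precision 0.95, recall 0.90
print(round(f1_score(0.95, 0.90), 2))  # 0.92

# A lopsided, made-up intent: perfect precision, poor recall
print(round(f1_score(1.0, 0.2), 2))  # 0.33
```

In the lopsided case a plain average would give 0.6, but the F1 score is 0.33, which is why a single weak measure is hard to hide behind a strong one.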
Accuracy: Accuracy measures the ratio of correctly predicted user examples out of all user examples.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision: Precision measures the ratio of correctly predicted positive observations out of total predicted positive observations.
Precision = TP / (TP + FP)
Recall: Recall measures the ratio of correctly predicted positive observations out of all observations of the target intent.
Recall = TP / (TP + FN)
F1 Score: F1 Score is the harmonic mean of Precision and Recall.
F1 = (2 * Precision * Recall) / (Precision + Recall)
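Putting the four formulas together, the sketch below builds a small per-intent report from a list of test results. Everything here is illustrative: the function name `intent_report`, the intent names, and the sample data are invented, and the choice to report 0.0 whenever a denominator is zero is an assumption. It is not how the notebook computes its numbers.

```python
def intent_report(pairs):
    """Per-intent accuracy, precision, recall and F1.

    pairs is a list of (expected_intent, predicted_intent) tuples.
    """
    if not pairs:
        raise ValueError("no test results to score")
    total = len(pairs)
    intents = sorted({intent for pair in pairs for intent in pair})
    report = {}
    for intent in intents:
        tp = sum(1 for e, p in pairs if e == intent and p == intent)
        fp = sum(1 for e, p in pairs if e != intent and p == intent)
        fn = sum(1 for e, p in pairs if e == intent and p != intent)
        tn = total - tp - fp - fn
        # Assumption: any ratio with a zero denominator is reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        report[intent] = {
            "accuracy": (tp + tn) / total,
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return report


# Hypothetical test results: (expected, predicted)
pairs = [
    ("greeting", "greeting"),
    ("greeting", "goodbye"),
    ("goodbye", "goodbye"),
    ("order_status", "greeting"),
    ("order_status", "order_status"),
]

for intent, scores in intent_report(pairs).items():
    print(intent, {name: round(value, 2) for name, value in scores.items()})
```

Looking at the intents side by side like this is what lets you compare their relative values, rather than chasing a perfect score for any single one.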