Ever wanted to know if your company chatbot would work better on IBM Watson than on Dialogflow? Or should you give Wit.ai or Amazon Lex a try? With the help of Botium you can easily benchmark and compare the performance of established chatbot platforms with your own data.
I recently wrote an article about Quality Metrics for NLU/Chatbot. It shows the classical statistical approach to evaluating the performance of an NLP engine with precision, recall and F1, and it introduces the concept of a confusion matrix and how to extract the information relevant for chatbot builders.
While these performance metrics point out possible strengths and weaknesses, using training data for testing purposes has a fundamental flaw: it is no challenge for an artificial intelligence to correctly classify something it already knows. The real challenge is classifying something it hasn't seen before. That's why some clever data scientists invented cross validation. I won't go into detail here, but the basic principle is easy to understand:
- Split the data into two parts. First part is used for training the artificial intelligence, the other part is used for testing the artificial intelligence.
- To remove flakiness, do this several times and average the outcome.
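In code, the basic principle could be sketched like this (a toy Python illustration, not Botium code; the `train_and_score` callback is a stand-in for whatever trains a model on the first part and scores it on the second):

```python
import random

def holdout_scores(samples, train_and_score, test_ratio=0.2, repeats=5, seed=42):
    """Repeat a random train/test split several times and average the score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_ratio))
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(train_and_score(train, test))
    # averaging over several random splits removes flakiness
    return sum(scores) / len(scores)
```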
In chatbot terms, this means:
- For each intent, remove some of the user examples and train a new NLU model
- Evaluate the removed user examples and compare the predicted intent to the expected intent
- Calculate precision, recall and F1 and average over all intents
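The last step can be sketched in a few lines: compute per-intent precision, recall and F1 from expected vs. predicted intents, then macro-average over all intents (illustration only; Botium computes these metrics for you):

```python
from collections import Counter

def macro_prf1(expected, predicted):
    """Per-intent precision/recall/F1, macro-averaged (each intent weighted equally)."""
    intents = set(expected) | set(predicted)
    tp, fp, fn = Counter(), Counter(), Counter()
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            tp[exp] += 1
        else:
            fp[pred] += 1  # the predicted intent got a wrong hit
            fn[exp] += 1   # the expected intent was missed
    precision = recall = f1 = 0.0
    for intent in intents:
        p = tp[intent] / (tp[intent] + fp[intent]) if tp[intent] + fp[intent] else 0.0
        r = tp[intent] / (tp[intent] + fn[intent]) if tp[intent] + fn[intent] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precision, recall, f1 = precision + p, recall + r, f1 + f
    n = len(intents)
    return precision / n, recall / n, f1 / n
```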
A popular variant is k-fold cross validation, where the data is split into k pieces instead of just two. Sounds complicated, but again easy to understand in chatbot terms:
- For each intent, split the user examples into k pieces
- Use every piece except one for training a new NLU model
- Evaluate the remaining piece, compare predicted intent to expected intent, calculate precision, recall and F1 average
- Repeat steps 2 and 3 until every piece has been used for testing once, then average the results
You can find a good introduction to this topic here. In the rest of the article we are using a stratified k-fold Monte Carlo cross validation algorithm, which basically means that it randomizes the user examples and splits them into the same number of pieces for every intent, preserving the overall distribution of user examples per intent.
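A minimal Python sketch of such a stratified split (illustration only; the real implementation ships with Botium):

```python
import random

def stratified_kfold(utterances_by_intent, k=5, seed=42):
    """Shuffle each intent's examples and deal them round-robin into k folds,
    so every fold keeps roughly the same per-intent distribution."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for intent, examples in utterances_by_intent.items():
        shuffled = examples[:]
        rng.shuffle(shuffled)
        for i, utterance in enumerate(shuffled):
            folds[i % k].append((intent, utterance))
    # each round, one fold is the test set, the other k-1 folds form the training set
    for i in range(k):
        test = folds[i]
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        yield train, test
```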
Botium, the Selenium for Chatbots, includes functionality to evaluate the performance of your training data on supported platforms and output relevant quality metrics. It is part of the Open Source Botium Stack (MIT License). Install Botium CLI before proceeding.
Prepare Botium Configuration
For supported platforms, prepare configuration files. Botium supports most chatbot platform technologies out there, including IBM Watson Assistant, Google Dialogflow, Wit.ai and Amazon Lex.
You need accounts with the providers, of course, but all of them have a free tier which is fine for starting evaluation.
The configuration files look like this; please consult the documentation pages linked above for the details of a specific platform:
"PROJECTNAME": "Botium Project wit.ai",
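The line above belongs in the capabilities section of a botium.json file. A minimal sketch for wit.ai might look like this (the CONTAINERMODE and WITAI_TOKEN capability names are assumptions based on the Botium wit.ai connector; check the documentation linked above for your platform):

```json
{
  "botium": {
    "Capabilities": {
      "PROJECTNAME": "Botium Project wit.ai",
      "CONTAINERMODE": "witai",
      "WITAI_TOKEN": "<your server access token>"
    }
  }
}
```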
Prepare Your Data
In case you are already using one of the supported platforms, you can automatically extract the data, namely intents and utterances (user examples), from it:
> botium-cli nlpextract --config /pathto/watson.botium.json --convos /pathto/outputdirectory --verbose
This command will write several text files to the given output directory with the filename pattern intentname.utterances.txt (see Botium Wiki). The first line of each file contains the intent name, the following lines are the utterances (user examples). An example (the intent name GET_TEMPERATURE is illustrative):
GET_TEMPERATURE
Is it me or is it hot here?
What's the current temperature?
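Since these are plain text files, producing or consuming them in your own tooling is straightforward. A minimal Python sketch of a parser for the format just described (the GET_TEMPERATURE intent name is illustrative):

```python
def parse_utterances(text):
    """Parse the contents of a Botium *.utterances.txt file: the first
    non-empty line is the intent name, the remaining lines are user examples."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty utterances file")
    return lines[0], lines[1:]
```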
You can also write those files yourself, export them from another source, or use one of the other supported Botium file formats (Text, Excel, CSV, YAML, JSON).
Benchmark Your Data
Now it's time to run the benchmark. Here we are running the stratified k-fold cross validation for your data against IBM Watson (with k=5, meaning one fifth is used for testing and four fifths are used for training):
> botium-cli nlpanalytics k-fold -k 5 --config /pathto/watson.botium.json --convos /pathto/utterances/
Botium creates a separate IBM Watson Assistant workspace for the test runs and cleans it up afterwards. Your original workspace, if there is one, won't be affected. The same principle applies to the other supported platforms.
The benchmark takes some time, and the results are printed out when ready — for each of the 5 rounds and the total average over all rounds:
############# Summary #############
K-Fold Round 1: Precision=0.7653 Recall=0.7708 F1-Score=0.7680
K-Fold Round 2: Precision=0.6958 Recall=0.6875 F1-Score=0.6916
K-Fold Round 3: Precision=0.7460 Recall=0.6806 F1-Score=0.7118
K-Fold Round 4: Precision=0.7361 Recall=0.7361 F1-Score=0.7361
K-Fold Round 5: Precision=0.6931 Recall=0.7083 F1-Score=0.7006
K-Fold Avg: Precision=0.7273 Recall=0.7167 F1-Score=0.7219
Do the same for the other platforms and compare the numbers.
All supported platforms provide real blockbuster artificial intelligence, so don’t expect extreme deviations between the platforms. But small improvements can make a big difference.
Take a test drive now and tell us your results (if not confidential 😁).
Please take part in the Botium community to bring chatbots forward! By contributing you help increase the quality of chatbots worldwide, leading to higher end-user acceptance, which in turn will bring your own chatbot forward! Start with our Contribution Guide!