Benchmark Your Chatbot on Watson, Dialogflow, and more


Florian Treml

Ever wanted to know if your company chatbot would work better on IBM Watson than on Dialogflow? Or should you give Amazon Lex a try? With the help of Botium you can easily benchmark and compare the performance of established chatbot platforms with your own data.

I recently wrote an article about Quality Metrics for NLU/Chatbots. It shows the classical statistical approach to evaluating the performance of an NLP engine with precision, recall and F1. Furthermore, it introduces the concept of a confusion matrix and shows how to extract the information relevant for chatbot builders.
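As a refresher, the macro-averaged version of these metrics can be computed in a few lines of plain Python. The intent names and predictions below are made up for illustration; this is a sketch of the formulas, not Botium's implementation:

```python
from collections import Counter

def intent_metrics(expected, predicted):
    """Macro-averaged precision, recall and F1 over intents.

    `expected` and `predicted` are parallel lists of intent names.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            tp[exp] += 1      # correctly predicted intent
        else:
            fp[pred] += 1     # predicted this intent, but it was wrong
            fn[exp] += 1      # missed the expected intent
    intents = set(expected) | set(predicted)
    precision = recall = f1 = 0.0
    for i in intents:
        p = tp[i] / (tp[i] + fp[i]) if tp[i] + fp[i] else 0.0
        r = tp[i] / (tp[i] + fn[i]) if tp[i] + fn[i] else 0.0
        precision += p
        recall += r
        f1 += 2 * p * r / (p + r) if p + r else 0.0
    n = len(intents)
    return precision / n, recall / n, f1 / n

# toy run: one GREETING utterance was misclassified as TEMPERATURE
expected  = ["GREETING"] * 3 + ["TEMPERATURE"] * 3
predicted = ["GREETING", "GREETING", "TEMPERATURE"] + ["TEMPERATURE"] * 3
p, r, f1 = intent_metrics(expected, predicted)
```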

  • Split the data into two parts: the first part is used for training the artificial intelligence, the other part for testing it.
  • To remove flakiness, do this several times and average the outcomes.

In chatbot terms, this means:

  1. For each intent, remove some of the user examples and train a new NLU model
  2. Evaluate the removed user examples and compare the predicted intent to the expected intent
  3. Calculate precision, recall and F1 and average over all intents
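The per-intent holdout split in step 1 might be sketched like this; the training and evaluation in steps 2 and 3 happen on the NLU platform itself, and the sample data and helper below are purely illustrative:

```python
import random

def holdout_split(intent_examples, test_fraction=0.2, seed=42):
    """Per-intent holdout split: remove a fraction of each intent's
    user examples for testing, keep the rest for training."""
    rng = random.Random(seed)
    train, test = {}, {}
    for intent, examples in intent_examples.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test[intent] = shuffled[:n_test]    # held out for evaluation
        train[intent] = shuffled[n_test:]   # used to train the new NLU model
    return train, test

# made-up sample data
data = {
    "TEMPERATURE": ["Is it me or is it hot here?", "What's the current temperature?",
                    "How warm is it?", "Is it cold outside?", "Temperature please"],
    "GREETING": ["hello", "hi there", "good morning", "hey", "howdy"],
}
train, test = holdout_split(data)  # train the NLU model on `train`, evaluate on `test`
```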
[Image from Wikipedia: k-fold cross validation]

With k-fold cross validation, this becomes:
  1. For each intent, split the user examples into k pieces
  2. Use every piece except one for training a new NLU model
  3. Evaluate the remaining piece, compare predicted intent to expected intent, calculate precision, recall and F1 average
  4. Repeat steps 2 and 3 for each remaining piece

You can find a good introduction to this topic here. In the rest of the article we are using a stratified k-fold monte-carlo cross validation algorithm (which basically means that the user examples for all intents are randomized and split into the same number of pieces, preserving the overall distribution of user examples per intent).
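The stratified split can be sketched as follows: each intent's user examples are shuffled and dealt round-robin into k equally sized pieces, so every fold keeps roughly the same per-intent distribution as the full data set. This is a simplified illustration with made-up data, not Botium's actual code:

```python
import random

def stratified_kfold(intent_examples, k=5, seed=1):
    """Shuffle each intent's utterances and deal them round-robin into
    k folds, so every fold preserves the per-intent distribution.

    Returns k folds, each a list of (intent, utterance) pairs.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for intent, examples in intent_examples.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        for i, utterance in enumerate(shuffled):
            folds[i % k].append((intent, utterance))
    return folds

# made-up data: 10 utterances per intent -> 2 per intent in each of the 5 folds
data = {"INTENT_A": [f"a{i}" for i in range(10)],
        "INTENT_B": [f"b{i}" for i in range(10)]}
folds = stratified_kfold(data, k=5)
# in each round, one fold is held out for evaluation
# and the other k-1 folds are used for training
```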

Prepare Botium Configuration

You need accounts with the providers, of course, but all of them offer a free tier that is fine for a first evaluation.

The configuration files look like this; please consult the documentation pages linked above for the details of a specific platform:

{
  "botium": {
    "Capabilities": {
      "PROJECTNAME": "Botium Project",
      "WITAI_TOKEN": "...",
      "WITAI_LANG": "en"
    }
  }
}

Prepare Your Data

> botium-cli nlpextract --config /pathto/watson.botium.json --convos /pathto/outputdirectory --verbose

This command will write several text files to the given directory, following the filename pattern intentname.utterances.txt (see Botium Wiki). The first line of each file contains the intent name; the following lines are the utterances (user examples). Example:

Is it me or it is hot here?
What's the current temperature?

You can also write those files yourself, export them from another source, or use one of the supported Botium file formats (text, Excel, CSV, YAML, JSON).
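If you want to generate or consume these files programmatically, a minimal reader and writer for the format (first line intent name, following lines utterances) could look like this; the TEMPERATURE intent is a made-up example:

```python
import tempfile
from pathlib import Path

def write_utterances_file(directory, intent, utterances):
    """Write an <intentname>.utterances.txt file: first line is the
    intent name, the following lines are the user examples."""
    path = Path(directory) / f"{intent}.utterances.txt"
    path.write_text("\n".join([intent, *utterances]) + "\n", encoding="utf-8")
    return path

def read_utterances_file(path):
    """Read the same format back into (intent, utterances)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return lines[0].strip(), [line.strip() for line in lines[1:] if line.strip()]

# round trip with a made-up TEMPERATURE intent
tmpdir = tempfile.mkdtemp()
path = write_utterances_file(tmpdir, "TEMPERATURE",
                             ["Is it me or is it hot here?",
                              "What's the current temperature?"])
intent, utterances = read_utterances_file(path)
```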

Benchmark Your Data

> botium-cli nlpanalytics k-fold -k 5 --config /pathto/watson.botium.json --convos /pathto/utterances/

Botium will create a separate IBM Watson Assistant workspace for each test run and clean it up afterwards. Your original workspace, if there is one, won't be affected. The same principle applies to the other supported platforms.

The benchmark takes some time, and the results are printed out when ready — for each of the 5 rounds and the total average over all rounds:

############# Summary #############
K-Fold Round 1: Precision=0.7653 Recall=0.7708 F1-Score=0.7680
K-Fold Round 2: Precision=0.6958 Recall=0.6875 F1-Score=0.6916
K-Fold Round 3: Precision=0.7460 Recall=0.6806 F1-Score=0.7118
K-Fold Round 4: Precision=0.7361 Recall=0.7361 F1-Score=0.7361
K-Fold Round 5: Precision=0.6931 Recall=0.7083 F1-Score=0.7006
K-Fold Avg: Precision=0.7273 Recall=0.7167 F1-Score=0.7219
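As a quick sanity check on the summary: the averaged precision and recall are the arithmetic means of the per-round values, and the averaged F1-Score in this run matches the harmonic mean of those two averages (whether Botium computes it exactly this way is an assumption):

```python
# per-round (precision, recall) pairs copied from the summary output
rounds = [
    (0.7653, 0.7708),
    (0.6958, 0.6875),
    (0.7460, 0.6806),
    (0.7361, 0.7361),
    (0.6931, 0.7083),
]
p_avg = sum(p for p, _ in rounds) / len(rounds)
r_avg = sum(r for _, r in rounds) / len(rounds)
f1_avg = 2 * p_avg * r_avg / (p_avg + r_avg)  # harmonic mean of the averages
print(f"Avg: Precision={p_avg:.4f} Recall={r_avg:.4f} F1-Score={f1_avg:.4f}")
# → Avg: Precision=0.7273 Recall=0.7167 F1-Score=0.7219
```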

Do the same for the other platforms and compare the numbers.

All supported platforms provide state-of-the-art artificial intelligence, so don't expect extreme deviations between the platforms. But even small improvements can make a big difference.

Take a test drive now and tell us your results (if not confidential 😁).
