When developing an NLU, one of the big decisions is how to implement the tokenizer and the stemmer.
The tokenizer defines how to split sentences into words. The problem is that there are different approaches to doing it. Imagine the sentence “You can’t”: some tokenizers split it into [you, can, t], others into [you, ca, n’t], and others into [you, can, not].
The stemmer calculates the stem, or root, of a word. For example, given the words “developer”, “developed”, “developing” and “development”, it is easy to detect that they are all derived from the stem “develop”.
Some of the commercial products decide not to use stemmers. Is this a good decision or a bad one? Let’s measure the impact on a real project, by measuring the percentage of errors of the NLU over sentences from real users of the chatbot.
The simplest tokenizer just splits into words using the regular expression \W+.
The Treebank Word Tokenizer is commonly used, and goes further, handling the apostrophe when splitting.
The Aggressive Tokenizer is the one implemented in NLP.js; it also handles apostrophes, but the result is different from that of the Treebank Word Tokenizer.
Let’s see how these three tokenizers work:
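As a rough illustration, here is a sketch of the three splitting strategies in plain JavaScript. These are toy functions written for this article, not the actual Treebank or NLP.js implementations, which handle many more cases:

```javascript
// 1. Naive tokenizer: split on runs of non-word characters (\W+).
function naiveTokenize(text) {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

// 2. Treebank-style: keep the contraction suffix as its own token
//    ("can't" -> "ca" + "n't").
function treebankStyleTokenize(text) {
  return text
    .toLowerCase()
    .replace(/n't\b/g, " n't")
    .split(/\s+/)
    .filter(Boolean);
}

// 3. Contraction-expanding: rewrite common contractions before splitting
//    ("can't" -> "can not"). The contraction table here is a tiny sample.
const CONTRACTIONS = { "can't": 'can not', "won't": 'will not' };
function expandingTokenize(text) {
  let t = text.toLowerCase();
  for (const [from, to] of Object.entries(CONTRACTIONS)) {
    t = t.split(from).join(to);
  }
  return t.split(/\W+/).filter(Boolean);
}

console.log(naiveTokenize("You can't"));         // [ 'you', 'can', 't' ]
console.log(treebankStyleTokenize("You can't")); // [ 'you', 'ca', "n't" ]
console.log(expandingTokenize("You can't"));     // [ 'you', 'can', 'not' ]
```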
Throughout this article we will analyze the figures obtained using these tokenizers and different stemmers. Right now, let’s see the number of errors (remember, using real data from users) without using a stemmer:
Using a stemmer in the NLU of a chatbot has two objectives:
- The first one is to understand the different ways users can ask the same question. For example, if you train your chatbot with “who is your developer”, you want expressions like “who is developing you”, “who is in charge of your development”, “who developed you” or “who are your developers” to match the one used in the training.
- The second one is to reduce the number of features of the neural network used for training. That means fewer variables to calculate, so the training is more accurate and takes less time.
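The second objective is easy to see by counting distinct tokens. The sketch below uses a toy suffix-stripping stemmer written for this example (not a real Porter implementation): stemming collapses “developer”, “developed” and “developing” into one stem, shrinking the vocabulary the network has to learn from.

```javascript
// Toy suffix-stripper: remove a common English suffix if the remaining stem
// is still at least 4 characters long. Real stemmers have many more rules.
const SUFFIXES = ['ment', 'ers', 'ing', 'ed', 'er', 's'];
function toyStem(word) {
  for (const suffix of SUFFIXES) {
    if (word.endsWith(suffix) && word.length - suffix.length >= 4) {
      return word.slice(0, word.length - suffix.length);
    }
  }
  return word;
}

const sentences = [
  'who is your developer',
  'who developed you',
  'who is developing you',
];
const tokens = sentences.flatMap((s) =>
  s.toLowerCase().split(/\W+/).filter(Boolean)
);

const rawVocab = new Set(tokens);               // distinct words
const stemVocab = new Set(tokens.map(toyStem)); // distinct stems

console.log(rawVocab.size, stemVocab.size); // 7 5
```

With only three utterances the vocabulary already drops from 7 features to 5; over a full training corpus the reduction is much larger.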
In English there are three well-known stemmers:
- Lancaster: the most aggressive one; the resulting stems are very short. It can be a good choice for really big text datasets.
- Porter: developed by Martin Porter, it was the most widely used one for many years.
- Porter2: Martin Porter updated his stemmer with Porter2, and designed a language for writing stemmers for other languages, called Snowball. That’s the reason why the Porter2 stemmer is also known as the Snowball stemmer.
Number of errors with the Lancaster stemmer and different tokenizers (remember, lower is better):
Number of errors with the Porter stemmer and different tokenizers (remember, lower is better):
Number of errors with the Porter2 stemmer and different tokenizers (remember, lower is better):
Train an intent “developer” with only the utterance “who is your developer”. Then test these sentences:
- Who are your developers
- Who developed you
- Who is developing you
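The intuition behind this test can be sketched with a toy suffix-stripping stemmer (a simplification written for this article, not any of the real stemmers). Without stemming, “developer”, “developed” and “developing” are three unrelated tokens; with stemming they all collapse to “develop”, so every test sentence shares tokens with the training utterance:

```javascript
// Toy stemmer: strip a common suffix if the stem stays >= 4 characters.
const SUFFIXES = ['ment', 'ers', 'ing', 'ed', 'er', 's'];
function toyStem(word) {
  for (const suffix of SUFFIXES) {
    if (word.endsWith(suffix) && word.length - suffix.length >= 4) {
      return word.slice(0, word.length - suffix.length);
    }
  }
  return word;
}
function stemTokens(sentence) {
  return sentence.toLowerCase().split(/\W+/).filter(Boolean).map(toyStem);
}

const training = stemTokens('who is your developer');
const tests = [
  'Who are your developers',
  'Who developed you',
  'Who is developing you',
];
for (const t of tests) {
  // Stems shared with the training utterance; all include 'develop'.
  const shared = stemTokens(t).filter((tok) => training.includes(tok));
  console.log(t, '->', shared);
}
```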
SAP Conversational AI does not seem to be using stemmers.
Microsoft LUIS is also not using stemmers. The results are under the 0.5 threshold, probably due to the weight of the word “who”.
DialogFlow is also not using stemmers.
IBM Watson seems to be the only one of the big commercial NLUs using stemmers:
We have seen in the stats that the impact in English is huge: in NLP.js we go from 26.83% of errors down to 3.66% of errors just by using the proper tokenizer and stemmer.
In English the forms derived from one word are not that many: “develop”, “develope”, “developed”, “developer”, “developers”, “developing”, “development”.
But think of other languages, like Spanish, where the word “develop” is “desarrollar” and the derived words are: “desarrollo”, “desarrollas”, “desarrolla”, “desarrollamos”, “desarrolláis”, “desarrollan”, “desarrollado”, “desarrollador”, “desarrolladores”, “desarrollaba”, “desarrollabas”, “desarrollábamos”, “desarrollabais”, “desarrollaban”, “desarrollé”, “desarrollaste”, “desarrolló”, “desarrollando”, “desarrollamos”, “desarrollasteis”, “desarrollaron”, “desarrollaré”, “desarrollarás”, “desarrollará”, “desarrollaremos”, “desarrollaréis”, “desarrollarán”, “desarrollarías”, “desarrollaría”, “desarrollaríamos”, “desarrollaríais”, “desarrollarían”,…
We can count more than 40 different derived words in Spanish, versus the 7 in English, so the impact in Spanish will be even bigger.
In the other article about NLU benchmarking, we had these figures:
If we look back at the errors on the test data, these are the results sorted from best to worst:
- NLP.js: 3 errors (3.66%)
- IBM Watson: 6 errors (7.32%)
- Microsoft LUIS: 18 errors (21.95%)
- Google DialogFlow: 19 errors (23.17%)
- SAP Conversational AI: 20 errors (24.39%)
What will happen if we pre-stem the whole corpus before training Microsoft LUIS? That is, instead of training Microsoft LUIS with the training sentences, train it with the stems of the training sentences. Then, when calling LUIS with a sentence from a user, apply a stemmer to it before passing it to LUIS.
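The pre-stemming experiment can be sketched like this. `toyStem` is a simplified suffix-stripper written for this article, and `trainProvider`/`queryProvider` are hypothetical stand-ins for the provider’s real training and prediction calls (this is not the actual LUIS SDK):

```javascript
// Toy stemmer: strip a common suffix if the stem stays >= 4 characters.
const SUFFIXES = ['ment', 'ers', 'ing', 'ed', 'er', 's'];
function toyStem(word) {
  for (const suffix of SUFFIXES) {
    if (word.endsWith(suffix) && word.length - suffix.length >= 4) {
      return word.slice(0, word.length - suffix.length);
    }
  }
  return word;
}
function stemSentence(sentence) {
  return sentence.toLowerCase().split(/\W+/).filter(Boolean).map(toyStem).join(' ');
}

// Train the provider with stemmed utterances instead of the raw ones.
// trainProvider(intent, text) is a hypothetical provider call.
function trainWithStems(trainProvider, corpus) {
  for (const { intent, utterances } of corpus) {
    for (const utterance of utterances) {
      trainProvider(intent, stemSentence(utterance));
    }
  }
}

// Stem the user sentence before asking the provider for the intent.
// queryProvider(text) is a hypothetical provider call.
function queryWithStems(queryProvider, sentence) {
  return queryProvider(stemSentence(sentence));
}
```

With this wrapper, the provider only ever sees stems such as “who is your develop”, both at training time and at prediction time, so it effectively behaves as if it had a stemmer.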
As we can see, the improvement is amazing: a great one on the corpus (exact sentences from the training data), and the number of errors with real data from users is halved. It still fails more than NLP.js in both the corpus and the tests, and more than IBM Watson in the tests; this can be due to the different AI algorithms used inside each NLU.
It seems that using the proper tokenizer and stemmer for each language greatly improves the accuracy of the NLU. We can see that, using NLP.js, we go from 26.83% of errors (73.17% accuracy) to 3.66% of errors (96.34% accuracy).
We observe that the two best ones working with real data are the two that do stemming: NLP.js and IBM Watson.
The experiment of pre-stemming the sentences before training providers that do not do stemming shows that this also improves the results. But, in my humble opinion, it is the NLU provider that should handle the language complexity, not the development team using the NLU that should be in charge of filling the gaps of the NLU provider.