Our chatbot is based on the Transformer model. The Transformer is a deep learning model introduced in 2017 and used in NLP for tasks such as machine translation and text summarization. The core idea behind the Transformer is self-attention: the ability to attend to different positions of the input sequence to compute a representation of that sequence. The two main components of the Transformer are the Encoder and the Decoder. Let's consider the images below, which come from the Attention Is All You Need paper and illustrate the enumerated parts.
The role of the Encoder is to encode the input sentence into a context vector. For a sentence of 5 tokens, its context vector is Z = (z1, z2, z3, z4, z5). Crucially, each context vector has seen every token at every position of the input sentence, unlike the hidden state of an RNN, which only summarizes the tokens seen so far. Let's dive into the details.
- Tokens are passed through a standard Embedding Layer. At this point there is no information about the order of tokens in the sequence.
- The Positional Embedding Layer takes the position of each token in the sentence as its input. Token and positional embeddings are then summed. Before the sum, token embeddings are multiplied by a scaling factor, the square root of the model depth, which keeps their magnitude comparable to the positional embeddings. In the next step, dropout is applied to the combined embeddings.
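The scale-then-sum step above can be sketched in NumPy. This is an illustration only, not the article's TensorFlow code; the vocabulary size, sequence length and depth are arbitrary choices, and the tables are random stand-ins for trained embeddings.

```python
import numpy as np

# Hypothetical sizes for illustration (not from the article).
vocab_size, max_len, d_model = 100, 10, 64
rng = np.random.default_rng(0)

token_emb = rng.normal(size=(vocab_size, d_model))  # token embedding table
pos_emb = rng.normal(size=(max_len, d_model))       # positional embedding table

tokens = np.array([5, 17, 3, 42, 8])                # a 5-token input sentence

# Token embeddings are scaled by sqrt(d_model) before positions are added.
x = token_emb[tokens] * np.sqrt(d_model)
x = x + pos_emb[: len(tokens)]                      # inject order information

print(x.shape)  # (5, 64)
```

In the real model, dropout would be applied to `x` before it enters the first encoder layer.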
- The combined embeddings are now passed through the encoder layers. First, a source mask is applied to the Multi-Head Attention mechanism: since we do not want the network to pay attention to padding, we mask it out. The mask assigns 0 to pad tokens and 1 to all others.
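A minimal sketch of that padding mask, assuming (as is common, though not stated in the article) that the pad token has id 0:

```python
import numpy as np

PAD = 0  # assumed id of the padding token

def padding_mask(seq):
    # 1 for real tokens, 0 for padding, as described above
    return (seq != PAD).astype(np.int32)

seq = np.array([[7, 12, 5, 0, 0]])  # a padded batch of one sentence
print(padding_mask(seq))            # [[1 1 1 0 0]]
```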
- Multi-Head Attention is applied to the sentence itself (self-attention) so that the model can focus on the most relevant parts of the input sequence, by calculating a similarity score between each word and every other word in the sentence. The mechanism is explained with an example later in this article.
- The results from each head are concatenated and projected by a weight matrix to restore the input vector size. A residual connection carries forward the embedding information, including the positional signal, from the previous step. Dropout is performed and layer normalization is applied: each feature vector is normalized to mean 0 and standard deviation 1, which makes the Transformer easier to train.
- The Feed Forward layer adds non-linearity through a ReLU activation and dropout, changing the representation of the data for better generalization. Normalization is applied again and the vectors are projected back to their initial dimensionality. They are then ready to serve as input for the next encoder layer or for the decoder.
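The feed-forward sublayer with its residual connection and layer normalization can be sketched as follows. This is a NumPy illustration with made-up sizes, not the article's TensorFlow implementation; the expand-then-project shape (d_model → d_ff → d_model) follows the standard Transformer design.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each vector to mean 0 and std 1 across its features
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # expand to d_ff, apply ReLU, then project back to d_model
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # illustrative sizes
x = rng.normal(size=(5, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# residual connection, then normalization (dropout omitted for clarity)
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
print(out.shape)  # (5, 8)
```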
The purpose of the decoder is to take the encoded representation of the source sentence and convert it into predicted tokens. During training, the predictions are compared with the target sentence to calculate the loss used to update the model.
- Positional Embeddings are combined with the embedded target tokens, and dropout is applied.
- Masked Multi-Head Attention is applied. The mask is built from two components. The first is a padding mask, analogous to the encoder mask. The second makes the n-th token see only itself and the previous tokens, to prevent the decoder from cheating by simply looking at the next token in the sentence. This look-ahead mask is a lower-triangular matrix: elements above the diagonal are 0 and the rest are 1. The two components are combined with a logical AND to form the final mask. The attention layer is followed by dropout, a residual connection and layer normalization.
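Both mask components and their combination can be sketched in a few lines. As before, this is a NumPy illustration and the pad id of 0 is an assumption:

```python
import numpy as np

def look_ahead_mask(size):
    # lower-triangular: token i may attend to positions 0..i only
    return np.tril(np.ones((size, size), dtype=np.int32))

def combined_mask(seq):
    pad = (seq != 0).astype(np.int32)  # padding mask, pad id assumed to be 0
    la = look_ahead_mask(len(seq))
    return la & pad[np.newaxis, :]     # logical AND of the two components

print(look_ahead_mask(3))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
print(combined_mask(np.array([4, 9, 0])))  # last position is padding
```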
- In the second Multi-Head Attention layer, the queries come from the decoder while the keys and values are the encoder representations. It is again followed by dropout, a residual connection and layer normalization.
- Finally, we pass through the Feed Forward layer and another sequence of dropout, residual connection and layer normalization.
A crucial part of the Transformer is Self-Attention. The self-attention mechanism takes Q (query), K (key) and V (value) as input. These are obtained by multiplying the input embeddings by three weight matrices (Wq, Wk, Wv), which are parameter matrices learned during training.
After finding Q, K and V we use the following equation, which multiplies the attention weights by the value vectors. This ensures that the words we want to focus on are kept as they are, while irrelevant words are flushed out.

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
The dot-product attention is scaled by a factor of the square root of the depth. This is done because for large values of depth, the dot product grows large in magnitude, pushing the softmax into regions where it has very small gradients and behaves like a nearly one-hot, hard softmax.
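The equation and the scaling together can be sketched as plain NumPy. This is an illustrative stand-in for the article's TensorFlow layers; the sequence length and depth are arbitrary:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    # similarity between every pair of positions, scaled by sqrt(d_k)
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        # masked positions get a huge negative score, i.e. ~zero weight
        scores = np.where(mask == 0, -1e9, scores)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))  # self-attention: all come from one sequence
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # (5, 8)
print(w.sum(-1))   # each row of attention weights sums to 1
```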
The Transformer uses N parallel attention layers (Multi-Head Attention), with the difference that each head is scaled by the head dimension, depth/N, instead of the full depth. For each head, the Q, K and V projection matrices are initialized randomly and updated independently by backpropagation during training. As a result, the embedding takes multiple contexts into consideration at the same time.
Since Encoder, Decoder and Multi-Head Attention are already defined, we are ready to combine everything and build Transformer!
- Apply the same preprocessing method to the input sentence that we used to create our dataset.
- Tokenize the input sentence and add padding.
- Calculate the padding masks and the look ahead masks.
- The decoder then outputs predictions by looking at the encoder output and its own previous output.
- Predict the next word from the output probabilities (argmax) and feed it back into the decoder.
Note that in this approach, the decoder predicts the next word based on the previous words it predicted.
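The inference loop above can be sketched as a greedy decoder. The special-token ids, maximum length and `predict_fn` interface are hypothetical; a toy prediction function stands in for the trained Transformer so the loop is runnable on its own:

```python
import numpy as np

START, END, MAX_LEN = 1, 2, 10  # hypothetical special-token ids and length cap

def decode(predict_fn, encoder_output):
    # greedy autoregressive decoding: feed each predicted token back in
    output = [START]
    for _ in range(MAX_LEN):
        probs = predict_fn(encoder_output, output)  # distribution over the vocab
        next_id = int(np.argmax(probs))             # pick the most likely word
        if next_id == END:
            break
        output.append(next_id)
    return output[1:]  # strip the start token

# Toy stand-in for a trained model: predicts token 5 twice, then END.
def toy_predict(enc_out, out_so_far):
    probs = np.zeros(10)
    probs[5 if len(out_so_far) < 3 else END] = 1.0
    return probs

print(decode(toy_predict, encoder_output=None))  # [5, 5]
```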
We have implemented a chatbot in TensorFlow 2.0 and walked through the Transformer architecture! Thank you for reading :).