So let’s try to answer the first question and check the distribution of personality types in the dataset.
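Counting the types can be sketched in a few lines. This is only an illustration, not the exact code behind the chart: the `sample` list is made-up data standing in for the dataset's type column, and `type_distribution` is a hypothetical helper name.

```python
from collections import Counter


def type_distribution(types):
    """Return the share of each type in percent, most common first."""
    counts = Counter(types)
    total = sum(counts.values())
    return {t: 100 * n / total for t, n in counts.most_common()}


# Made-up sample standing in for the type column of the dataset.
sample = ["INFP", "INFP", "INTJ", "ESTJ"]
dist = type_distribution(sample)

for mbti_type, share in dist.items():
    print(f"{mbti_type}: {share:.1f}%")
```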
It seems that there are many more introverts than extraverts in the world… Hmm, on second thought, that distribution doesn’t seem right. Let’s check the statistics from the authors of the test.
That’s interesting! The most common types in the table above seem to be ISTJ and ISFJ, with shares of 11.6% and 13.8%. These results are completely different from what I got when counting the distribution in the Kaggle dataset, where those two types are represented by around 2–2.5% of people.
We see that the discrepancies apply to virtually all types. It looks like people with types INFP, INFJ, INTP and INTJ are most likely to post on a personality types forum.
Moreover, when we recreate the table on the left of the provided image, we can see that it is completely different as well. People with the letters I, F and P in their type acronyms are overrepresented. In our further analysis we must remember that this data is imbalanced.
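A per-letter breakdown like this one can be computed by looking at each of the four positions of the type code separately. The snippet below is a minimal sketch on made-up data; `letter_shares` is a hypothetical helper, not the code used for the table.

```python
from collections import Counter


def letter_shares(types):
    """Share (in percent) of each letter at each of the four code positions."""
    total = len(types)
    shares = []
    for position in range(4):
        counts = Counter(t[position] for t in types)
        shares.append({letter: 100 * n / total for letter, n in counts.items()})
    return shares


# Made-up sample; the real input would be the dataset's type column.
sample_types = ["INFP", "INFJ", "INTP", "ESTJ"]
shares = letter_shares(sample_types)

for axis in shares:
    print(axis)
```

A strong skew in any of the four dictionaries is a sign that the matching binary classifier will see imbalanced classes.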
The conclusion that comes to mind automatically is that introverts are more focused on analyzing their own personality than the more sociable extraverts. For this reason, we can expect them to be more likely to take personality tests and to talk about themselves on forums that guarantee such anonymity. It is also not strange that more emotional people (letter F) write about themselves more often, especially when this is combined with introversion, which often makes it difficult to open up to others.
As we have seen before, the data is a little bit messy. It contains a lot of mixed-case letters, punctuation marks, links and so on. Before we start any analysis we should clean it up. What I did was relatively simple and consisted of:
- Removing links.
- Removing all digits and punctuation.
- Lowercasing all letters.
- Removing stop words.
- (At first I also used lemmatization, but it resulted in a significant drop in accuracy, so I abandoned it in further analysis.)
- Replacing every word with a numerical representation.
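The steps listed above can be sketched with the standard library alone. Treat this as an assumption-heavy illustration: the stop-word set is a tiny made-up list (a real pipeline would use a full one), and mapping each word to an integer id is just one possible "numerical representation", not necessarily the one used here.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "it", "i"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^A-Za-z\s]")


def clean(text):
    """Apply the cleaning steps in the order listed and return the tokens."""
    text = URL_RE.sub(" ", text)         # remove links
    text = NON_LETTER_RE.sub(" ", text)  # remove digits and punctuation
    text = text.lower()                  # lowercase all letters
    return [w for w in text.split() if w not in STOP_WORDS]  # drop stop words


def build_vocabulary(token_lists):
    """Assign each distinct word an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Replace every known word with its integer id."""
    return [vocab[t] for t in tokens if t in vocab]


tokens = clean("Check THIS out: https://example.com/abc 123 times!!!")
vocab = build_vocabulary([tokens])
print(tokens, encode(tokens, vocab))
```

Note that this simple regex also strips apostrophes and non-ASCII letters, which may or may not be what you want for forum posts.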
All of these steps make up a quite classical NLP pipeline. I will not discuss them in detail, because this post could grow to a horrendous size and bore less technical readers. If you want to know more, on Medium you will find lots of fantastic articles describing natural language processing pipelines.
Data prepared this way was ready to be fed into a machine learning classifier. I decided to use a Support Vector Classifier with a linear kernel because, in addition to working well on text data, it let me obtain feature importances from the learned weights (this is not directly possible with other kernels).
I trained four separate classifiers, each of which was supposed to decide which side of one attribute pair could be assigned to the author: (I)ntroverts or (E)xtraverts, (J)udgers or (P)erceivers and so on.
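Splitting one four-letter type into four binary targets, one per classifier, could look like the sketch below. The `AXES` constant and `split_type` helper are hypothetical names, and encoding the first letter of each pair as 1 is an arbitrary choice made for illustration.

```python
# The four MBTI letter pairs, in the order they appear in the type code.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]


def split_type(mbti_type):
    """Turn a code like 'INFP' into four binary labels, one per axis."""
    mbti_type = mbti_type.upper()
    return {
        f"{first}/{second}": int(mbti_type[position] == first)
        for position, (first, second) in enumerate(AXES)
    }


targets = split_type("INFP")
print(targets)
```

Each of the four resulting columns then serves as the target of its own binary classifier.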
After the first training, the results were satisfying… Suspiciously satisfying. But, full of complacency, I decided to go on and try to answer which words are most significant for each trait. And here are the results.
Sigh… I forgot to remove the type indicators from the posts! We can see that the words most valuable to our classifier in deciding whether the author is an introvert (the highest bars in the blue part of the plot) were the personality type codes themselves. On the other side, the most valuable words used by extraverts (the highest negative bars in the red part) are the ones with ‘e’ in the four-letter code. That’s not fair! Let’s go back to our feature engineering process and remove these sneaky words.
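One way to strip the type codes is a regular expression built from all sixteen combinations. This is a sketch under my own assumptions: `remove_type_indicators` is a hypothetical helper, and treating plural forms such as "entjs" as leaks too is a choice made here for illustration.

```python
import re
from itertools import product

# All sixteen four-letter codes, lowercase: infj, infp, ..., estp.
TYPE_CODES = ["".join(letters) for letters in product("ie", "ns", "tf", "jp")]

# Match a whole-word code, optionally followed by a plural "s".
TYPE_RE = re.compile(r"\b(?:" + "|".join(TYPE_CODES) + r")s?\b")


def remove_type_indicators(text):
    """Lowercase the text and blank out every MBTI type code in it."""
    return TYPE_RE.sub(" ", text.lower())


cleaned = remove_type_indicators("As an INFP I get along with ENTJs")
print(cleaned.split())
```

The word boundaries keep the pattern from eating into longer words that merely contain one of the codes.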