Let’s start with the definition given by the institution that created the first WordNet, Princeton University. According to the website:
WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (called ‘synsets’), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
An example WordNet for the word ‘book’ is seen below:
With the definition and example, we can envision WordNet as something similar to a thesaurus, right? After all, it groups words together based on their meanings. However, this is only partly correct.
WordNet is more than just a thesaurus. It also provides connections, or interlinks, between word meanings (often called senses), and these links eventually form a network that helps disambiguate semantically similar words. By disambiguation, we mean identifying which meaning of a word is intended in a given context. Most words have multiple meanings depending on how they are used. For example, when we see the word light in a sentence, does it mean light as in to illuminate, light as in the brightness that helps us see, or light as in pale in color? These problems need to be addressed if we want more powerful, context-aware Natural Language Processing (NLP) systems such as chatbots and expert systems.
In addition, WordNets usually label the semantic relations between words. These labels resemble the ones you see in taxonomic relationships, but their form varies depending on how the WordNet is constructed. Synsets also contain glosses, which are brief definitions. According to the Princeton website, word forms with several distinct meanings are represented in as many distinct synsets. Thus, each form-meaning pair in WordNet is unique.
Given what WordNet is and how it is used, NLP and linguistics research on Filipino would benefit greatly from (a) having a WordNet for Filipino and (b) knowing how to use it effectively. Thanks to Sir Allan Borra, Dr. Adam Pease, Dr. Shirley Dita, and Dr. Rachel Roxas, an initial Filipino WordNet has been developed. The FilWordNet contains more than 10,000 synsets curated by the authors, and this article will show you how to automatically extract those synsets, glosses, and relationships using Python.
To start, download the materials from my GitHub repository: https://github.com/imperialite/FilWordNetExtractor
The materials contain the published FilWordNet paper, the FilWordNet files, and the Python notebook extractor. If you are going to make use of the files, please cite the paper.
Running the Python notebook should give you the first output. The words.xlsx file contains the wordid and the lemma of each entry in the FilWordNet dictionary. A lemma is the canonical, or simplest, form of a word. Overall, the dictionary contains 14,107 entry words.
The senses file will provide the following output. This data frame connects each word entry from the words.xlsx file to its corresponding synset using the synsetid value.
The synsets file will provide the following contents. It maps each synsetid to its corresponding gloss (definition). It also contains a pos column, which stands for part of speech.
The sumo column stands for Suggested Upper Merged Ontology. This column gives the SUMO term that the word maps to, based on existing fundamental concepts in SUMO, as seen in Figure 1 and Table 5 below.
In order to fully utilize the FilWordNet, we might need to merge all of them programmatically as it might be tedious to find synsets and glosses of words manually. Combining the three files will provide the following output:
As you can see, you can now easily match the definition, POS tag, and SUMO term for each lemma in the dictionary. It is normal for some words to have NaN, meaning no SUMO term, as the FilWordNet is one of the first efforts to establish a WordNet for the Filipino language. Aside from that, Filipino is morphologically richer than English, which means that many words (usually verbs) take multiple forms.
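The merge described above can be sketched with pandas. This is a minimal illustration and not the notebook's actual code: the column names (wordid, lemma, synsetid, pos, gloss, sumo) follow the files described earlier, but the rows, glosses, and SUMO values are invented placeholders, and in practice each table would be loaded from its file instead of typed in.

```python
import pandas as pd

# Tiny stand-ins for the three FilWordNet tables.
# All rows are made-up placeholders, not actual FilWordNet entries.
words = pd.DataFrame({
    "wordid": [1, 2, 3],
    "lemma": ["matino", "maayos", "aklat"],
})
senses = pd.DataFrame({
    "wordid": [1, 2, 3],
    "synsetid": [100, 100, 200],
})
synsets = pd.DataFrame({
    "synsetid": [100, 200],
    "pos": ["a", "n"],
    "gloss": ["placeholder gloss for synset 100",
              "placeholder gloss for synset 200"],
    "sumo": [None, "Book"],  # None mimics a synset with no SUMO term
})

# words -> senses on wordid, then -> synsets on synsetid.
# how="left" keeps every lemma even if a later table has no match.
merged = (
    words
    .merge(senses, on="wordid", how="left")
    .merge(synsets, on="synsetid", how="left")
)

print(merged[["lemma", "pos", "gloss", "sumo"]])
```

A left join is used here so that no dictionary entry is silently dropped; whether the notebook makes the same choice is an assumption.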
Now that we have a merged WordNet file, let’s try querying sample words and look at the output. Let’s try the word matino.
We see the one and only exact match for the word/lemma itself, along with its corresponding POS tag, definition, and SUMO term. However, this is only a partial result. We know that the word matino has several other meanings depending on context and use. Going through the notebook will give you an improved query result.
This table now shows us the synsets, or cognitive synonyms, of the word matino, with several definitions and POS tags depending on how it is used. This is the purpose of having a WordNet. You can try querying other terms in Filipino such as:
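To make the difference between the two kinds of query concrete, here is a small sketch. It assumes the merged table has lemma and synsetid columns, and it expands an exact match by pulling in every row that shares a synsetid with it. The helper names exact_match and synset_members are hypothetical, the rows are invented placeholders, and the notebook's own logic may differ.

```python
import pandas as pd

# Made-up rows in the shape of the merged table (not actual FilWordNet data).
merged = pd.DataFrame({
    "lemma":    ["matino", "maayos", "aklat"],
    "synsetid": [100, 100, 200],
    "pos":      ["a", "a", "n"],
    "gloss":    ["placeholder gloss A", "placeholder gloss A",
                 "placeholder gloss B"],
})


def exact_match(table, word):
    """Return only the rows whose lemma is exactly `word`."""
    return table[table["lemma"] == word]


def synset_members(table, word):
    """Return every row that shares a synset with an exact match of `word`."""
    ids = exact_match(table, word)["synsetid"].unique()
    return table[table["synsetid"].isin(ids)]


print(exact_match(merged, "matino"))     # the lemma's own row
print(synset_members(merged, "matino"))  # plus the other lemmas in its synset
```

Going through synsetid rather than the lemma string is what lets the query surface related entries that do not share the spelling of the queried word.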
We have now seen the purpose of the FilWordNet in terms of disambiguating words and their meanings. But a WordNet is not complete without visualizing the network itself. A portion of the notebook uses Matplotlib and NetworkX to visualize queries against the FilWordNet. Let’s see some examples:
In Figure 3, we see the generated WordNet for the queried word. The nodes contain the various synsets of the word and a portion of each definition. The full definition is not displayed, since most definitions are too long to show legibly in a network graph.
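A graph like the one described can be sketched with NetworkX as follows. This is an illustrative assumption about the approach, not the notebook's code: the function names build_word_graph and draw, the 30-character limit, and the sample rows are all hypothetical. The queried word becomes the central node, and each of its synsets becomes a neighbor labeled with the synsetid and a shortened gloss.

```python
import textwrap

import networkx as nx


def build_word_graph(rows, word, max_chars=30):
    """Link `word` to one node per synset, labeled with a shortened gloss.

    rows: iterable of (lemma, synsetid, gloss) tuples.
    """
    graph = nx.Graph()
    graph.add_node(word)
    for lemma, synsetid, gloss in rows:
        if lemma != word:
            continue
        # Long glosses are cut at a word boundary so labels stay readable.
        short = textwrap.shorten(gloss, width=max_chars, placeholder="...")
        graph.add_edge(word, f"{synsetid}: {short}")
    return graph


def draw(graph):
    """Render the graph; Matplotlib is imported lazily, only when drawing."""
    import matplotlib.pyplot as plt
    nx.draw(graph, with_labels=True, node_color="lightblue", font_size=8)
    plt.show()


# Made-up rows for illustration only (not actual FilWordNet glosses).
rows = [
    ("matino", 100, "placeholder gloss that is long enough to need truncating in the plot"),
    ("matino", 101, "another placeholder gloss"),
    ("aklat", 200, "placeholder gloss for a different word"),
]
graph = build_word_graph(rows, "matino")
print(sorted(graph.nodes))
# draw(graph)  # uncomment to open a Matplotlib window
```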
Here are some other examples:
That’s basically it for this tutorial. I want to give my sincerest gratitude to the authors of the Filipino WordNet, especially Dr. Rachel Roxas, for allowing me to open-source the resource files of the study. You can download the script from my GitHub page: https://github.com/imperialite/FilWordNetExtractor