Quinten De Neve, Liesbeth Allein, Victor Milewski and Marie-Francine Moens
Do not use BERT to order anything in a Belgian restaurant: it may not know some of your favorite dishes from the Belgian cuisine. These BERT models are trained on huge corpora containing millions of Dutch words; however, these corpora lack words from some standard varieties of Dutch and thus do not contain “waterzooi” or “videe”.
BERT models have a rich research history, ranging from their applications and performance to the implicit or explicit biases that lie within them. Datasets and models have often been analysed for bias: hate speech detection and racial bias mitigation in social media, the investigation and mitigation of gender bias, the mitigation of ethnic bias, and so forth. Even for Dutch, gender bias in word embeddings has been researched and mitigated. However, there is a clear research gap when it comes to Dutch and its standard varieties (Flemish, Surinamese, …) and the question whether any bias lies within BERT models when a standard variety is used rather than the shared Dutch standard. This thesis tries to shrink that gap.
Dutch is a pluricentric language, which means that it has multiple standard forms, including Flemish, Surinamese and the shared Dutch variant. This entails that there are significant differences between those variants in grammar, vocabulary and even culture. Certain words in specific variants have no counterpart in the other varieties because they are purely culture-based. Examples of such words include “Chiro”, a Belgian youth movement, and “Koningsdag”, the national holiday in the Netherlands. In the odd case, there are also words that exist in multiple varieties but carry different meanings in each of them. A prime example is the verb “lopen”, which translates to running/walking in Flemish and solely walking in Dutch.
More formally, the goal of this thesis is to examine whether state-of-the-art language models exhibit a bias towards particular standard varieties of the Dutch language. More specifically, BERT models are examined and tested on the Flemish and the shared Dutch variant.
As a first step, we look at the contextualized word embeddings of different Dutch and Flemish words in multiple contexts. Various metrics are then calculated to check whether these word embeddings exhibit a linguistic bias. Secondly, since masked language modelling and next sentence prediction are the tasks on which BERT is originally trained, it is crucial to investigate whether these tasks themselves induce a bias. BERT has no recollection of some Flemish words, so when these are masked out, the predictions can be expected to lie far from the original meaning of the sentence. Expanding on this, the models are also tested on the downstream task of sentiment analysis. As a last step, we try to find a way to mitigate this bias. The results of this thesis can be further used to reduce bias in language models for pluricentric languages and thus to improve the quality and performance of BERT models across languages.
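To make the first two steps concrete, the sketch below shows one possible way to probe a Dutch BERT model for variety-specific vocabulary: filling in a masked token in a restaurant sentence and comparing contextual embeddings of sentences that differ only in a variety-specific dish name. This is a minimal illustration, assuming the Hugging Face transformers library and the publicly available BERTje checkpoint GroNLP/bert-base-dutch-cased; the thesis itself may rely on other models, probes and metrics.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint: BERTje, a Dutch BERT model; swap in any Dutch BERT variant.
MODEL_NAME = "GroNLP/bert-base-dutch-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def top_mask_predictions(sentence: str, k: int = 5):
    """Return the k most likely fillers for the [MASK] token in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    top_ids = logits[0, mask_pos[0]].topk(k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

def cls_embedding(sentence: str) -> torch.Tensor:
    """Last-layer [CLS] embedding, used here as a crude sentence representation."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[-1]
    return hidden[0, 0]

# Masked prediction: does the model ever propose a Flemish dish name?
print(top_mask_predictions("Ik bestel graag een [MASK] in het restaurant."))

# Embedding comparison: how similar does the model consider two sentences
# that differ only in a variety-specific word ("waterzooi" vs. "stamppot")?
flemish = cls_embedding("We aten gisteren waterzooi.")
dutch = cls_embedding("We aten gisteren stamppot.")
print(torch.cosine_similarity(flemish, dutch, dim=0).item())

A systematic version of this probe would repeat the comparison over matched lists of Flemish and Netherlandic Dutch words in many contexts and aggregate the similarity and prediction scores into the bias metrics described above.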