All that Glitters Is Not Gold: Exploring the Use of Silver Data in Transfer-Learning for Dutch Offensive Language Detection

Dion Theodoridis and Tommaso Caselli

Online interaction can be pleasant; however, it also has a downside: the presence of socially unacceptable language. Given the volume of data currently produced, automatic solutions are required to support social platforms in ensuring a safe and inclusive environment (Nobata, 2016; Vidgen et al., 2019). In this contribution we focus on a specific phenomenon, offensive language, in Dutch tweets directed at Dutch politicians. In particular, we investigate the benefits of applying transfer learning to boost the development of a newly annotated dataset and of models for a new language.

Our starting point is a multilingual Twitter dataset, the Political Speech Project (PSP; Brockling et al., 2018). PSP contains messages from Twitter and Facebook directed at 320 politicians in 4 countries, collected between February and March 2018. Out of the 37,723 messages, only 809 are annotated as offensive. In a set of preliminary experiments on the PSP dataset, we found that by upsampling the offensive messages to obtain a balanced distribution between offensive and non-offensive messages, we achieved the best macro F1 score (0.774) using mBERT (Pires et al., 2019).
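The balancing step can be sketched as follows. This is a minimal illustration, not the original pipeline: the function name and variables are ours, and messages are represented as plain Python lists.

```python
import random

def upsample_minority(majority, minority, seed=0):
    """Randomly duplicate minority-class examples (sampling with
    replacement) until the two classes are balanced."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Toy illustration with the PSP class ratio (37,723 messages, 809 offensive):
not_offensive = ["msg"] * (37723 - 809)
offensive = ["off"] * 809
balanced = upsample_minority(not_offensive, offensive)
```

Duplicating minority examples (rather than downsampling the majority class) keeps all available training signal, which matters given how few offensive messages PSP contains.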

We then applied this model to a collection of 992,192 tweets in Dutch from March 2021, each containing at least one mention of a Dutch politician. To evaluate the performance of the PSP-trained model on Dutch, we manually corrected 1,500 messages (test set). The remainder of the automatically annotated data (silver data) was used to develop new language-specific models.

Using the silver data, we extensively experimented with different training settings, in particular: (i) concatenating the Dutch silver data with the PSP data; (ii) continuing to fine-tune the PSP-trained model with the Dutch silver data; and (iii) concatenating the Dutch silver data with manually annotated data from DALC (Caselli et al., 2021). In the last setting, the DALC data come from Twitter and are annotated for offensiveness, but they do not specifically target politicians.
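Schematically, the three settings differ only in how the training data are assembled before (further) fine-tuning. The sketch below is a hypothetical illustration: the helper and variable names are ours, and each dataset is represented as a list of (text, label) pairs.

```python
def build_training_set(setting, psp, silver, dalc=None):
    """Assemble the training data for the three settings:
    (i)   PSP gold data + Dutch silver data (fine-tune mBERT);
    (ii)  Dutch silver data only (continue fine-tuning the PSP model);
    (iii) Dutch silver data + DALC gold data (monolingual setting)."""
    if setting == "i":
        return psp + silver
    if setting == "ii":
        return silver
    if setting == "iii":
        return silver + dalc
    raise ValueError(f"unknown setting: {setting!r}")

# Toy illustration (labels and texts are made up):
psp = [("you are a disgrace", "OFF"), ("good speech today", "NOT")]
silver = [("schandalig optreden", "OFF")]
dalc = [("wat een onzin", "OFF")]
train_i = build_training_set("i", psp, silver)
```

Note that in setting (ii) only the silver data enter the new training step; the PSP knowledge is carried over through the weights of the already fine-tuned model rather than through the data.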

On the test set, we identified a lower bound macro F1 of 0.629 by directly applying the PSP-trained model (zero-shot setting), and an upper bound macro F1 of 0.737 by fine-tuning a monolingual pre-trained model (BERTje; de Vries et al., 2019) on the DALC training data. As for the contribution of the silver data, the results are disappointing in all training settings. In the first setting, concatenating the same amount of silver data as the PSP data, the newly fine-tuned mBERT model reaches a macro F1 of 0.619 (-0.01 compared to the zero-shot model); when we continue fine-tuning the initial PSP-trained model, mBERT's best result is 0.634, an increase of only 0.005. Finally, concatenating monolingual gold and silver data brings the macro F1 to 0.691, a drop of 0.046 points with respect to the upper bound.
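All scores above are macro-averaged F1 scores, i.e., the unweighted mean of the per-class F1 scores, so that the minority offensive class weighs as much as the majority class. A minimal reference implementation (label names are illustrative):

```python
def macro_f1(gold, pred, labels=("OFF", "NOT")):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    f1s = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy illustration: the model misses one of the two offensive messages.
gold = ["OFF", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)
```

With skewed label distributions such as ours, accuracy or micro-averaged F1 would be dominated by the non-offensive class, which is why macro F1 is the standard choice for offensive language detection.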

On a different note, we observed that transfer learning can speed up manual annotation, making it easier to create new datasets for different target phenomena. In our test sample of 1,500 messages, the model's predictions had to be corrected only 498 times (roughly one-third).