Ine Gevers, Ilia Markov and Walter Daelemans
The growing popularity of online communication platforms has sparked considerable interest in the automatic detection of socially unacceptable discourse (SUD). Linguistic analysis is important in this context because it improves our understanding of the problem, which could in turn lead to more robust SUD detection systems. In this paper, we compare the linguistic features of Dutch SUD with those of non-SUD. We address three research questions, which investigate differences in average length, lexical diversity, and linguistic standardness. The analysis was performed on the LiLaH dataset (https://lilah.eu), which contains over 36,000 SUD and non-SUD Facebook comments about the LGBT community and migrants [3].
Our results show that SUD comments differ from their non-SUD counterparts on all of these features. In terms of median length, SUD comments tend to be longer. Both the type-token ratio and the ratio of content words to function words are higher for non-SUD comments. Furthermore, the relative frequency of emojis is higher for non-SUD comments. Together, these observations indicate that lexical diversity is greater in non-SUD comments.
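To make the length and lexical diversity measures concrete, the following is a minimal sketch of how median comment length and the type-token ratio could be computed for a set of comments. It is not the pipeline used in the study: the regex tokenizer, the lowercasing step, and the function names `tokenize`, `type_token_ratio`, and `median_length` are assumptions made for illustration.

```python
import re
from statistics import median

# Assumption: a simple regex tokenizer that keeps runs of word characters.
# The study's actual tokenization is not specified here.
TOKEN_RE = re.compile(r"\w+")


def tokenize(comment):
    """Lowercase a comment and split it into word tokens."""
    return TOKEN_RE.findall(comment.lower())


def type_token_ratio(comments):
    """Number of unique tokens divided by total tokens, pooled over comments."""
    tokens = [tok for comment in comments for tok in tokenize(comment)]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def median_length(comments):
    """Median comment length, measured in tokens."""
    lengths = [len(tokenize(comment)) for comment in comments]
    return median(lengths) if lengths else 0
```

The type-token ratio is sensitive to the amount of text it is computed over, so a comparison between two groups of comments is only meaningful when the groups are of comparable size or the ratio is computed on equally sized samples.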
For linguistic standardness, we examined a selection of the features discussed in [2]. In both discourse types, the large majority of tokens are standard Dutch (95%), and only a small percentage (1%) are English or influenced by English. We then compared the non-standard tokens in SUD and non-SUD comments. First, the punctuation to non-punctuation ratio is slightly higher for non-SUD comments. Second, the rate of character flooding is higher for SUD comments. Similarly, SUD comments combine exclamation marks and question marks more often than non-SUD comments. Third, unconventional capitalization occurs more often in SUD. These observations are not surprising, since all of these features contribute to the expressiveness of a comment. Lastly, we found more instances of laughter in non-SUD comments.
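As an illustration of the expressiveness features above, the sketch below shows one way to flag character flooding, combined exclamation and question marks, and all-caps words in a comment. The operational definitions are assumptions and not those of the study or of [2]: flooding is taken to be three or more identical non-whitespace characters in a row, and unconventional capitalization is approximated by fully capitalized words of at least two letters. All function names are hypothetical.

```python
import re
import string

# Assumption: flooding = the same non-whitespace character 3+ times in a row.
FLOODING_RE = re.compile(r"(\S)\1{2,}")
# Runs of two or more exclamation and/or question marks.
MARK_RUN_RE = re.compile(r"[!?]{2,}")


def has_flooding(comment):
    """True if any character is repeated three or more times in a row."""
    return FLOODING_RE.search(comment) is not None


def has_mixed_marks(comment):
    """True if a single run of marks contains both '!' and '?'."""
    return any("!" in run and "?" in run for run in MARK_RUN_RE.findall(comment))


def has_all_caps_word(comment):
    """True if any word of two or more letters is fully capitalized."""
    words = (w.strip(string.punctuation) for w in comment.split())
    return any(len(w) >= 2 and w.isalpha() and w.isupper() for w in words)


def rate(comments, predicate):
    """Share of comments for which the predicate holds."""
    if not comments:
        return 0.0
    return sum(predicate(c) for c in comments) / len(comments)
```

Under these assumptions, calling `rate(sud_comments, has_flooding)` and `rate(non_sud_comments, has_flooding)` on two hypothetical lists of comments would give the per-group flooding rates that such a comparison relies on.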
Additionally, we can partially compare our results with those of a similar study on Slovene [1]. While the average comment length and the relative frequency of emojis are similar across the two languages, the type-token ratio and the ratio of content words to function words show opposite patterns. Furthermore, the number of unique emojis used differs significantly between the two languages. This suggests that the linguistic landscape of SUD shows commonalities but also certain differences across the two languages.
REFERENCES
[1] Kristina Pahor De Maiti, Darja Fišer, and Nikola Ljubešić. Nonstandard linguistic features of Slovene socially unacceptable discourse on Facebook. In D. Fišer and P. Smith, editors, The Dark Side of Digital Platforms: Linguistic Investigations of Socially Unacceptable Online Discourse Practices, pages 12–35, 2020.
[2] Lisa Hilte. The social in social media writing: the impact of age, gender and social class indicators on adolescents’ informal online writing practices. PhD thesis, University of Antwerp, 2019.
[3] Jens Lemmens, Ilia Markov, and Walter Daelemans. Improving hate speech type and target detection with hateful metaphor features. In Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda, pages 7–16, 2021.