Eda Saral and Raquel G. Alhama
After the World Health Organization declared COVID-19 a pandemic on March 11, 2020, concerns escalated about risks not only to the physical health but also to the mental health of individuals, as governments implemented restrictions, including home confinement, to reduce the spread of the virus. These restrictions changed daily activities such as eating habits, physical activity and sleep schedules, which ultimately created a risk of engaging in eating disorder (ED) behaviors, exacerbated by increased media exposure, isolation, fear of contagion, heightened anxiety and emotional distress.
Natural Language Processing techniques have been key to understanding the effects of COVID-19 on individuals with EDs who engage in relevant communities on Reddit. There are, however, some inconsistent findings for the same time period: some studies found that discussion of ED symptoms was less prevalent than in pre-pandemic posts (Nutley et al., 2021; JMIR Mental Health; Shields, 2022; J. Eat. Disord.), while others reported increased symptoms (Low et al., 2020; JMIR) and an increase in the prevalence of anxiety words, as well as a shift toward treatment-related topics (Feldhege et al., 2021; JMIR). In addition, all of this work focused on the first year of the pandemic; hence, the effects of the pandemic on ED behavior (as reflected in conversation in online ED communities) are unknown for the second year.
The goal of our ongoing study is to examine the above-mentioned discrepancies and to extend the analysis of conversation in ED communities on Reddit to a longer period of the COVID-19 pandemic (i.e., including the second year). To this end, we investigate conversation topics using three topic modeling approaches that differ in their relative strengths. Consistent with prior studies, the first model we focus on is Latent Dirichlet Allocation (LDA), a by now classic model that discovers topics by clustering distributional information. While useful for comparison with related work, we expect this model to exhibit shortcomings due to data sparsity in short Reddit posts. The second model we apply is BiTerm Topic Modeling (Cheng et al., 2014; IEEE Trans. Knowl. Data Eng.), which was developed precisely to deal with data sparsity in short texts by using aggregated word co-occurrence patterns to boost topic learning. Finally, we compare these models to Top2Vec (Angelov, 2020; arXiv), a recent model that can deal with short texts and also addresses some weaknesses of LDA by using jointly embedded topic and word vectors, without requiring stop word lists, stemming or lemmatization.
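To make the idea of aggregated word co-occurrence patterns concrete, the following is a minimal sketch of the biterm counting step only, under stated assumptions. It does not perform topic inference and is not the implementation used in this study. The function names, the naive whitespace tokenization and the toy posts are all hypothetical, chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs (biterms) from one short document."""
    # Sort each pair so that ("a", "b") and ("b", "a") count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Aggregate biterm counts over the whole corpus rather than per document."""
    counts = Counter()
    for doc in docs:
        # Assumption: naive lowercasing and whitespace tokenization.
        tokens = doc.lower().split()
        counts.update(extract_biterms(tokens))
    return counts


# Hypothetical toy posts, invented for illustration only.
toy_posts = [
    "anxiety about meals",
    "meals and anxiety",
    "treatment helps",
]
biterm_counts = corpus_biterm_counts(toy_posts)
```

Because the counts are pooled across all posts, a pair such as ("anxiety", "meals") accumulates evidence from several short texts, even though each individual post contains very few words.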
With our work, we expect to find (1) which of these models yields the most robust topic segregation over Reddit data for the ED community, and (2) what the main topics of discussion are across key time periods (pre-pandemic, first year of the pandemic and second year of the pandemic). Our results should shed light both on the effects of the pandemic on EDs and on the relative advantages of the compared topic models for the analysis of data from online communities on Reddit.