Martijn van der Klis
Using too many complex words potentially hinders readability. Readers should ideally know between 95% and 98% of the words in a text to adequately comprehend it (Laufer and Ravenhorst-Kalovski, 2010; Nation, 2006). On web pages, where visitors can come from virtually anywhere, this issue is amplified, as we do not know the individual user's reading level.
Moreover, on the web, resources are limited if we aim to process text client-side rather than server-side. Processing in the browser is necessary when the confidentiality of the data is at stake. However, such an approach means that we cannot call external parsers or rely on large neural networks. In this research, we therefore investigate which surface-level, word-based measures we can use to adequately classify complex words.
As our dataset, we use the NAACL 2018 Complex Word Identification shared task dataset (Yimam et al., 2017). This dataset is available in four languages: English, Spanish, German, and French. The dataset consists of target words (potentially multi-word expressions) within a sentence that human annotators (a mixture of native and non-native speakers) classified as either complex or non-complex (simple). An example sentence from the dataset is:
(1) _Syrian_ _troops_ *shelled* a *rebel-held* _town_ on _Monday_, *sparking* *intense* _clashes_ that sent *bloodied* *victims* *flooding* into _hospitals_ and _clinics_, *activists* said.
In (1), the *starred* words are labelled complex because at least one annotator marked them as such. The _underlined_ words were not marked as complex by any annotator and are consequently considered simple.
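To make this labelling rule concrete, the sketch below binarises per-annotator judgements exactly as described above. The function name and input format are illustrative; they do not reflect the shared task's actual file format.

```python
def binarise(judgements: list[bool]) -> str:
    """Map per-annotator judgements (True = marked complex) to a class label.

    A target word is "complex" as soon as at least one annotator marked it,
    and "simple" only when no annotator did.
    """
    return "complex" if any(judgements) else "simple"


# e.g., a word marked complex by one of ten annotators is labelled complex:
assert binarise([True] + [False] * 9) == "complex"
assert binarise([False] * 10) == "simple"
```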
In our research, we show that for English, a classifier based solely on character 3-gram probability reaches a competitive F1-score (.738). Presumably, such a classifier works well because n-grams can distinguish common morphological features from uncommon character patterns (e.g., in loan words). For German, Spanish, and French, applying a cut-off on word length and then checking whether longer words appear in the top-5000 words of a frequency list (based on OpenSubtitles2018) yields the best F1-scores (German: .755, Spanish: .734, French: .738).
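The two heuristics can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the padding scheme and the thresholds (the trigram log-probability cutoff, the word-length cutoff) are assumptions, and `top_5000` stands in for the top-5000 entries of the OpenSubtitles2018 frequency list.

```python
import math
from collections import Counter


def train_char_trigrams(corpus: list[str]) -> dict[str, float]:
    """Estimate character 3-gram log-probabilities from a word list."""
    counts: Counter[str] = Counter()
    for word in corpus:
        padded = f"##{word.lower()}#"  # pad so word edges also form trigrams
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: math.log(c / total) for gram, c in counts.items()}


def mean_trigram_logprob(word: str, model: dict[str, float],
                         unseen: float = -15.0) -> float:
    """Average log-probability of a word's character trigrams;
    uncommon patterns (e.g., in loan words) yield low scores."""
    padded = f"##{word.lower()}#"
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    return sum(model.get(g, unseen) for g in grams) / len(grams)


def is_complex_en(word: str, model: dict[str, float],
                  cutoff: float = -8.0) -> bool:
    """English: complex iff the trigram score falls below a cutoff."""
    return mean_trigram_logprob(word, model) < cutoff


def is_complex_freq(word: str, top_5000: set[str],
                    length_cutoff: int = 7) -> bool:
    """DE/ES/FR: short words are simple; longer words are complex
    unless they appear in the top-5000 of a frequency list."""
    return len(word) > length_cutoff and word.lower() not in top_5000
```

In practice, the cutoffs would be tuned on the training portion of the shared task data rather than set to the placeholder values above.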
In all languages, a logistic regression model that combines all these variables performs slightly better than the “simpler” models outlined above. The F1-score we reach for English (.786) is comparable to the winning approaches to the Shared Task (Yimam et al., 2018). The F1-scores for German (.762) and Spanish (.770) are comparable to those obtained by other cross-lingual models (Finnimore et al., 2019). Nevertheless, the state-of-the-art approach based on sequence modelling (Gooding and Kochmar, 2019) clearly outperforms our solution for English.
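A minimal sketch of the combined model, assuming scikit-learn and reusing `mean_trigram_logprob` from the previous sketch; the paper's exact feature set and hyperparameters are not specified here, so the three features below are simply the ones named in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def featurise(word: str, trigram_model: dict[str, float],
              top_5000: set[str]) -> list[float]:
    """Surface-level, word-based features; illustrative selection only."""
    return [
        float(len(word)),                           # word length
        mean_trigram_logprob(word, trigram_model),  # character 3-gram score
        float(word.lower() in top_5000),            # frequency-list membership
    ]


def train_combined(words: list[str], labels: list[int],
                   trigram_model: dict[str, float],
                   top_5000: set[str]) -> LogisticRegression:
    """Fit a logistic regression over the combined surface features."""
    X = np.array([featurise(w, trigram_model, top_5000) for w in words])
    return LogisticRegression().fit(X, np.array(labels))
```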
In future work, we aim to combine complex word detection with providing simpler alternatives, i.e., lexical simplification. Moreover, we intend to use these measures to score texts according to CEFR levels (A1, A2, B1, etc.).