Preliminary annotation experiments on Medieval Greek

Colin Swaelens, Ilse De Vos and Els Lefever

The Greek language has a written tradition of almost 3,000 years, a unique and incomparable resource in Europe for linguistic research. Both Ancient and Modern Greek have been extensively researched, whereas the 1,000 years in between have attracted far less interest from linguists. A case in point is the first comprehensive grammar of Medieval Greek, which was published only in 2019 (Holton et al. 2019).

Similarly, although several (annotated) corpora and natural language processing (NLP) approaches have been developed for both Ancient and Modern Greek, very few corpora of Medieval Greek exist. To our knowledge, the GREgORI corpus is the only one to be annotated, i.e. lemmatised (Kindt 2004). Since Kindt, the first NLP initiative for Medieval Greek was undertaken by Singh and colleagues, who developed both a BERT-based language model for Ancient Greek and a morphological analysis tool for Ancient as well as Medieval Greek (Singh et al. 2021).

However, this tool has not yet been applied to original, unedited Medieval Greek data. Precisely that type of data is stored in the Database of Byzantine Book Epigrams (DBBE, https://www.dbbe.ugent.be), an ongoing project at Ghent University that makes over 10,000 epigrams available. These epigrams are stored both in their original, unedited form as found in the manuscripts (so-called occurrences) and in a standardised version (so-called types). The original epigrams are full of orthographic mistakes and often contain lacunae as well, which makes automatic morphological analysis difficult. Despite these challenges, we will develop a pipeline to linguistically pre-process these Medieval Greek texts, i.e. to provide every token with a part-of-speech tag, a complete morphological analysis and a lemma.

In this paper, we report on the preliminary annotation experiments: a label set and annotation guidelines were drawn up and validated by means of an inter-annotator agreement experiment. The data set was pre-annotated with the morphological analysis tool of Singh and colleagues; the annotators then reviewed and corrected this output and supplied the correct lemma for every token. Three annotators carried out the experiment on 1,022 tokens, which yielded an agreement score of 92% as measured by Cohen’s kappa. As a next step, we will retrain the BERT language model of Singh and colleagues on the original epigrams from the DBBE, since their version was trained only on Ancient Greek texts. Once that is completed, we will have a solid basis to start developing the pre-processing pipeline for Medieval Greek.
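
An agreement score of this kind could be computed along the following lines. Cohen's kappa is defined for two annotators, and the text above does not state how it was aggregated over three; this minimal sketch assumes the mean over all annotator pairs, which is only one possible choice. The function names and the toy part-of-speech labels are illustrative and are not taken from the experiment.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of tokens that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, derived from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all pairs of annotators (an assumption)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Toy data: three hypothetical annotators tagging the same six tokens.
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ", "VERB", "NOUN"]
annotator_2 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "NOUN"]
annotator_3 = ["NOUN", "VERB", "NOUN", "ADJ", "VERB", "NOUN"]

score = mean_pairwise_kappa([annotator_1, annotator_2, annotator_3])
print(f"mean pairwise kappa: {score:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is lower than the plain percentage of matching labels whenever the label distribution is skewed. For more than two annotators, Fleiss' kappa is a common alternative to pairwise averaging.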