GaLAHaD: towards better linguistic annotation for historical Dutch – CLIN32

Tim Brouwer, Katrien Depuydt and Jesse de Does

Historical texts are essential source material for both linguistic and digital humanities research. Adding linguistic annotation to historical text corpora helps to make the data more accessible. Users need not be concerned with historical spelling variation, and can query or analyse the data using higher-level categories like part of speech and other grammatical properties.

Unfortunately, automatic linguistic annotation of historical Dutch in all its diversity remains a challenge. Work has been done in several projects, both national and international, but the results are fragmented, mutually incompatible, and far from providing a completely satisfactory solution.
We address this problem in the CLARIAH+ task Infrastructure for historical Dutch in three ways: by defining a tagset* applicable to all phases of historical Dutch, with mappings to the tagsets used in existing historical and modern corpora; by harmonizing and extending training and evaluation material; and by developing an online platform for historical corpus building and deployment. The platform consists of the Autosearch corpus exploration environment, the CoBaLT tool for manual correction of linguistic annotation, and the GaLAHaD (Generating Linguistic Annotations for Historical Dutch) application for the deployment and evaluation of various approaches to automatic linguistic annotation.

The current presentation focuses on the GaLAHaD application, which serves two purposes. One is to make annotation and tool evaluation easily accessible to researchers; the other is to make it easy for developers to contribute their tools and models to the platform, and thus to compare them with other tools using the gold standard material included in the platform.

GaLAHaD is designed to enable end users to choose the optimal path for their material. Apart from the basic task of uploading and annotating corpus material, GaLAHaD provides options to inspect and evaluate the result of the annotation process, in order to raise awareness of typical errors and biases in the tools. The functionality for comparing annotation layers enables users to assess the accuracy of different tools on their data. It can be used either to evaluate a layer added by an automatic tagger against a gold standard reference layer, or to compare layers added by different taggers. Disagreement between layers is not only summarized in global statistics, but also illustrated by examples that are immediately visible in the tool. The annotated material can be automatically uploaded to the Autosearch corpus exploration environment and to the CoBaLT tool for manual correction of linguistic annotation.
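
The layer comparison described above can be illustrated with a minimal Python sketch. It is not GaLAHaD's implementation: the function name compare_layers, the toy Middle Dutch tokens, and the tag labels are all invented for illustration, and the two layers are assumed to be aligned token by token. The sketch computes a global agreement score, counts each (reference tag, tagger tag) confusion, and keeps a few in-context examples per confusion.

```python
from collections import Counter


def compare_layers(tokens, hypothesis, reference, max_examples=3):
    """Compare two token-aligned annotation layers (hypothetical helper).

    Returns the proportion of agreeing tokens, a Counter of
    (reference_tag, hypothesis_tag) disagreements, and a dict mapping
    each disagreement pair to a few examples shown in context.
    """
    if not (len(tokens) == len(hypothesis) == len(reference)):
        raise ValueError("layers must be aligned token by token")

    confusions = Counter()
    examples = {}
    agree = 0
    for i, (hyp, ref) in enumerate(zip(hypothesis, reference)):
        if hyp == ref:
            agree += 1
            continue
        pair = (ref, hyp)
        confusions[pair] += 1
        bucket = examples.setdefault(pair, [])
        if len(bucket) < max_examples:
            # Show up to two tokens of context on either side.
            left = " ".join(tokens[max(0, i - 2):i])
            right = " ".join(tokens[i + 1:i + 3])
            bucket.append(f"{left} [{tokens[i]}] {right}".strip())

    agreement = agree / len(tokens) if tokens else 0.0
    return agreement, confusions, examples


# Toy data, invented for illustration only; the tags are placeholders,
# not TDN tags.
tokens = ["dat", "hi", "quam", "te", "sinen", "lande"]
reference = ["CONJ", "PRON", "VERB", "ADP", "PRON", "NOUN"]
hypothesis = ["PRON", "PRON", "VERB", "ADP", "DET", "NOUN"]

agreement, confusions, examples = compare_layers(tokens, hypothesis, reference)
print(f"agreement: {agreement:.2f}")
for (ref, hyp), count in confusions.most_common():
    print(f"{ref} -> {hyp}: {count}", examples[(ref, hyp)])
```

The same function covers both use cases mentioned above: passing a gold standard layer as reference yields an accuracy figure, while passing the output of a second tagger yields inter-tagger agreement.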

For tool developers, the Docker-based application architecture ensures easy contribution of tools to the platform. The application and taggers are hosted by the INT and accessible with any CLARIN account. There is also the option to self-host an instance using the publicly available Docker images from the INT Docker Hub or the open source code available on GitHub.

In its current form, GaLAHaD provides Frog with modern and historical models; the PIE framework developed by Enrique Manjavacas and Mike Kestemont, with models for Middle Dutch and early modern Dutch; the INT historical tagger; and the RNNTagger developed by Helmut Schmid.

*) Tagset voor Diachroon corpusmateriaal van het Nederlands (TDN). https://ivdnt.org/wpcontent/uploads/2021/05/TDN_INT_WP_1.pdf