Felix van Deelen, Sandro Pezzelle and Tom van den Broek
Current image captioning models perform well at describing generic images, but generate captions lacking specific information when applied in the context of news, where named entities are frequently used in captions. Several methods have been proposed to improve state-of-the-art image captioning models for news by applying entity-aware caption generation. However, most of these methods are applied to English and lack evaluation in other languages. In this work, we evaluate entity-aware image captioning models on a novel data set of Dutch news.
The data set we introduce contains 350,000 images with captions and their usage across a set of 400,000 articles from the Dutch Public Broadcasting Organisation (NOS). The captions were supplied by editorial teams at NOS, who are experts on the topics covered in the articles, resulting in high-quality, information-dense captions. In addition to image-caption pairs, the data set contains article text and is rich in metadata. The articles in the data set, which we henceforth refer to as NOS, cover a broad range of topics in news and sports for the period from 2015 to 2021.
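To make the structure of such a data set concrete, a single record could be represented roughly as follows. This is a minimal sketch: the field names are our own illustration and do not reflect the actual NOS schema.

```python
from dataclasses import dataclass

@dataclass
class NewsItem:
    """One image-caption pair with its article context (illustrative schema)."""
    image_path: str        # path to the image file
    caption: str           # editorial caption supplied by NOS
    article_ids: list[str] # identifiers of articles in which the image is used
    article_text: str      # full text of the associated article
    category: str          # metadata, e.g. news or sport section
    published: str         # publication date (2015-2021)
```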
We first analyse the properties of NOS by comparing it to (i) a standard image captioning data set (COCO) and (ii) a data set of news in English (GoodNews). Our analysis shows that NOS and GoodNews captions are more similar to each other than they are to COCO captions. In particular, the majority of NOS and GoodNews captions contain named entities, while the opposite holds for COCO. This suggests that a standard image captioning model might not be suitable for NOS and that, as with GoodNews, NOS may benefit from entity-aware caption generation.
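The NER tooling behind this analysis is not specified here; as an illustration, the fraction of captions containing at least one named entity could be estimated with spaCy's Dutch pipeline (the model name nl_core_news_sm is an assumed choice).

```python
import spacy

# Dutch pipeline with an NER component; the model must be downloaded first,
# e.g. `python -m spacy download nl_core_news_sm`.
nlp_nl = spacy.load("nl_core_news_sm")

def fraction_with_entities(captions: list[str]) -> float:
    """Return the fraction of captions containing at least one named entity."""
    n_with_ents = sum(1 for doc in nlp_nl.pipe(captions) if doc.ents)
    return n_with_ents / len(captions)

# Usage: compare corpora (caption lists would be loaded from NOS, GoodNews, COCO).
# print(fraction_with_entities(nos_captions))
```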
We apply several methods for image caption generation to the data set. We initially experiment with (1) a standard state-of-the-art image captioning model based on the bottom-up top-down attention architecture, which only leverages information from the image. We then use (2) an extension of this model that incorporates information from the article text associated with the image, based on the hypothesis that this introduces contextual information into the caption that is not necessarily present in the image. Finally, we experiment with (3) an extension of the model in which generated nouns and noun phrases are mapped to named entities using a corpus-specific module, which we hypothesize will yield captions containing the names of entities present in the image; a rough sketch of this entity-mapping step follows.
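The corpus-specific module itself is not detailed here. Purely as an illustration of the idea, a post-processing step could swap generic nouns in a draft caption for named entities found in the associated article text; the noun-to-label mapping below is hypothetical, and entity label names depend on the spaCy model used.

```python
import spacy

nlp = spacy.load("nl_core_news_sm")  # Dutch pipeline with tagger and NER

# Hypothetical mapping from generic Dutch nouns to entity labels; the actual
# corpus-specific module is not described in this abstract.
CUE_TO_LABEL = {"man": "PERSON", "vrouw": "PERSON", "speler": "PERSON", "stad": "GPE"}

def insert_entities(draft_caption: str, article_text: str) -> str:
    """Replace generic nouns in a draft caption with named entities taken
    from the associated article text (illustrative only)."""
    article_ents = nlp(article_text).ents
    caption = draft_caption
    for token in nlp(draft_caption):
        label = CUE_TO_LABEL.get(token.text.lower())
        if label is None:
            continue
        # Pick the first article entity of the matching type, if any.
        match = next((e.text for e in article_ents if e.label_ == label), None)
        if match:
            caption = caption.replace(token.text, match, 1)
    return caption
```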
The models are trained on NOS as well as on translated captions from the COCO data set. A quantitative evaluation is performed using standard image captioning metrics as well as an F1 score which assigns more weight to named entities. A qualitative evaluation is performed by members of the editorial team at NOS.
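The exact formulation of the entity-weighted F1 is not given in this abstract. One plausible reading, computing F1 over the sets of named entities in generated versus reference captions, could look like the sketch below (spaCy and its Dutch model are again assumed; treating entity-less pairs as 0.0 is a simplifying choice).

```python
import spacy

nlp = spacy.load("nl_core_news_sm")

def entity_f1(generated: str, reference: str) -> float:
    """F1 over named entities in a generated vs. a reference caption
    (illustrative; the paper's exact formulation may differ)."""
    gen_ents = {e.text.lower() for e in nlp(generated).ents}
    ref_ents = {e.text.lower() for e in nlp(reference).ents}
    if not gen_ents or not ref_ents:
        return 0.0
    tp = len(gen_ents & ref_ents)  # entities present in both captions
    if tp == 0:
        return 0.0
    precision = tp / len(gen_ents)
    recall = tp / len(ref_ents)
    return 2 * precision * recall / (precision + recall)
```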
Our hypothesis is that the entity-aware image captioning model will outperform the standard image captioning model when applied to our data set. The project is currently ongoing and we only have preliminary results; however, we expect to present all results at CLIN32.