Vincent Vandeghinste, Bob Van Dyck, Mathieu De Coster and Maud Goddefroy
We present the Belgian Federal COVID-19 corpus, nicknamed the COV19.be corpus. It consists of the entire archive of official press conferences of the Belgian Federal Government concerning the COVID-19 pandemic: 220 video files in mp4 format, as downloaded from the website https://news.belgium.be/nl/corona. The videos contain press conferences by, amongst others, the Prime Minister, the Consultation Committee (Overlegcomité), the Federal Government Service on Public Health, the National Crisis Centre and the National Security Council. The speakers speak mostly Dutch or French and occasionally German, and nearly all speech is accompanied by a deaf signer who provides live sign language interpretation of what is being said. Based on the language identification and ASR output, we estimate the size of the corpus at more than 620,000 tokens of Dutch speech.
The following automatic annotations are being added: sign language feature extraction, language identification, speech transcription for the Dutch speech, and speaker diarisation and segmentation.
Sign language feature extraction is performed using MediaPipe Holistic: we extract 75 3D keypoints per frame in every video. These keypoints correspond to the upper body and hands.
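For illustration, keypoints of this kind can be extracted with the MediaPipe Holistic Python API roughly as sketched below; the file names and the exact selection of 33 pose and 2 × 21 hand landmarks (75 in total) are assumptions made for the sketch, not necessarily the extraction script used for the corpus.

import cv2
import numpy as np
import mediapipe as mp

holistic = mp.solutions.holistic.Holistic(static_image_mode=False)

def landmarks_to_array(landmark_list, n_points):
    # Return an (n_points, 3) array of x, y, z coordinates; zeros if nothing was detected.
    if landmark_list is None:
        return np.zeros((n_points, 3), dtype=np.float32)
    return np.array([[lm.x, lm.y, lm.z] for lm in landmark_list.landmark], dtype=np.float32)

cap = cv2.VideoCapture("press_conference.mp4")  # hypothetical input file
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    # 33 pose landmarks + 21 landmarks per hand = 75 keypoints per frame
    keypoints = np.concatenate([
        landmarks_to_array(results.pose_landmarks, 33),
        landmarks_to_array(results.left_hand_landmarks, 21),
        landmarks_to_array(results.right_hand_landmarks, 21),
    ])
    frames.append(keypoints)
cap.release()
np.savez_compressed("press_conference_keypoints.npz", keypoints=np.stack(frames))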
As we are not aware of any free and publicly available tool for spoken language identification, we passed the entire speech through a Belgian Dutch ASR system (Van Dyck et al., 2021), which provides us with raw transcripts. To determine speaker changes, we applied speaker diarisation (Bredin et al., 2019), which is available as a CLARIN service from the Bavarian Archive for Speech Signals.
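For readers who want to reproduce this step locally, a comparable diarisation run can be sketched with the pyannote.audio toolkit (Bredin et al., 2019); the pretrained pipeline name, the audio file name and the access details are illustrative assumptions rather than the exact configuration of the CLARIN service.

from pyannote.audio import Pipeline

# Illustrative pretrained pipeline; the CLARIN service at BAS exposes comparable functionality.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

diarization = pipeline("press_conference.wav")  # hypothetical audio track extracted from the mp4
for segment, _, speaker in diarization.itertracks(yield_label=True):
    # Each track is a time span with an anonymous speaker label (SPEAKER_00, SPEAKER_01, ...).
    print(f"{segment.start:.1f}s-{segment.end:.1f}s\t{speaker}")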
On the ASR output we applied language identification, using information from speech interruptions and the confidence metrics of the ASR system. Speaker diarisation could not be used as a cue for language identification, as language switches often occur within a single speaker turn.
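The underlying intuition is that the Dutch ASR system yields low confidence on French or German stretches, so thresholding a smoothed confidence signal separates Dutch from non-Dutch material. The following sketch illustrates that idea only; the threshold, window size and representation of the ASR word hypotheses are assumptions, not the parameters of the actual system.

import numpy as np

def label_dutch_segments(words, threshold=0.6, window=10):
    # `words` is assumed to be a list of (word, confidence) pairs from the Dutch ASR system;
    # a low smoothed confidence suggests speech in another language.
    confidences = np.array([conf for _, conf in words], dtype=float)
    # Moving average to smooth out isolated low-confidence Dutch words.
    smoothed = np.convolve(confidences, np.ones(window) / window, mode="same")
    return [(word, "nl" if score >= threshold else "other")
            for (word, _), score in zip(words, smoothed)]

# Hypothetical example: Dutch words score high, a French stretch scores low.
hypothesis = [("goedemiddag", 0.95), ("dames", 0.92), ("en", 0.90), ("heren", 0.93),
              ("mesdames", 0.21), ("et", 0.18), ("messieurs", 0.25)]
print(label_dutch_segments(hypothesis, window=3))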
As the videos contain two sign languages (Flemish Sign Language, or VGT, and French Belgian Sign Language, or LSFB), we first detect which parts of the videos contain VGT signing. We manually labeled the interpreters as “VGT” or “LSFB” and then processed the videos to automatically label video frames as belonging to either language. We compute neural embeddings for the faces of the interpreters using FaceNet (Schroff et al., 2015) and detect a switch between two people whenever the simple moving average of the Euclidean distance between face embeddings exceeds a threshold. A new video segment is created at every switch. These segments are then manually labeled as VGT or LSFB, and consecutive segments with the same label are merged.
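The switch detection itself amounts to thresholding a smoothed distance signal. A minimal sketch over precomputed per-frame FaceNet embeddings is given below; the window size, the threshold and the file name are assumptions chosen for illustration.

import numpy as np

def detect_switches(embeddings, window=25, threshold=0.8):
    # `embeddings` is an (n_frames, d) array of per-frame FaceNet face embeddings.
    # A switch is flagged where the simple moving average of the frame-to-frame
    # Euclidean distance exceeds the threshold.
    distances = np.linalg.norm(np.diff(embeddings, axis=0), axis=1)
    smoothed = np.convolve(distances, np.ones(window) / window, mode="same")
    above = smoothed > threshold
    # Report only the first frame of each run above the threshold.
    return [i + 1 for i in range(len(above)) if above[i] and (i == 0 or not above[i - 1])]

embeddings = np.load("face_embeddings.npy")  # assumed precomputed per-frame embeddings
boundaries = detect_switches(embeddings)
segments = np.split(np.arange(len(embeddings)), boundaries)  # one array of frame indices per segment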
The automated audio annotations are available as ELAN files (Wittenburg et al., 2006), and the sign language features as compressed binary NumPy files.
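The NumPy archives can be read back with np.load; the file name and array key below follow the earlier sketch and are assumptions about the eventual distribution format.

import numpy as np

data = np.load("press_conference_keypoints.npz")  # hypothetical file name
keypoints = data["keypoints"]  # assumed key; shape (n_frames, 75, 3)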
We intend to make this corpus and its annotations available for download through the CLARIN infrastructure at the Dutch Language Institute, and will investigate whether online access and search can be made possible.