Kristiyan Hristov and Dimitar Shterionov
Access to information is one of the key human rights in the modern world. However, information is most often distributed via a spoken language (either in textual or in audio format). This way of disseminating information is often limiting for deaf people. Deaf people communicate in sign languages (SLs), which are fully-fledged languages in the visual-gestural modality. In the Netherlands, the primary SL is the Sign Language of the Netherlands, or Nederlandse Gebarentaal (NGT).
Automatic sign-to-text translation is a challenging task. First, there are limited data resources available. Second, data are often collected in the scope of different projects and do not always follow the same formatting or annotation protocols. Third, SLs convey meaning through both manual and non-manual features (hand and body movements as well as facial expressions), and through the space around the signer, making SL recognition a computational challenge. In addition, SLs have no officially accepted written form, which makes SL transcriptions inconsistent. In this work we explore the power of deep learning (DL) to learn multi-dimensional features without their explicit definition. We build end-to-end neural machine translation models to translate NGT utterances (in video format) to text.
We build and compare several models based on the encoder-decoder architecture, where the encoder reads in videos and the decoder generates text. In the encoder of the first model, we pass the video frames one by one to a CNN module (VGG11), which extracts the spatial features of the input. We then pass the sequence of extracted spatial features to an RNN module, which captures the temporal dependencies between them. We use a simple RNN decoder (with attention) to translate the spatiotemporal information extracted by the encoder into our targets (sentences in Dutch).
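A minimal sketch of this first architecture, assuming PyTorch. The small convolutional stack stands in for the VGG11 feature extractor, and all dimensions (feature size, hidden size, vocabulary) are illustrative choices, not the ones used in the paper:

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Per-frame CNN spatial features, then an RNN over the sequence."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        # Stand-in for VGG11: one conv layer + pooling + projection.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        # Run the CNN on every frame, then restore the time dimension.
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        return self.rnn(feats)                 # (B, T, hidden), state

class AttnDecoder(nn.Module):
    """RNN decoder with additive attention over the encoder outputs."""
    def __init__(self, vocab, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.attn = nn.Linear(hidden * 2, 1)
        self.rnn = nn.GRU(hidden * 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def step(self, tok, state, enc_out):
        e = self.emb(tok).unsqueeze(1)                    # (B, 1, H)
        # Score each encoder timestep against the current decoder state.
        q = state[-1].unsqueeze(1).expand(-1, enc_out.size(1), -1)
        scores = self.attn(torch.cat([q, enc_out], -1))   # (B, T, 1)
        ctx = (scores.softmax(1) * enc_out).sum(1, keepdim=True)
        o, state = self.rnn(torch.cat([e, ctx], -1), state)
        return self.out(o.squeeze(1)), state              # (B, vocab)
```

A forward pass over a dummy clip of 5 frames would produce one sequence of encoder states and, per decoder step, a distribution over the Dutch target vocabulary.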
For our second model we extract the body key points from the inputs using a body pose estimator such as MediaPipe or AlphaPose. This module replaces the CNN feature extractor of the first model. These key points are then passed to the same RNN encoder in order to capture the temporal connections, and the output is translated into text using the decoder from the first model. That is, for this second model, we change only how the spatial features are extracted from the inputs, keeping the rest the same. We hypothesise that a body pose estimator acts as a filter, removing noise and unnecessary information from the original videos, which aids the encoding and, consequently, the decoding (i.e. the translation). By comparing the two models, we aim to establish which approach is better suited for NGT-to-text translation.
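To illustrate how pose estimates could replace CNN features, here is a hypothetical NumPy sketch that flattens per-frame key points into feature vectors for the RNN encoder. The key-point count of 75 assumes MediaPipe Holistic's 33 pose plus 2x21 hand landmarks; the centring and scaling step is one plausible way to remove the signer's position and size from the representation:

```python
import numpy as np

def keypoints_to_features(frames_kp):
    """Flatten per-frame body key points into RNN-ready feature vectors.

    frames_kp: a list of (K, 2) arrays of (x, y) coordinates, one per
    video frame (K = 75 if using MediaPipe Holistic pose + hands).
    Returns a (T, 2K) array that can stand in for the CNN features
    fed to the RNN encoder of the first model.
    """
    feats = []
    for kp in frames_kp:
        kp = np.asarray(kp, dtype=np.float32)
        # Centre on the mean key point and scale by the largest radius,
        # so the features are invariant to where/how large the signer is.
        centred = kp - kp.mean(axis=0)
        scale = np.linalg.norm(centred, axis=1).max() or 1.0
        feats.append((centred / scale).reshape(-1))
    return np.stack(feats)
```

Compared with raw pixels, each frame shrinks from H x W x 3 values to 2K numbers, which is the sense in which the pose estimator acts as a filter.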