Mirella De Sisto and Dimitar Shterionov
One of the challenges of NLP for Sign Languages (SLs) is the difficulty of segmenting the sign stream into individual signs (Yin et al., 2021). In this paper, we propose a novel approach to sign segmentation by exploiting the time-stamps of glosses. We apply this method to the annotations of the Corpus Nederlandse Gebarentaal (Corpus NGT) (Crasborn et al., 2020).
Despite the fact that signs and words play a similar grammatical role (Brennan, 1992; Leeson and Saeed, 2012), and are situated at an equivalent level of organisation in the language (Zeshan, 2007), they are not fully equivalent: signs are organised in a more simultaneous way than words (Stokoe, 1960), and co-articulation blurs their boundaries within a continuous stream (Yin et al., 2021; De Sisto et al., 2021). In addition, the lexicon of SLs is partially composed of productive signs, which are strongly context-dependent and do not have a conventionalised form (Johnston and Schembri, 1999; Vermeerbergen and Van Herreweghe, 2018; Belissen, 2020). These properties of signs pose significant obstacles to sign identification and to the detection of sign boundaries within a continuous stream (De Sisto et al., 2021).
Machine translation (MT) is a sequence-to-sequence task. In the context of MT, SL videos are segmented into sequences of still frames (De Coster et al., 2022), turning sign-to-spoken MT into a frame-to-token task. However, frame-level encoding raises certain issues: (i) frames are segments of very high granularity, leading to very long sequences, and (ii) they carry neither syntactic nor semantic information that can be exploited during translation. To facilitate MT from sign to spoken languages, a more sophisticated and linguistically motivated segmentation is needed. While segmentation approaches for SL have been previously proposed (Khan, 2014), there are no clear criteria or guidelines that can be easily employed in an automated translation system (De Sisto et al., 2021).
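Issue (i) can be made concrete with a back-of-the-envelope calculation; the frame rate and utterance duration below are illustrative assumptions, not values from the Corpus NGT.

```python
# Illustrative sketch: why frame-level encoding yields very long source
# sequences compared to the target token sequence. All values are assumed
# for illustration only.

FPS = 25             # assumed video frame rate
utterance_sec = 10   # assumed utterance duration in seconds
n_tokens = 15        # rough length of a spoken-language translation

n_frames = FPS * utterance_sec  # source length at frame granularity

print(n_frames)             # source sequence length in frames
print(n_frames / n_tokens)  # frames per target token
```

Even for a short utterance, the source sequence is over an order of magnitude longer than the target, which motivates segmenting the stream into linguistically meaningful units before translation.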
In SL corpora, such as the Corpus NGT, videos are typically annotated with glosses and text. Glosses in SL corpora constitute lexical representations of the articulated signs. Since, to date, no standard SL transcription system is in widespread use, glosses are the major form of annotating SL data and the main resource for NLP. The text is a translation of the SL utterance into a spoken language; glosses capture the semantics of signs and are aligned with the signs they correspond to. As such, glosses convey information not only about the meaning of signs but also about their duration, and we hypothesise that they can be used as markers to build automatic models for sign segmentation. To test this hypothesis, we propose deep learning encoder-decoder models for sign language segmentation. We consider two approaches: (i) sign-to-gloss models, in which the segmentation is driven only by glosses, and (ii) sign-to-(gloss-and-text) models, where segmentation is driven jointly by glosses and text. We test our approach on the Corpus NGT.
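The alignment between glosses and signs can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a gloss tier that provides `(gloss, start_ms, end_ms)` triples and a fixed frame rate, and derives per-frame BIO-style segmentation labels of the kind a sign-to-gloss model could be trained to predict.

```python
# Minimal sketch (assumptions: gloss annotations come as
# (gloss, start_ms, end_ms) triples; video runs at 25 fps).
# Gloss time-stamps are mapped to per-frame labels:
# B = first frame of a sign, I = inside a sign, O = outside any sign.

FPS = 25
MS_PER_FRAME = 1000 / FPS  # 40 ms per frame at 25 fps

def glosses_to_frame_labels(glosses, n_frames):
    """Derive a BIO label for each video frame from gloss time-stamps."""
    labels = ["O"] * n_frames
    for _, start_ms, end_ms in glosses:
        first = int(start_ms // MS_PER_FRAME)
        last = int(end_ms // MS_PER_FRAME)
        for f in range(first, min(last + 1, n_frames)):
            labels[f] = "B" if f == first else "I"
    return labels

# Hypothetical annotation fragment: two signs with their time-stamps.
glosses = [("HOUSE", 0, 240), ("BIG", 400, 680)]
print(glosses_to_frame_labels(glosses, 20))
```

In this framing, the gloss tier supplies segmentation supervision for free: every annotated gloss marks both the identity and the temporal extent of a sign, so no separate boundary annotation is needed.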