Martha Larson, Javier Martínez Rodríguez, Mark Wijkhuizen and Onno Crasborn
Computer vision has made impressive progress in the area of sign language. Signs can be recognized in video recordings of signers, and it is possible to locate the positions at which a given sign recurs in a long stretch of video. We argue that in order for computer vision to reach its full potential in the area of sign language, it is necessary to formulate conceptualizations of similarity. We define conceptualizations of similarity as the conditions that hold when a signer watches two videos and judges them to be similar to each other in a way that is relevant for producing, understanding, or learning sign language. The judgment might hold for phonemes (sub-signs), signs, sentences, or entire stretches of discourse. Using carefully thought-out conceptualizations of similarity will help to move research on computer vision for sign language away from treating sign language as generic motion or human action and toward the aspects that make sign language distinctive.
The talk will first discuss the conceptualizations that are used in computer vision research, then define conceptualizations of similarity and explain why they are important for bridging computer vision and sign language. It will then describe two cases in which computer vision is applied to Dutch Sign Language (NGT) and which illustrate why the conceptualization of similarity is important: sign spotting and a visual dictionary for sign language learning. Both cases focus on similarity at the sub-sign and sign levels.
Sign spotting takes a video of a sign as input and returns the positions at which similar signs occur within a longer stretch of video. In computer vision terms, sign spotting is a video-to-video matching task or a content-based video retrieval task. In the study we present here, we use conceptualizations of similarity in order to gain insight into the reasons for a match. Specifically, we are interested in whether humans and computers identify matches for the same reasons. Our study thereby gives us insight into the larger question, “Does computer vision learn the linguistics of sign language?”
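To make the matching formulation concrete, the sketch below frames sign spotting as nearest-neighbour search in an embedding space: the query clip and overlapping windows of the continuous video are assumed to have already been mapped to vectors by some video encoder, and windows whose vectors lie close enough to the query are reported as spotted positions. This is only an illustration of the task formulation; the encoder, the windowing scheme, and the threshold are hypothetical placeholders, not the pipeline used in our study.

import numpy as np

def spot_sign(query_emb, window_embs, threshold=0.8):
    """Sketch of sign spotting as embedding matching.

    query_emb   : (d,)   vector for the isolated query sign
                  (assumed to come from some video encoder)
    window_embs : (n, d) vectors for n overlapping windows of
                  the continuous signing video
    Returns the indices of windows judged similar to the query.
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    w = window_embs / np.linalg.norm(window_embs, axis=1, keepdims=True)
    scores = w @ q  # one similarity score per window
    # The threshold is a hypothetical tuning parameter.
    return np.flatnonzero(scores >= threshold).tolist()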
A visual sign language dictionary takes a video of a sign as input and returns a translation. In the study we present here, the visual dictionary is based on an investigation of how NGT learners perceive similarities between signs. The resulting conceptualization of similarity is used to implement the visual sign language dictionary. The visual dictionary outputs a list of possible matches, which provide the definition of the sign. The dictionary also attempts to leverage the conceptualization of similarity in order to output matches that are confusable signs. In this way, the learner can understand not only the target sign, but also other “false friends” that should not be mistaken for the target sign.
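The dictionary lookup can likewise be sketched as ranking in an embedding space that is assumed to reflect learner-perceived similarity: the nearest entry is the best guess for the target sign, and the next-nearest entries double as the confusable “false friends” shown to the learner. All names and parameters below are hypothetical placeholders, not our implementation.

import numpy as np

def lookup(query_emb, dict_embs, glosses, k=5):
    """Sketch of a visual dictionary lookup.

    dict_embs : (n, d) one vector per dictionary entry, assumed
                to encode learner-perceived similarity of signs
    glosses   : the n sign glosses (linked to their definitions)
    Returns the k best matches; entries after the first serve as
    the confusable signs ("false friends") for the top match.
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = dict_embs / np.linalg.norm(dict_embs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity to every dictionary entry
    top = np.argsort(-scores)[:k]  # highest similarity first
    return [(glosses[i], float(scores[i])) for i in top]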
We close with an outlook on conceptualizations of similarity beyond the sign level and a reflection on why our work can be considered a concrete direction of research that will help to ensure that computer vision treats sign language as a language, rather than as a generic example of human action.