Design principles of an Automatic Speech Recognition functionality in a user-centric signed and spoken language translation system

Aditya Parikh, Louis ten Bosch, Henk van den Heuvel and Cristian Tejedor-García

The European project SignON aims at designing a user-oriented and community-driven platform for communication among deaf, hard of hearing, and hearing individuals in both sign and spoken languages (i.e., English, Dutch, Spanish, and Irish). Inclusion, easy access to translation services, and the use of state-of-the-art Artificial Intelligence (AI) are the key aspects of the platform design. Users can communicate with the system through text, speech, and sign language (via video), while the system can respond with, for instance, translated text output, subtitles, translated audio via speech synthesis, and a 3D avatar.

In this framework, designing a flexible, user-friendly component for Automatic Speech Recognition (ASR) is a challenge, due to the constraints the platform imposes in terms of usability and the use of system-external services. The project addresses the current state of the art and outstanding questions for language technology in Western Europe.

Our presentation will discuss the conceptual choices underlying the design, operation, and integration of the ASR component. The ASR component will convert received audio (oral messages) into text in the source language using state-of-the-art ASR technology, tuned to the use cases at hand and to the speech articulated by the speaker (including atypical speech from deaf speakers and speakers with cochlear implants). The ASR component (i) addresses privacy challenges through secure handling of speech data, (ii) fits the communicative setting by adapting to communication channel characteristics, and (iii) is readily extensible to new languages and data. We will discuss the data selection procedure for building the acoustic models, lexicons, and language models of the ASR component. One of the languages we deal with is Irish. Since Irish is an under-resourced language, we propose a transfer-learning approach: acoustic models will be borrowed from other languages, whereas the language model will be compiled from existing texts.
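As a minimal sketch of what such a transfer-learning setup could look like (the abstract does not prescribe a toolkit; the library, model name, vocabulary file, and freezing strategy below are illustrative assumptions, not the project's actual pipeline), a multilingual pretrained acoustic model could be adapted to Irish by attaching a new output layer over the Irish grapheme inventory and fine-tuning on the available Irish speech:

    # Hypothetical transfer-learning sketch: adapt a multilingual acoustic model to Irish.
    # Model checkpoint and vocabulary file are illustrative assumptions.
    from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

    # Tokenizer built from an (assumed) Irish grapheme vocabulary file.
    tokenizer = Wav2Vec2CTCTokenizer(
        "irish_vocab.json", unk_token="[UNK]", pad_token="[PAD]"
    )

    # Load a model pretrained on many languages; its acoustic representations
    # are "borrowed", while the CTC output head is newly initialized for Irish.
    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        vocab_size=len(tokenizer),
        ctc_loss_reduction="mean",
        pad_token_id=tokenizer.pad_token_id,
    )

    # Freeze the convolutional feature encoder: low-level acoustics transfer
    # well across languages, so only the transformer layers and the new CTC
    # head are fine-tuned on the limited Irish data.
    model.freeze_feature_encoder()

The design rationale is the one stated above: the scarce Irish audio is spent on adapting the upper layers only, while language-independent acoustic representations are reused from better-resourced languages.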

We will briefly outline how the ASR component runs in the cloud on an external server as a dedicated web service, as well as the message passing between the SignON client and this ASR web service. This message passing is done via a secure and extensible RESTful Application Programming Interface (API). For each input audio file, the ASR produces output that serves as the input for Natural Language Processing (NLP). In particular, the output contains the recognized words, backchannel information, segmentation details, and a confidence measure for each word.
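To illustrate the message passing (the endpoint URL, authentication scheme, and JSON field names below are hypothetical; the abstract specifies only a secure RESTful API whose per-audio-file output carries words, backchannel information, segmentation, and word-level confidences), a client could interact with the service along these lines:

    # Hypothetical client sketch: endpoint and field names are illustrative
    # assumptions, not the actual SignON API specification.
    import requests

    def transcribe(audio_path: str, token: str) -> dict:
        """Send one audio file to the ASR web service and return its JSON output."""
        with open(audio_path, "rb") as f:
            response = requests.post(
                "https://asr.example.org/v1/recognize",        # placeholder endpoint
                headers={"Authorization": f"Bearer {token}"},  # secure access
                files={"audio": f},
            )
        response.raise_for_status()
        # Expected response shape (illustrative), ready for downstream NLP:
        # {
        #   "words": [
        #     {"word": "hello", "start": 0.12, "end": 0.45, "confidence": 0.97},
        #     ...
        #   ],
        #   "backchannel": [...]
        # }
        return response.json()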