Vincent Vandeghinste and Oliver Guhr
When applying automatic speech recognition (ASR) for Belgian Dutch (Van Dyck et al., 2021), the output consists of an unsegmented stream of words, without any punctuation. The next step is to segment this stream and insert punctuation, making the ASR output more readable and presumably easier to correct manually, as the corrector can work on a segment-by-segment basis instead of on an output “stream” of sometimes several thousand words.
We present an experiment on punctuation insertion in which we tested several approaches to this problem. We first consider segmentation, i.e., splitting the stream of words into sentence-like word sequences. As a baseline we tested a machine translation approach, similar to Vandeghinste et al. (2018), in which texts with all punctuation and segmentation removed are treated as the source language and the punctuated version as the target language. We trained an OpenNMT transformer model on a randomly resegmented version of the SONAR corpus, complemented with the Corpus Spoken Dutch. Evaluation was performed on 1,000 sentences from OpenSubtitles, applying the model with a sliding window of 20 words: for every window we checked where the MT system inserted full stops (“.”), and if the proportion of full-stop predictions at a given position exceeded a 10% threshold, a segment boundary was predicted, as sketched below. With our best model and parameter settings, this approach reached a segmentation prediction F-score of 59%, which is not good enough for practical use.
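To make the voting scheme concrete, here is a minimal Python sketch of the sliding-window aggregation just described. The `mt_punctuate` argument is a hypothetical stand-in for the trained OpenNMT model, assumed to return the input words with predicted punctuation attached, aligned one-to-one with the input.

```python
def predict_boundaries(words, mt_punctuate, window=20, threshold=0.10):
    """Slide a window over the word stream and let each window vote on
    full-stop positions; predict a boundary where the vote ratio among
    the windows covering a position exceeds the threshold.

    mt_punctuate: hypothetical callable standing in for the MT model;
    assumed to map a list of words to the same words with predicted
    punctuation attached, aligned one-to-one.
    """
    votes = [0] * len(words)    # full-stop votes per word position
    covered = [0] * len(words)  # number of windows covering each position
    for start in range(max(1, len(words) - window + 1)):
        for offset, token in enumerate(mt_punctuate(words[start:start + window])):
            covered[start + offset] += 1
            if token.endswith("."):
                votes[start + offset] += 1
    return [i for i, (v, c) in enumerate(zip(votes, covered))
            if c and v / c > threshold]
```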
The model we present here is an extension of the models of Guhr et al. (2021) for Dutch and is made available at https://huggingface.co/oliverguhr/fullstop-dutch-punctuation-prediction.
We trained a token classification model based on the Dutch language model RobBERT (Delobelle et al., 2020). For every word in the input sequence, the model predicts the punctuation marker that follows the word: one of “.”, “,”, “?”, “-”, “:”, or “0” if the word is not followed by a marker. The Dutch full-stop model was trained on the Dutch portion of the Europarl v8 data, analogous to the other languages in the model. The model achieved an F-score of 0.96 for segmentation, 0.81 for “,”, 0.85 for “?”, 0.46 for “-”, and 0.66 for “:”. These results are comparable to those of the English, French, German and Italian models. In line with the evaluation results of the SEPP-NLG 2021 shared task, we observed a degradation of the model's performance on out-of-domain data.
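A minimal usage sketch follows, assuming the released checkpoint loads as a standard Hugging Face token-classification model with the label strings listed above (the exact label inventory should be verified against the checkpoint's `id2label` configuration).

```python
from transformers import pipeline

# Load the released Dutch checkpoint as a token-classification pipeline.
# Predictions are at the subword level; the label "0" is assumed to mean
# that no punctuation marker follows the token.
punct = pipeline("token-classification",
                 model="oliverguhr/fullstop-dutch-punctuation-prediction")

for pred in punct("hallo hoe gaat het met jou vandaag"):
    marker = "" if pred["entity"] == "0" else pred["entity"]
    print(pred["word"] + marker)
```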
The best score on the out-of-domain data from OpenSubtitles was obtained with a sliding window of 200 words and a 10% threshold, reaching an F-score of 83.81% for segmentation, a large improvement over the baseline approach. We are investigating strategies to improve the model's performance on unseen data, such as creating a more diverse dataset covering different domains.
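For completeness, a minimal sketch of the downstream step that motivates this work: cutting the word stream into segments at the predicted boundaries so that a manual corrector can work segment by segment. The boundary format assumes indices of segment-final words, as in the `predict_boundaries` sketch above.

```python
def split_segments(words, boundaries):
    """Cut the word stream into sentence-like segments at the predicted
    boundary positions (indices of segment-final words), e.g. the
    output of predict_boundaries above."""
    segments, start = [], 0
    for b in sorted(boundaries):
        segments.append(" ".join(words[start:b + 1]))
        start = b + 1
    if start < len(words):  # trailing words after the last boundary
        segments.append(" ".join(words[start:]))
    return segments

# Example: a boundary after word index 3 yields two segments.
# split_segments("dit is een test ja het werkt".split(), [3])
# -> ["dit is een test", "ja het werkt"]
```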
Furthermore, we are training multilingual models capable of processing all five languages within a single model.