Understanding sentences in Dutch EHR: building a dataset from medical ontologies and sentence templates and leveraging it

François Remy, Stijn Dupulthys, Peter De Jaeger, Kris Demuynck and Thomas Demeester

A significant portion of the research effort on processing unstructured EHR text revolves around entity detection and entity linking, which are both important aspects of the task. However, clinical note processing cannot be limited to mastering these well-defined settings.

Indeed, it is critical to know not only which entities are mentioned in the clinical text, but also why they are mentioned: it matters whether a patient has a disorder, thinks they might have the disorder, or knows a family member who had that disorder (among other options).

In this project, we set out to train a machine learning model that can not only recognize the medical concepts contained in unstructured medical text, but also produce a representation indicating the role of each concept in the patient’s record.

To achieve this, we generated a dataset containing millions of sentences based on medical ontologies such as UMLS and on a set of “concept roles” defined over semantic subsets of these ontologies. For each concept role, we wrote multiple verbatim sentence templates in both English and Dutch, which can be filled with medical concepts from the ontology.
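As an illustration of the template-filling idea described above, the following minimal sketch crosses a few sentence templates with a few concepts to produce labeled sentences. The role names, the template wording, the concept list and the function name `generate_examples` are all hypothetical and chosen for illustration only; they are not taken from the released dataset, and the actual generation pipeline may differ.

```python
from itertools import product

# Hypothetical concept roles, each with illustrative sentence templates.
TEMPLATES = {
    "patient_has": [
        "The patient has {concept}.",
        "The patient was diagnosed with {concept}.",
    ],
    "patient_suspects": [
        "The patient thinks they might have {concept}.",
    ],
    "family_history": [
        "A family member of the patient had {concept}.",
    ],
}

# Hypothetical semantic subset each role is defined on.
ROLE_SEMANTIC_TYPE = {
    "patient_has": "disorder",
    "patient_suspects": "disorder",
    "family_history": "disorder",
}

# Hypothetical stand-in for concepts drawn from an ontology subset.
CONCEPTS = {
    "disorder": ["diabetes mellitus", "asthma"],
}


def generate_examples(templates, role_types, concepts):
    """Fill every template of every role with every compatible concept."""
    examples = []
    for role, role_templates in templates.items():
        compatible = concepts[role_types[role]]
        for template, concept in product(role_templates, compatible):
            examples.append({
                "sentence": template.format(concept=concept),
                "concept": concept,
                "role": role,
            })
    return examples


examples = generate_examples(TEMPLATES, ROLE_SEMANTIC_TYPE, CONCEPTS)
```

Each generated record keeps the concept and its role alongside the sentence, so a model trained on such data can be supervised on both what is mentioned and why. With a full ontology in place of the two toy concepts, the same cross product would scale to a very large number of sentences.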

We release both the English dataset and its Dutch counterpart (built on top of SnomedCT Belgium + Netherlands and manual translation of the templates) as part of this presentation, so that other researchers can use and iterate on them.

This presentation will also preview a future journal paper on the results obtained by training models on this dataset combined with other sources. These results include a Dutch model that can analyze unstructured text at the sentence level and provide better EHR representations of documents than concept extraction tools.