New meets old: Word Sense Disambiguation with Embeddings

Hans Van Halteren and Jorrit Visser

Nowadays, most NLP tasks are approached in a deep learning setting. The words are embedded with a BERT-like model and all further processing is done on the basis of the embeddings. As these embeddings incorporate the context of the word, we expect that explicit disambiguation of ambiguous words is no longer needed. Note that this was not yet the case for word embeddings like word2vec, which were linked to the word form without its context, so that the embedding mixed all interpretations of the word, with the exact mix influenced by their relative frequencies. In earlier approaches, disambiguation did play a part: POS tagging was supposed to resolve syntactic ambiguity and Word Sense Disambiguation (WSD) semantic ambiguity. For some tasks, such as lexicographical ones, symbolic disambiguation should still be preferred. Word embeddings may be good for machine learning, but they are not quite suitable for human consumption.

In this paper, we investigate whether the presumably better processing with BERT-like embeddings can be used to annotate texts with symbolic representations that disambiguate the word forms they contain. We focus on WSD for English text. For each word form, we generate embeddings for all of its occurrences in the British National Corpus and apply clustering to these embeddings. The cluster ids are then used as annotation. We do this in two ways. In the first, we use the BERT embeddings themselves (768-dimensional vectors) and apply k-means clustering; the cluster ids are arbitrarily assigned numbers. In the second, we use vectors built from the 200 most likely substitute words provided by RoBERTa masked word prediction and apply hierarchical clustering; here, the cluster ids consist of lists of the most frequently observed substitute words.
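To make the two pipelines concrete, the sketch below shows one possible implementation with Hugging Face transformers and scikit-learn. The model names (bert-base-uncased, roberta-base), the layer choice, the cluster counts and the bag-of-substitutes representation are illustrative assumptions, not necessarily the exact settings used in our experiments.

    import numpy as np
    import torch
    from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM
    from sklearn.cluster import KMeans, AgglomerativeClustering

    bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    rob_tok = AutoTokenizer.from_pretrained("roberta-base")
    rob = AutoModelForMaskedLM.from_pretrained("roberta-base")

    def bert_embedding(words, target_idx):
        # 768-dimensional contextual embedding of the word at position target_idx
        # (first subtoken of that word, last hidden layer).
        enc = bert_tok(words, is_split_into_words=True, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]
        sub_idx = enc.word_ids(0).index(target_idx)
        return hidden[sub_idx].numpy()

    def roberta_substitutes(words, target_idx, k=200):
        # k most likely substitute words predicted at the masked target position.
        masked = words[:target_idx] + [rob_tok.mask_token] + words[target_idx + 1:]
        enc = rob_tok(" ".join(masked), return_tensors="pt")
        with torch.no_grad():
            logits = rob(**enc).logits[0]
        mask_pos = (enc["input_ids"][0] == rob_tok.mask_token_id).nonzero()[0, 0]
        top = logits[mask_pos].topk(k)
        return [rob_tok.decode([i]).strip() for i in top.indices.tolist()]

    # Two occurrences of the word form "feet"; in practice all occurrences
    # of the word form in the British National Corpus are collected.
    occurrences = [
        ("At the same time raise the feet off the floor as far possible .".split(), 6),
        ("By the time that the glider is down to 500 feet or so ,".split(), 10),
    ]

    # Method 1: k-means on the raw BERT vectors; cluster ids are arbitrary numbers.
    X = np.array([bert_embedding(w, i) for w, i in occurrences])
    bert_cluster_ids = KMeans(n_clusters=2, n_init=10).fit_predict(X)

    # Method 2: hierarchical clustering on bag-of-substitutes vectors; each cluster
    # can then be labelled with the substitute words observed most often in it.
    subs = [roberta_substitutes(w, i) for w, i in occurrences]
    vocab = sorted({s for lst in subs for s in lst})
    S = np.array([[1.0 if v in lst else 0.0 for v in vocab] for lst in subs])
    substitute_cluster_ids = AgglomerativeClustering(n_clusters=2).fit_predict(S)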

As an example, consider the following fragments:
1) At the same time raise the >feet< off the floor as far possible .
2) By the time that the glider is down to 500 >feet< or so ,
With the BERT embeddings, fragment 1 is assigned to cluster 0 and fragment 2 to cluster 1. With the RoBERTa predictions, fragment 1 is annotated with [‘hands’, ‘legs’, ‘foot’, ‘arms’, ‘head’, ‘knees’, ‘toes’, ‘hand’, ‘shoulders’, ‘eyes’] and fragment 2 with [‘meters’, ‘metres’, ‘yards’, ‘points’, ‘degrees’, ‘miles’, ‘foot’, ‘m’, ‘people’, ‘steps’].

Now, a cluster id consisting of words is much more usable for human processing, as it is a direct representation of the meaning. With the numerical ids, we would have to refer to a list of the example sentences that led to the most central embeddings in the corresponding cluster. However, when moving from full embeddings to substitute word vectors, we are bound to lose information and likely also clustering quality. In order to see whether the added convenience is worth the quality loss, we evaluate the two methods by annotating SemCor with them. We consider both the convenience (are the representations indeed indicative of the sense?) and the quality (how well does the clustering replicate the sense annotations in SemCor?).
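One way the quality question could be operationalised is to compare the induced cluster ids against the SemCor sense labels with a standard clustering-agreement score such as the adjusted Rand index; the sketch below uses hypothetical labels and is not necessarily the measure we report.

    from sklearn.metrics import adjusted_rand_score

    # Hypothetical gold senses and induced cluster ids for a handful of occurrences;
    # a score of 1.0 means the clustering exactly replicates the sense partition.
    semcor_senses = ["foot_body", "foot_body", "foot_unit", "foot_unit"]
    cluster_ids = [0, 0, 1, 1]
    print(adjusted_rand_score(semcor_senses, cluster_ids))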