Unsupervised text classification with neural word embeddings

Andriy Kosar, Guy De Pauw and Walter Daelemans

This paper presents experiments and results on unsupervised multiclass text classification of news articles, based on measuring the semantic similarity between class labels and texts through word embeddings. The experiments, conducted on English and Dutch news data of various lengths, demonstrate that the proposed approach outperforms frequency-based methods for multiclass text classification and can be employed for fast text classification when labeled data is scarce or class labels change frequently. The experiments evaluate a wide range of pre-trained (word2vec, fastText, GloVe) and custom-trained (word2vec and Doc2Vec) neural word embeddings and demonstrate that, of these methods, pre-trained word2vec embeddings are the most suitable for news classification. The paper also proposes a method that improves the results of this unsupervised classification approach with pre-trained word2vec embeddings by enriching class labels with their most similar words in the embedding space. Finally, an error analysis indicates the limitations of the approach for compound class labels and provides insights for further improvement of the classification results.
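To make the described approach concrete, the following is a minimal sketch in Python of label-similarity classification with label enrichment, assuming gensim's pre-trained "word2vec-google-news-300" vectors. The helper names (embed, enrich, classify), the averaging of token vectors, and the example labels are illustrative assumptions, not the paper's exact implementation.

import numpy as np
import gensim.downloader as api

# Illustrative pre-trained word2vec model; the paper's exact models,
# preprocessing, and label sets may differ.
model = api.load("word2vec-google-news-300")

def embed(tokens):
    """Average the embeddings of in-vocabulary tokens."""
    vectors = [model[t] for t in tokens if t in model]
    return np.mean(vectors, axis=0) if vectors else None

def enrich(label, topn=5):
    """Expand a class label with its nearest neighbours in embedding space
    (assumes the label is a single in-vocabulary word)."""
    neighbours = [w for w, _ in model.most_similar(label, topn=topn)]
    return [label] + neighbours

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(text_tokens, labels):
    """Assign the label whose enriched embedding is most similar to the text."""
    text_vec = embed(text_tokens)
    scores = {lab: cosine(text_vec, embed(enrich(lab))) for lab in labels}
    return max(scores, key=scores.get)

# Hypothetical usage: an unlabeled news snippet classified against three labels.
print(classify("the goalkeeper saved a penalty in the final".split(),
               ["sports", "politics", "economy"]))

In this sketch, label enrichment replaces a single label vector with the mean vector of the label and its nearest neighbours, which mirrors the enrichment idea in the abstract while leaving the paper's specific similarity measure and enrichment parameters unspecified.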