Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann and Walter Daelemans
Evaluating open-domain conversations remains an open problem. The default approach is to hire human annotators, but they come with shortcomings: inconsistency, lack of reproducibility, and cost. Researchers have therefore proposed several automated metrics for evaluating open-domain conversations.
In this paper, we present a new automated evaluation method (FULL) inspired by the use of follow-ups as an evaluation signal. We append follow-ups (e.g., "What are you trying to say?") to the end of a conversation and measure the log-likelihood that a language model assigns to generating these follow-ups.
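To make the scoring procedure concrete, the sketch below computes the log-likelihood a causal language model assigns to a follow-up appended to a conversation. It is a minimal illustration of the idea, not the paper's exact implementation: the model name ("microsoft/DialoGPT-medium"), the separator, and the follow-up strings are assumptions chosen for the example.

```python
# Minimal sketch: log-likelihood of a follow-up appended to a conversation.
# Assumptions: model choice, separator, and follow-up strings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
model.eval()

def follow_up_log_likelihood(conversation: str, follow_up: str) -> float:
    """Sum of token log-probabilities the LM assigns to `follow_up`
    when it is appended to `conversation`."""
    context_ids = tokenizer(conversation, return_tensors="pt").input_ids
    follow_ids = tokenizer(follow_up, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, follow_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probabilities over the vocabulary for each next-token prediction.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    # Gather the log-probability of each actually observed next token.
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions corresponding to the appended follow-up.
    n_follow = follow_ids.size(1)
    return token_log_probs[0, -n_follow:].sum().item()

conversation = "Hello, how are you?\nI like trains because they are purple."
score = follow_up_log_likelihood(conversation, "\nWhat are you trying to say?")
print(score)  # a higher value means the negative follow-up is more likely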
We show that a language model evaluates a conversation through the likelihood of negative follow-ups (e.g., "What are you trying to say?") rather than positive ones (e.g., "Wow, super interesting").
Compared against 12 other automated metrics, FULL exhibits the strongest correlation with human evaluations.