Smooth operators. Development and effects of personalized conversational AI.

Anouck Braggaar, Gabriëlla Martijn, Christine Liebrecht, Charlotte van Hooijdonk, Emiel van Miltenburg, Florian Kunneman, Emiel Krahmer, Hans Hoeken and Hedwig te Molder

Organizations are increasingly implementing chatbots for customer service, as chatbots are always available and can help customers quickly. However, chatbots have yet to reach their full potential: (1) chatbot technology still faces technical limitations, (2) customers perceive chatbot communication as unnatural and impersonal, and (3) customer service employees are still finding their way in collaborating with their new ‘colleague’. In this four-year NWO-funded project, we aim to develop and evaluate chatbots with a human touch that improve the experiences of both customers and employees, and their mutual collaboration, within a customer service context.

In the first year of the project, we focused on the evaluation of customer service chatbots on the one hand, and the evaluation of the multifaceted collaboration between customer service employees and chatbots on the other hand.

A systematic literature review was conducted to investigate how chatbots, as task-oriented dialogue systems, are evaluated within different fields of study. While the more technical fields (such as NLP) seem to focus largely on automatic metrics, the more business-oriented fields (such as communication science) often make use of human evaluations. A search of four databases (ACL, ACM, IEEE, and Web of Science) retrieved 3,800 records that contained an evaluation of task-oriented dialogue systems/chatbots or discussed evaluation techniques. After screening, 146 studies were included in the literature review. These papers were assessed on which evaluation techniques were used, how they were used, and in what context. The final goal of the study is to provide an overview of the metrics used in the technical fields and to make them understandable and usable for the business-oriented fields.
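
To make the contrast between the two evaluation styles concrete, consider the following minimal sketch. It assumes the sacrebleu package; the example responses and annotator ratings are invented for illustration and are not taken from the reviewed studies.

```python
import sacrebleu
from statistics import mean

# Automatic evaluation: score system responses against reference responses.
system_responses = [
    "Your order will arrive tomorrow.",
    "You can return the item within 30 days.",
]
# sacrebleu expects a list of reference streams, each aligned with the hypotheses.
reference_responses = [[
    "Your package is expected to arrive tomorrow.",
    "Items can be returned within 30 days of purchase.",
]]
bleu = sacrebleu.corpus_bleu(system_responses, reference_responses)
print(f"Corpus BLEU: {bleu.score:.1f}")

# Human evaluation: aggregate Likert-scale judgments (here 1-5) collected from
# annotators on a dimension such as perceived naturalness of the responses.
naturalness_ratings = [4, 5, 3, 4, 4]  # hypothetical annotator scores
print(f"Mean naturalness rating: {mean(naturalness_ratings):.2f}")
```

The automatic route scales to thousands of responses but only measures surface overlap with references; the human route captures perceptions such as naturalness, which is precisely the dimension the business-oriented fields tend to care about.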

The perceptions of managers, conversational designers, and human agents regarding their criteria for evaluating human-chatbot collaboration were examined by means of an interview study. We found that all parties used their own criteria to evaluate the collaboration, and that these criteria varied with the interviewees’ job positions. Managers evaluate the chatbot collaboration in terms of cost reduction. Conversational designers perceive both customers and human agents as their ‘customers’, focusing on customer satisfaction as their main evaluation criterion. Human agents evaluate the collaboration by looking at the extent to which working with the chatbot has positively affected their job satisfaction and has resulted in traffic improvements. Finally, our results showed that both human agents and conversational designers advocate back-end integration of the chatbot to improve the collaboration. However, it also became clear that this collaboration raises new dilemmas, such as team alignment and privacy issues related to the processing of personal data. Such insights could be considered in future chatbot design to make the collaboration within human-chatbot teams run as smoothly as possible, benefiting organizations, human agents, and customers alike.
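
As an illustration of what such back-end integration could involve, the sketch below shows a hypothetical chatbot-to-agent handover record. All names and fields are our own assumptions rather than any existing platform’s API; the pseudonymized customer reference reflects the privacy dilemma mentioned above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HandoverRecord:
    """Hypothetical context a chatbot could pass to a human agent."""
    conversation_id: str
    detected_intent: str    # what the chatbot inferred the customer wants
    transcript: List[str]   # prior turns, so the customer need not repeat themselves
    customer_reference: str # pseudonymized ID instead of raw personal data

def hand_over(record: HandoverRecord) -> None:
    """Route an unresolved conversation to a human agent with its context."""
    print(f"[{record.conversation_id}] escalating intent "
          f"'{record.detected_intent}' with {len(record.transcript)} prior turns")

hand_over(HandoverRecord(
    conversation_id="c-0421",
    detected_intent="refund_request",
    transcript=[
        "Customer: I want my money back.",
        "Bot: I'm sorry to hear that. Could you share your order number?",
    ],
    customer_reference="pseudo-7f3a",  # no name, email, or address passed along
))
```

The design choice of passing a pseudonymized reference rather than full customer details is one way the privacy concerns raised by interviewees could be addressed without losing the conversational context that agents ask for.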