Lara Verheyen, Jérôme Botoko Ekila, Jens Nevens, Paul Van Eecke and Katrien Beuls
The task of visual dialogue consists in modelling an agent that can hold a coherent and meaningful conversation about visual input with a human interlocutor. The challenges that come with visual dialogue are twofold. First, the agent needs to ground the utterances in the provided visual input. Second, due to the incremental nature of conversations, the utterances also need to be grounded in the conversational history.
In this work, we propose a novel methodology for the task of visual dialogue that tackles both of these challenges. We build on earlier work on procedural semantics for visual question answering (Nevens et al., 2019) and extend it with two novel mechanisms: a hybrid procedural semantic representation and a conversation memory. The hybrid procedural semantic representation describes the cognitive operations that need to be executed in order to find the answer to a question. This representation is executed in a hybrid fashion, combining symbolic and sub-symbolic operations and thereby exploiting the strengths of both approaches. The symbolic operations are responsible for high-level reasoning over structured data, such as the conversational history. The sub-symbolic operations, implemented as neural networks that each perform an atomic task, such as finding a specific object in an image, are used for pattern finding in unstructured data, namely the images. For mapping questions onto their meaning representations, we designed a computational construction grammar that achieves 100% accuracy on this task. The second novel mechanism is the conversation memory, a data structure that incrementally and explicitly stores information from the conversation. Among other things, it stores a symbolic representation of the topic of each turn, including its grounding in the image. This memory can then be accessed by symbolic operations such as get-last-topic, which returns the topic of the previous turn.
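As a minimal illustration of how these two mechanisms interact, consider the following Python sketch. It is not our actual implementation: all names (ConversationMemory, get_last_topic, find_object, query_colour) are hypothetical, and the sub-symbolic operations are stubs standing in for the neural networks of the real system.

from dataclasses import dataclass, field
from typing import Any

# --- Conversation memory: explicit, incremental storage of dialogue state ---

@dataclass
class TurnRecord:
    turn: int        # index of the dialogue turn
    topic: str       # symbolic description of the turn's topic
    grounding: Any   # its grounding in the image, e.g. a bounding box

@dataclass
class ConversationMemory:
    turns: list = field(default_factory=list)

    def add_turn(self, topic, grounding):
        self.turns.append(TurnRecord(len(self.turns), topic, grounding))

    def get_last_topic(self):
        # Symbolic operation: resolve anaphora ("it", "that one") by
        # retrieving the previous turn's topic together with its grounding.
        return self.turns[-1]

# --- Hybrid execution: symbolic operations read structured data (the ---
# --- memory), sub-symbolic stubs find patterns in the raw image.      ---

def find_object(image, category):
    # Placeholder for an atomic neural operation.
    return {"category": category, "bbox": (12, 30, 64, 80)}

def query_colour(image, grounding):
    # Placeholder for an atomic neural operation.
    return "red"

memory = ConversationMemory()
image = ...  # raw pixels in the real system

# Turn 1: "Is there a cube?" -> program: [find-object cube]
cube = find_object(image, "cube")                 # sub-symbolic
memory.add_turn(topic="cube", grounding=cube)     # update the memory

# Turn 2: "What colour is it?" -> program: [get-last-topic, query-colour]
referent = memory.get_last_topic()                # symbolic
answer = query_colour(image, referent.grounding)  # sub-symbolic
print(answer)  # -> "red"

The key design point this sketch illustrates is the division of labour: the memory and its accessors are fully symbolic and therefore inspectable, while only the atomic perceptual operations are delegated to neural networks.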
We validate the methodology on standard benchmarks, achieving an accuracy of 98.5% on the MNIST Dialog dataset (Seo et al., 2017) and 95.9% on the CLEVR-Dialog dataset (Kottur et al., 2019). We strongly believe that this novel methodology, combining hybrid procedural semantics with a conversation memory, paves the way for the next wave of transparent and interpretable conversational agents that can hold coherent and meaningful conversations with their human interlocutors.
References
Satwik Kottur, José M.F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 582–595, Minneapolis, Minnesota, 2019. ACL.
Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, and Leonid Sigal. Visual reference resolution using attention memory for visual dialog. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3722–3732, Red Hook, NY, USA, 2017. Curran Associates Inc.
Jens Nevens, Paul Van Eecke, and Katrien Beuls. Computational construction grammar for visual question answering. Linguistics Vanguard, 5(1):20180070, 2019.