A GRAPH ATTENTION NETWORK FOR VISUAL COMMONSENSE REASONING

Abstract: Visual Commonsense Reasoning (VCR) is a challenging multimodal task proposed in recent years. To better understand the semantic relationships in images and improve performance on the VCR task, a visual commonsense reasoning method based on a graph attention network is proposed. For images of various types, the method encodes the visual objects in an image as visual nodes and uses a graph attention network to model the features of each visual node and its adjacent nodes, yielding the internal associations among the visual objects. The method effectively captures the dynamic interactions among visual objects and improves semantic understanding of the image. Experiments on the VCR dataset show that the method improves performance on all three VCR subtasks, demonstrating its effectiveness.
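
The core operation summarized in the abstract, attention over each visual node and its adjacent nodes, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes the standard single-head graph attention layer (a shared linear projection, a LeakyReLU-scored attention vector, and a softmax over each node's neighborhood). The function name `graph_attention_layer`, the parameters `W` and `a`, the self-loops, and the toy graph are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def graph_attention_layer(h, adj, W, a, negative_slope=0.2):
    """Single-head graph attention over visual nodes (illustrative sketch).

    h:   (N, F) node features, one row per visual object
    adj: (N, N) 0/1 adjacency; adj[i, j] = 1 if node j is adjacent to node i
    W:   (F, F_out) shared linear projection (assumed parameter)
    a:   (2 * F_out,) attention vector (assumed parameter)
    """
    z = h @ W                                   # projected features, (N, F_out)
    f_out = z.shape[1]

    # a^T [z_i || z_j] splits into a source term plus a destination term.
    src = z @ a[:f_out]                         # (N,)
    dst = z @ a[f_out:]                         # (N,)
    e = src[:, None] + dst[None, :]             # raw scores, (N, N)
    e = np.where(e > 0, e, negative_slope * e)  # LeakyReLU

    # Restrict attention to adjacent nodes; self-loops are an assumption
    # that also guarantees every row has at least one valid entry.
    mask = (adj + np.eye(adj.shape[0])) > 0
    e = np.where(mask, e, -np.inf)

    # Numerically stable softmax over each node's neighborhood.
    e = e - e.max(axis=1, keepdims=True)
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)

    # Each node's new feature is the attention-weighted sum of its neighbors.
    return alpha @ z, alpha

# Toy example: four visual objects connected in a chain 0-1-2-3.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
W = rng.normal(size=(8, 5))
a = rng.normal(size=(10,))
out, alpha = graph_attention_layer(h, adj, W, a)
```

Here `alpha[i, j]` is the weight node `i` places on node `j`; it is zero for non-adjacent pairs, and each row sums to one. How the graph is built from detected objects and how the output feeds the VCR subtasks are not specified in the abstract and are left out of this sketch.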