Implicit Text-Image Fine-Grained Matching Based on Visual Contrastive Attention

Yin Yajue (殷亚珏); Wang Jingjing (王晶晶)

doi:10.3969/j.issn.1000-386x.2025.07.020

IMPLICIT TEXT-IMAGE FINE-GRAINED MATCHING VIA VISUAL CONTRASTIVE ATTENTION

Abstract

The text-image fine-grained matching task aims to align fine-grained entities in images and text (e.g., aligning a target object in an image with the phrase that refers to it in the text). Unlike previous studies, this paper proposes a new text-image fine-grained matching task for implicit scenarios, which focuses on text-image pairs whose fine-grained matching relationships can be identified only by relying on context or additional external knowledge. For this new task, the paper formulates corresponding corpus annotation guidelines and annotates a text-image fine-grained matching dataset for implicit scenarios. On this basis, it proposes a method based on visual contrastive attention to alleviate the sparsity of semantic matching information in the new task. Experimental results show that the proposed visual contrastive attention method achieves a significant performance improvement on the implicit matching task.
