Chinese Image Captioning Based on a Fusion Encoder and Visual Keyword Search

孟繁聪; 徐伟; 李海波; 吴闻; 郑骏杰; 陈兴

doi:10.3969/j.issn.1000-386x.2025.04.030

A CHINESE IMAGE CAPTIONING METHOD BASED ON FUSION ENCODER AND VISUAL KEYWORD SEARCH

Abstract

Existing image captioning models pay too little attention to local image details and tend to produce generic descriptions. To address these problems, a Chinese image captioning method that combines a fusion encoder with visual keyword search is proposed. A fusion encoder is constructed that extracts both local and global image features within a single convolutional neural network (CNN), enriching the semantic information available to the long short-term memory (LSTM) network during decoding. To counter generic phrasing, a CNN-based image retrieval method is used to find potential visual words, which are integrated into word-vector generation in the decoding stage. A reinforcement learning mechanism is introduced to optimize the CIDEr evaluation metric at the sentence level, with the aim of improving the lexical diversity of the generated descriptions. Experimental results verify the effectiveness of the proposed method.
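The visual keyword search step can be illustrated with a minimal sketch. The abstract does not specify the retrieval procedure, so everything below is an assumption: images are compared by cosine similarity of their CNN feature vectors, and candidate keywords are the words that occur most often in the captions of the nearest gallery images. The names `cosine`, `search_visual_keywords`, `top_k`, and `max_words` are hypothetical, and the toy features and English tokens stand in for real CNN features and segmented Chinese captions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_visual_keywords(query_feat, gallery, top_k=2, max_words=3):
    """Return candidate keywords for a query image (illustrative only).

    gallery: list of (feature_vector, tokenized_caption) pairs, assumed to
    come from a captioned training set encoded by the same CNN.
    """
    # Rank gallery images by similarity to the query image.
    ranked = sorted(gallery, key=lambda item: cosine(query_feat, item[0]),
                    reverse=True)
    # Count in how many of the top_k retrieved captions each word appears.
    counts = Counter()
    for _, tokens in ranked[:top_k]:
        counts.update(set(tokens))
    # Sort by descending count, then alphabetically, for a deterministic order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ordered[:max_words]]


# Toy gallery: 2-D vectors stand in for CNN features.
gallery = [
    ([1.0, 0.0], ["dog", "grass", "run"]),
    ([0.9, 0.1], ["dog", "ball"]),
    ([0.0, 1.0], ["cat", "sofa"]),
]
keywords = search_visual_keywords([1.0, 0.05], gallery)
```

In this sketch the retrieved keywords would then be passed to the decoder as extra input. How the proposed method actually integrates them into word-vector generation is not described in the abstract.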
