Abstract:
To address the problem that existing image captioning models pay insufficient attention to the local details of an image and tend to produce generic descriptions, a Chinese image captioning method combining a fusion encoder with visual keyword search is proposed. The fusion encoder extracts the local and global features of an image simultaneously with a convolutional neural network (CNN), enriching the semantic information of the image features available in the long short-term memory (LSTM) decoding stage. To counter generic wording, a CNN-based image retrieval method is used to find potential visual words, which are integrated into the word vector generation process in the decoding stage. A reinforcement learning mechanism is introduced to optimize the CIDEr evaluation metric at the sentence level, improving the lexical diversity of the generated descriptions. Experimental results verify the effectiveness of the proposed method.
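
As an illustration of what sentence-level optimization of a caption metric can look like, the following is a minimal sketch of a REINFORCE-style loss with a baseline. The abstract does not state which reinforcement learning formulation is used, so the choice of estimator is an assumption, and the function name `reinforce_loss`, its parameters, and all numeric values are hypothetical. The reward is a placeholder for a sentence-level CIDEr score.

```python
import math


def reinforce_loss(token_log_probs, reward, baseline):
    """Policy-gradient loss for one sampled caption (illustrative only).

    token_log_probs: log-probabilities the decoder assigned to each
                     sampled word of the caption.
    reward:          sentence-level score of the sampled caption
                     (a stand-in for CIDEr; hypothetical value here).
    baseline:        reference score subtracted to reduce variance,
                     e.g. the score of a greedily decoded caption
                     (an assumption, not taken from the abstract).
    """
    advantage = reward - baseline
    # Minimizing this loss raises the probability of captions that
    # score above the baseline and lowers it for those that score below.
    return -advantage * sum(token_log_probs)


# Hypothetical two-word caption sampled with probabilities 0.5 and 0.25.
sampled_log_probs = [math.log(0.5), math.log(0.25)]
loss = reinforce_loss(sampled_log_probs, reward=1.2, baseline=0.8)
```

Because the reward is computed on the whole sampled sentence rather than on each word, the training signal in such a scheme matches the metric used at evaluation time.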