Journal of Jilin University Science Edition, 2022, Vol. 60, Issue (5): 1103-1112.


Chinese Caption of Fine-Grained Images Based on Transformer

XIAO Xiong1, XU Weifeng1, WANG Hongtao1, SU Pan1, GAO Sihua2   

  1. Department of Computer, North China Electric Power University (Baoding), Baoding 071003, Hebei Province, China;
    2. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
  • Received: 2021-10-10  Online: 2022-09-26  Published: 2022-09-26

Abstract: Aiming at the problem that the traditional recurrent neural network (RNN) structure used in Chinese image captioning is not conducive to generating long sentences and lacks detailed semantic information, we proposed a Transformer multi-head attention (MHA) network that fuses coarse-grained global features with fine-grained regional target entity features. Through this fusion of multi-scale features, image attention can more easily focus on fine-grained target regions, and an image representation with richer fine-grained semantic information is obtained, thereby effectively improving image captioning. The model was verified on the ICC dataset with a variety of evaluation indicators, and the results show that it achieves better image captioning performance on all indicators.
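To illustrate the kind of multi-scale fusion described in the abstract, the following is a minimal sketch (not the authors' code) of multi-head attention in which fine-grained regional features attend to a combination of the coarse-grained global feature and the regional features. It is written in PyTorch; the class name, feature dimensions, and region count are illustrative assumptions, not details from the paper.

import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Hypothetical fusion of global and regional image features via MHA."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, global_feat, region_feats):
        # global_feat:  (batch, 1, d_model)  coarse-grained global image feature
        # region_feats: (batch, k, d_model)  fine-grained target-region features
        memory = torch.cat([global_feat, region_feats], dim=1)
        # Regional features act as queries over the concatenated multi-scale memory.
        fused, _ = self.mha(region_feats, memory, memory)
        # Residual connection plus layer normalization, as in a Transformer block.
        return self.norm(region_feats + fused)

# Example usage with random features: batch of 2 images, 36 detected regions.
fusion = MultiScaleFusion()
g = torch.randn(2, 1, 512)
r = torch.randn(2, 36, 512)
out = fusion(g, r)
print(out.shape)  # torch.Size([2, 36, 512])

In such a design, the fused regional representation could then be fed to a Transformer decoder as the visual memory for caption generation; the actual architecture in the paper may differ.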

Key words: image Chinese caption, fine-grained feature, multi-head attention (MHA)

CLC Number: TP391