Journal of Jilin University (Science Edition) ›› 2022, Vol. 60 ›› Issue (5): 1103-1112.



Chinese Caption of Fine-Grained Images Based on Transformer

XIAO Xiong1, XU Weifeng1, WANG Hongtao1, SU Pan1, GAO Sihua2   

  1. Department of Computer Science, North China Electric Power University (Baoding), Baoding 071003, Hebei Province, China;
    2. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
  • Received: 2021-10-10 Online: 2022-09-26 Published: 2022-09-26
  • Corresponding author: XU Weifeng, E-mail: weifengxu@163.com


Abstract: Aiming at the problem that the traditional recurrent neural network (RNN) structure used for Chinese image captioning is not conducive to generating long sentences and lacks detailed semantic information, we proposed a Transformer multi-head attention (MHA) network that fuses coarse-grained global features with fine-grained regional target entity features. Through this multi-scale feature fusion, the method makes it easier for image attention to focus on fine-grained target regions and obtains an image representation with richer fine-grained semantic features, thus effectively improving image captioning. Verification with a variety of evaluation metrics on the ICC dataset shows that the model achieves better captioning results on all metrics.

Key words: image Chinese caption, fine-grained feature, multi-head attention (MHA)
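
The following is a minimal, illustrative sketch (not the authors' released code) of how the multi-scale fusion described in the abstract could be wired up: a coarse-grained global image feature attends over fine-grained region features through multi-head attention. The class name GlobalRegionFusion, the feature dimensions, and the residual fusion step are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

class GlobalRegionFusion(nn.Module):
    """Hypothetical fusion module: the global feature queries region features via MHA."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, global_feat, region_feats):
        # global_feat:  (batch, 1, d_model)            coarse-grained image feature
        # region_feats: (batch, num_regions, d_model)  fine-grained region features
        attended, _ = self.mha(global_feat, region_feats, region_feats)
        # Residual fusion of the two granularities, followed by layer normalization.
        return self.norm(global_feat + attended)

# Toy usage with random tensors standing in for extracted image features.
fusion = GlobalRegionFusion()
g = torch.randn(2, 1, 512)    # one global feature per image
r = torch.randn(2, 36, 512)   # 36 detected region features per image
print(fusion(g, r).shape)     # torch.Size([2, 1, 512])
```

In this sketch the global feature acts as the attention query so that the attention weights concentrate on the detected regions most relevant to the image as a whole, which mirrors the abstract's goal of focusing attention on fine-grained target regions.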

CLC number: 

  • TP391