Video captioning method based on enhanced object learning and attention networks

doi:10.13229/j.cnki.jdxbgxb.20240251

Abstract:

In video captioning, a common problem is that the generated descriptions of objects are not specific enough, mainly because the model does not fully learn the object information in the video. Videos also contain abundant feature information, such as object, motion, and contextual information, which makes it challenging to strengthen the model's ability to learn key information when generating captions. To address these problems, this paper proposes a video captioning method based on enhanced object learning and attention networks. First, a new enhanced object learning module is designed to fully learn the object information in videos, so that video content is described accurately. Second, an attention network is constructed to dynamically adjust the weights of the different types of information, which enhances the model's ability to learn key information when generating captions. In experiments on the MSVD and MSR-VTT datasets, the captions generated by the proposed method are more specific and accurate, and the method exceeds current advanced methods on several evaluation metrics, which verifies its feasibility.

Key words: deep learning, video captioning, enhanced object learning, attention network

CLC Number:

TP391
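
The abstract describes an attention network that dynamically adjusts the weights of different types of information (object, motion, context). The page gives no architectural details, so the following is only a minimal sketch of the general idea, assuming a generic scaled dot-product attention over three feature vectors. The function names, the toy vectors, and the use of a decoder state as the query are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_features(query, features):
    """Weight each feature type by its similarity to the query.

    `query` stands in for a decoder state (an assumption); `features`
    maps a feature-type name to a vector of the same length.
    Returns the per-type attention weights and the fused vector.
    """
    names = list(features)
    scale = math.sqrt(len(query))
    scores = [dot(query, features[name]) / scale for name in names]
    weights = softmax(scores)
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


# Toy inputs, invented purely for illustration.
decoder_state = [1.0, 0.0, 0.5, 0.0]
features = {
    "object": [0.9, 0.1, 0.4, 0.0],
    "motion": [0.1, 0.8, 0.0, 0.3],
    "context": [0.3, 0.3, 0.3, 0.3],
}
weights, fused = fuse_features(decoder_state, features)
print(weights)
print(fused)
```

With these toy vectors the query is closest to the object feature, so the object weight comes out largest. In a trained model the query and features would be learned representations, and the weights would change at every decoding step.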

Xiao-dong CAI, Shun-hong LONG, Kun-jun LIANG. Video captioning method based on enhanced object learning and attention networks[J]. Journal of Jilin University (Engineering and Technology Edition), 2026, 56(2): 516-522.

Figures/Tables (6): Fig.1, Fig.2, Fig.3, Table 1, Table 2, Fig.4

References 17

[1]	Zhang J, Peng Y. Video captioning with object-aware spatio-temporal correlation and aggregation[J]. IEEE Transactions on Image Processing, 2020, 29: 6209-6222.
[2]	Zanfir M, Marinoiu E, Sminchisescu C. Spatio-temporal attention models for grounded video captioning[C]∥Computer Vision-ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 2017: 104-119.
[3]	Yang Z, Han Y, Wang Z. Catching the temporal regions-of-interest for video captioning[C]∥Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, USA, 2017: 146-153.
[4]	Zhang W, Wang X E, Tang S, et al. Relational graph learning for grounded video description generation[C]∥Proceedings of the 28th ACM International Conference on Multimedia, Seattle, USA, 2020: 3807-3828.
[5]	Zhang Z, Shi Y, Yuan C, et al. Object relational graph with teacher-recommended learning for video captioning[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 13278-13288.
[6]	Kanani C S, Saha S, Bhattacharyya P. Global object proposals for improving multi-sentence video descriptions[C]∥International Joint Conference on Neural Network, Montreal, Canada, 2021: 1-7.
[7]	Parisotto E, Song F, Rae J, et al. Stabilizing transformers for reinforcement learning[C]∥International Conference on Machine Learning, Vienna, Austria, 2020: 7487-7498.
[8]	Ye H, Li G, Qi Y, et al. Hierarchical modular network for video captioning[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 17939-17948.
[9]	Lin K, Li L, Lin C C, et al. SwinBERT: end-to-end transformers with sparse attention for video captioning[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 17949-17958.
[10]	Gu X, Chen G, Wang Y, et al. Text with knowledge graph augmented transformer for video captioning[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 18941-18951.
[11]	Jing S, Zhang H, Zeng P, et al. Memory-based augmentation network for video captioning[J]. IEEE Transactions on Multimedia, 2023, 26: 2367-2379.
[12]	Shen Y, Gu X, Xu K, et al. Accurate and fast compressed video captioning[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2023: 15558-15567.
[13]	Wang J, Jiang W, Ma L, et al. Bidirectional attentive fusion with context gating for dense video captioning[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7190-7198.
[14]	Wang B, Ma L, Zhang W, et al. Controllable video captioning with pos sequence guidance based on gated fusion network[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 2019: 2641-2650.
[15]	Li L, Gao X, Deng J, et al. Long short-term relation transformer with global gating for video captioning[J]. IEEE Transactions on Image Processing, 2022, 31: 2726-2738.
[16]	Xu J, Yao T, Zhang Y, et al. Learning multimodal attention LSTM networks for video captioning[C]∥Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, USA, 2017: 537-545.
[17]	Sun Z, Chen S, Zhong L. Visual-aware attention dual-stream decoder for video captioning[C]∥IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, 2022: 1-6.

B4: BLEU-4; M: METEOR; C: CIDEr; R: ROUGE-L; —: not reported.

Method	MSVD B4	MSVD M	MSVD C	MSVD R	MSR-VTT B4	MSR-VTT M	MSR-VTT C	MSR-VTT R
DMRM [3]	51.1	33.6	74.8	—	—	—	—	—
ORG-TRL [5]	54.3	36.4	95.2	73.9	43.6	28.8	50.9	62.1
POS-CG [14]	52.5	34.1	88.7	71.3	42.0	28.2	48.7	61.6
LSRT [15]	55.6	37.1	98.5	73.5	42.6	28.3	49.5	61.0
MA-LSTM [16]	52.3	33.6	70.4	—	36.5	26.5	41.0	59.8
VADD [17]	51.5	34.8	72.1	91.5	42.4	28.2	49.7	61.7
SwinBERT [9]	58.2	41.3	120.6	77.5	41.9	29.9	53.8	62.1
TextKG [10]	60.8	38.5	105.2	75.1	43.7	29.6	52.4	62.4
MAN [11]	60.1	37.1	101.9	74.6	42.5	28.6	50.4	62.2
ViT/L14 [12]	60.1	41.4	121.5	78.2	44.4	30.3	57.2	63.4
HMN [8]	59.2	37.7	104.0	75.1	43.5	29.0	51.5	62.7
EOLM-AN	61.5	39.0	106.7	77.5	44.2	29.4	52.1	63.6

Method	MSVD B4	MSVD M	MSVD C	MSVD R	MSR-VTT B4	MSR-VTT M	MSR-VTT C	MSR-VTT R
EOLM	61.3	38.1	105.3	77.2	44.1	29.1	51.7	63.4
AN	60.5	38.5	106.2	76.2	43.7	29.3	51.9	62.9
EOLM-AN	61.5	39.0	106.7	77.5	44.2	29.4	52.1	63.6
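
The last table reads like a module ablation, with EOLM and AN presumably denoting variants that use only the enhanced object learning module or only the attention network (this reading is an assumption, since the page carries no caption). Under that assumption, the short sketch below works out how much the combined EOLM-AN model gains over each single-module variant. The scores are copied from the table; the names `ablation` and `gain_over` are illustrative.

```python
# Scores copied from the table above, in the order B4, M, C, R.
METRICS = ["B4", "M", "C", "R"]
ablation = {
    "EOLM": {"MSVD": [61.3, 38.1, 105.3, 77.2], "MSR-VTT": [44.1, 29.1, 51.7, 63.4]},
    "AN": {"MSVD": [60.5, 38.5, 106.2, 76.2], "MSR-VTT": [43.7, 29.3, 51.9, 62.9]},
    "EOLM-AN": {"MSVD": [61.5, 39.0, 106.7, 77.5], "MSR-VTT": [44.2, 29.4, 52.1, 63.6]},
}


def gain_over(baseline, full="EOLM-AN"):
    """Per-metric difference between the full model and a baseline row.

    Differences are rounded to one decimal to match the table precision.
    """
    return {
        dataset: {
            metric: round(f - b, 1)
            for metric, f, b in zip(METRICS, ablation[full][dataset], ablation[baseline][dataset])
        }
        for dataset in ablation[full]
    }


for baseline in ("EOLM", "AN"):
    print(baseline, gain_over(baseline))
```

Every difference is positive, so the combined model is at least slightly ahead of both single-module rows on all four metrics for both datasets. The margins over the EOLM row are small (0.1 to 1.4 points).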
