基于三维卷积与自注意力机制的视频行人重识别

doi:10.13229/j.cnki.jdxbgxb.20230881

吉林大学学报(工学版) ›› 2025, Vol. 55 ›› Issue (7): 2409-2417.doi: 10.13229/j.cnki.jdxbgxb.20230881

• 计算机科学与技术 • 上一篇

基于三维卷积与自注意力机制的视频行人重识别

庄珊娜^1,²(),王君帅^1,²,白晶^1,²(),杜京瑾^1,²,王正友^1,²

^1.石家庄铁道大学信息科学与技术学院，石家庄 050043
^2.河北省电磁环境效应与信息处理重点实验室，石家庄 050043

收稿日期:2023-08-19 出版日期:2025-07-01 发布日期:2025-09-12
通讯作者: 白晶 E-mail:zhuangshn@stdu.edu.cn;jingbai01@126.com
作者简介:庄珊娜（1985-），女，副教授，博士. 研究方向：计算机视觉及行人重识别.E-mail： zhuangshn@stdu.edu.cn
基金资助:
国家自然科学基金项目(62341121);河北省自然科学基金项目(F2023210001);河北省教育厅青年基金项目(QN2022086);河北省教育厅青年基金项目(QN2025052);河北省社会科学基金项目(HB19JY017)

Video-based person re-identification based on three-dimensional convolution and self-attention mechanism

Shan-na ZHUANG^1,²(),Jun-shuai WANG^1,²,Jing BAI^1,²(),Jing-jin DU^1,²,Zheng-you WANG^1,²

^1.School of Information Science and Technology，Shijiazhuang Tiedao University，Shijiazhuang 050043，China
^2.Hebei Provincial Key Laboratory of Electromagnetic Environmental Effects and Information Processing，Shijiazhuang 050043，China

Received:2023-08-19 Online:2025-07-01 Published:2025-09-12
Contact: Jing BAI E-mail:zhuangshn@stdu.edu.cn;jingbai01@126.com

摘要/Abstract

摘要：

针对视频行人重识别通常存在行人姿势多变、特征融合损失和遮挡干扰等问题，本文提出了三维卷积与自注意力机制融合的网络模型。该模型首先通过三维卷积获取相邻帧之间的短期运动特征和局部细节特征；其次提出多分支结构的跨尺度融合模块，融合时间和语义维度的多阶段特征；最后利用自注意力机制模块捕获长时信息。本文所提网络在3个经典公开数据集上实验，并与其他方法进行对比，实验结果表明，该网络可有效利用视频长、短时信息，获得较优性能。

关键词: 计算机应用, 视频行人重识别, 时序信息, 三维卷积, 自注意力机制

Abstract:

To solve the problem of person posture change， feature fusion loss and blocking interference in video-based person re-identification，a fusion network model of three-dimensional convolution and self-attention mechanism is proposed in this paper. Firstly， short-term motion features and local detail features between adjacent frames are obtained through three-dimensional convolution. Secondly， a cross scale fusion module with multi-branch structure is proposed， multi-stage features of time and semantic dimensions are fused. Fnally， long-term features are captured by a self-attention mechanism. Compared with other methods， this paper proposed network was tested on three commonly used public datasets， the experiments results demonstrate that the proposed network can efficiently utilized the long-term and short-term information of videos so that advanced performances can be obtained.

Key words: computer application, video-based person re-identification, temporal information, three-dimensional convolution, self-attention mechanism

中图分类号:

TP391.4

庄珊娜,王君帅,白晶,杜京瑾,王正友. 基于三维卷积与自注意力机制的视频行人重识别[J]. 吉林大学学报(工学版), 2025, 55(7): 2409-2417.

Shan-na ZHUANG,Jun-shuai WANG,Jing BAI,Jing-jin DU,Zheng-you WANG. Video-based person re-identification based on three-dimensional convolution and self-attention mechanism[J]. Journal of Jilin University(Engineering and Technology Edition), 2025, 55(7): 2409-2417.

图/表 13

图1

图2

图3

图4

图5

表1

表2

表3

表4

表5

表6

表7

表8

参考文献 27

[1]	于鹏, 朴燕. 基于多尺度特征的行人重识别属性提取新方法[J]. 吉林大学学报: 工学版, 2023, 53(4): 1155-1162.
	Yu Peng, Yan Piao. New method for extracting person re-identification attributes based on multi-scale features[J]. Journal of Jilin University (Engineering and Technology Edition), 2023, 53(4): 1155-1162.
[2]	Isobe T, Han J, Zhuz F, et al. Intra-clip aggregation for video person re-identification[C]∥Proceeding of the IEEE International Conference on Image Processing (ICIP). Piscatanay, NJ: IEEE, 2020: 2336-2340.
[3]	Li X, Zhou W, Zhou Y, et al. Relation-guided spatialattention and temporal refinement for video-based person re-identification[C]∥Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11434-11441.
[4]	Hou R, Chang H, Ma B, et al. Temporal complementary learning for video person re-identification[C]∥The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 388-405.
[5]	Tai K S, Socher R, Manning C D. Improved semantic representations from tree-structured long short-term memory networks[C]∥Proceeding of the 53rd Annual Meeting of the Association for Computation Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 2015: 1556-1566.
[6]	Carreira J, Zisserman A. Quo vadis, action recogniti-on? A new model and the kinetics dataset[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii,USA,2017: 6299-6308.
[7]	Li J, Zhang S, Huang T. Multi-scale 3d convolution network for video based person re-identification[C]∥Proceedings of the AAAI Conference on Artificial Int-elligence, Hawaii,USA, 2019, 33(1): 8618-8625.
[8]	Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, Los Angeles,USA, 2017: 30-38.
[9]	侯春萍, 杨庆元, 黄美艳, 等. 基于语义耦合和身份一致性的跨模态行人重识别方法[J]. 吉林大学学报:工学版, 2022, 52(12): 2954-2963.
	Hou Chun-ping, Yang Qing-yuan, Huang Mei-yan, et al. Cross⁃modality person re⁃identification based on semantic coupling and identity⁃consistence constraint[J]. Journal of Jilin University (Engineering and Technology Edition), 2022, 52(12): 2954-2963.
[10]	Zheng L, Bie Z, Sun Y, et al. Mars: a video benchmark for large-scale person re-identification[C]∥The 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 868-884.
[11]	Wu Y, Lin Y, Dong X, et al. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 5177-5186.
[12]	Wang T, Gong S, Zhu X, et al. Person re-identification by video ranking[C]∥The 13th European Conference on Computer Vision, Zurich, Switzerland, 2014: 688-703.
[13]	Zhong Z, Zheng L, Kang G, et al. Random erasing data augmentation[C]∥Proceedings of the AAAI Conference on Artificial Intelligence, New York,USA, 2020, 34(7): 13001-13008.
[14]	Kingma D P, Ba J. Adam: a method for stochastic optimization[J/OL].[2014-03-22]. .
[15]	Chen Z, Zhou Z, Huang J, et al. Frame-guided regi-on-aligned representation for video person re-identification[C]∥Proceedings of the AAAI Conference on Artificial Intelligence, New York,USA, 2020: 10591-10598.
[16]	Hou R, Ma B, Chang H, et al. Vrstc: occlusion-free video person re-identification[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Los Angeles,USA, 2019: 7183-7192.
[17]	Yang J, Zheng W S, Yang Q, et al. Spatial-temporal graph convolutional network for video-based person re-identification[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 3289-3299.
[18]	Zhang Z, Lan C, Zeng W, et al. Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle,USA, 2020: 10407-10416.
[19]	Chen G, Rao Y, Lu J,et al.Temporal coherence or temporal motion:Which is more critical for video-based person re-identification?[C]∥The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 660-676.
[20]	Liu X, Zhang P, Yu C,et al.Watching you: global-guided reciprocal learning for video-based person re-identification[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Shenzhen, China, 2021: 13334-13343.
[21]	Hou R, Chang H, Ma B, et al. Learning efficient spatial-temporal representation for video person re-identification[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 21-24.
[22]	Jiang X, Qiao Y, Yan J,et al. SSN3D: self-separated network to align parts for 3D convolution in video person re-identification[C]∥Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2021, 35(2): 1691-1699.
[23]	Bai S, Ma B, Chang H, et al. Salient-to-broad transition for video person re-identification[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans,USA, 2022: 7339-7348.
[24]	Ning J, Li F, Liu R, et al. Temporal extension topology learning for video-based person reidentificat-ion[C]∥Proceedings of the Asian Conference on Co-mputer Vision, Macao, China, 2022: 207-219.
[25]	Pan H, Chen Y, He Z. Multi-granularity graph pooling for video-based person re-identification[J]. Neural Networks, 2023, 160: 22-33.
[26]	Yang F, Li W, Liang B, et al. Spatiotemporal interaction transformer network for video-based person re-identification in internet of things[J]. IEEE Internet of Things Journal, 2023,10(14):12537-12547.
[27]	Wang K, Ding C, Pang J, et al. Context sensing att-ention network for video-based person re-identification[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2023, 19(4): 1-20.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

方法	插入的层级	MARS
方法	插入的层级	mAP	Rank-1
基线	-	81.2	86.6
	第一层	81.6	86.7
	第二层	81.8	86.8
	第三层	81.9	86.9
	第四层	81.7	81.7
	第二层+第三层	82.2	87.1

方法	模块数量	替换残差块的位置	MARS
方法	模块数量	替换残差块的位置	mAP	Rank-1
基线	1	Conv3_0	81.6	86.7
		Conv3_1	81.4	86.7
		Conv3_2	82.0	86.8
		Conv3_3	81.5	86.6
	2	Conv3_0， Conv3_2	82.2	86.9
	3	Conv3_0， Conv3_2， Conv3_3	82.0	86.8

方法	模块数量	替换残差块的位置	MARS
方法	模块数量	替换残差块的位置	mAP	Rank-1
基线	1	Conv4_0	81.9	86.9
		Conv4_1	81.8	86.8
		Conv4_2	82.3	87.0
		Conv4_3	81.7	86.9
		Conv4_4	82.5	87.0
		Conv4_5	81.7	86.8
	2	Conv4_2， Conv4_4	82.7	87.1
	3	Conv4_0， Conv4_2， Conv4_4	83.1	87.2
	4	Conv4_0， Conv4_2， Conv4_4， Conv4_5	83.0	86.8

方法	插入层级	替换残差块的位置	MARS
方法	插入层级	替换残差块的位置	mAP	Rank-1
基线	第二层	Conv3_0， Conv3_2	82.2	86.9
	第三层	Conv4_0， Conv4_2， Conv4_4	83.1	87.2
	第二层第三层	Conv3_0， Conv3_2 Conv4_0， Conv4_2， Conv4_4	83.8	87.5

方法	MARS		iLIDS-VID
方法	mAP	Rank-1	Rank-1	Rank-5
基线	81.2	86.6	84.0	95.0
基线+3D卷积	83.8	87.5	87.8	96.2
基线+自注意力	84.0	87.8	88.5	96.8
基线+3D卷积+自注意力	84.6	88.8	90.4	97.3

基于三维卷积与自注意力机制的视频行人重识别

Video-based person re-identification based on three-dimensional convolution and self-attention mechanism

RICH HTML

PDF (PC)

摘要/Abstract

引用本文

使用本文

图/表 13

参考文献 27

相关文章 15

Metrics

本文评价

推荐阅读 0

分支1		分支2		MARS
高	宽	高	宽	mAP/%	Rank-1/%
256	128	-	-	84.8	88.5
128	64	-	-	83.4	87.0
64	32	-	-	67.9	77.6
256	128	128	64	85.2	89.5
256	128	64	32	84.0	88.0
128	64	64	32	78.6	84.1

方法	MARS
方法	mAP	Rank-1
分支1+分支2	85.2	89.5
分支1+分支2+CSFM	85.5	89.7
分支1+分支2+SAM	85.6	89.9
分支1+分支2+MSFM	85.8	90.1

方法	来源	MARS		DukeMTMC-VideoReID		iLIDS-VID
方法	来源	mAP	Rank-1	mAP	Rank-1	Rank-1	Rank-5
M3D^［7］	IEEE2020	79.5	88.6	93.7	95.5	86.7	-
FGRA^［15］	AAAI2020	81.2	87.3	-	-	88.0	96.7
VRSTC^［16］	CVPR2020	82.3	88.5	93.5	95.0	83.4	-
STGCN^［17］	CVPR2020	83.7	89.9	95.7	97.2	-	-
MG-RAFA^［18］	CVPR2020	85.9	88.8	-	-	88.6	98.0
AFA^［19］	ECCV2020	82.9	90.2	95.4	97.2	88.5	96.8
TCLNet^［4］	ECCV2020	85.1	89.8	96.2	96.9	86.6	-
GRL^［20］	CVPR2021	84.8	91.0	93.8	95.0	90.4	98.3
BiCnet-TKS^［21］	CVPR2021	86.0	90.2	96.1	96.3	-	-
SSN3D^［22］	AAAI2021	86.2	90.1	96.3	96.8	88.9	97.3
SINet^［23］	CVPR2022	86.2	90.0	-	-	92.5	-
TE-AGC^［24］	ACCV2022	85.8	90.7	96.3	97.2	-	-
GPNet^［25］	Elsevier2023	86.0	90.5	96.3	97.3	88.5	98.5
SITN^［26］	IEEE2023	85.1	89.2	96.0	97.2	94.0	100
CSA-Net^［27］	ACM2023	84.5	90.4	96.7	97.7	90.0	98.3
本文	-	86.2	90.7	96.2	97.0	92.7	98.9

[1]	车翔玖,李良. 融合全局与局部细粒度特征的图相似度度量算法[J]. 吉林大学学报(工学版), 2025, 55(7): 2365-2371.
[2]	李文辉,杨晨. 基于对比学习文本感知的小样本遥感图像分类[J]. 吉林大学学报(工学版), 2025, 55(7): 2393-2401.
[3]	赵宏伟,周伟民. 基于数据增强的半监督单目深度估计框架[J]. 吉林大学学报(工学版), 2025, 55(6): 2082-2088.
[4]	陈海鹏,张世博,吕颖达. 多尺度感知与边界引导的图像篡改检测方法[J]. 吉林大学学报(工学版), 2025, 55(6): 2114-2121.
[5]	周丰丰,郭喆,范雨思. 面向不平衡多组学癌症数据的特征表征算法[J]. 吉林大学学报(工学版), 2025, 55(6): 2089-2096.
[6]	王健,贾晨威. 面向智能网联车辆的轨迹预测模型[J]. 吉林大学学报(工学版), 2025, 55(6): 1963-1972.
[7]	车翔玖,孙雨鹏. 基于相似度随机游走聚合的图节点分类算法[J]. 吉林大学学报(工学版), 2025, 55(6): 2069-2075.
[8]	刘萍萍,商文理,解小宇,杨晓康. 基于细粒度分析的不均衡图像分类算法[J]. 吉林大学学报(工学版), 2025, 55(6): 2122-2130.
[9]	侯越,郭劲松,林伟,张迪,武月,张鑫. 分割可跨越车道分界线的多视角视频车速提取方法[J]. 吉林大学学报(工学版), 2025, 55(5): 1692-1704.
[10]	赵宏伟,周明珠,刘萍萍,周求湛. 基于置信学习和协同训练的医学图像分割方法[J]. 吉林大学学报(工学版), 2025, 55(5): 1675-1681.
[11]	申自浩,高永生,王辉,刘沛骞,刘琨. 面向车联网隐私保护的深度确定性策略梯度缓存方法[J]. 吉林大学学报(工学版), 2025, 55(5): 1638-1647.
[12]	王友卫,刘奥,凤丽洲. 基于知识蒸馏和评论时间的文本情感分类新方法[J]. 吉林大学学报(工学版), 2025, 55(5): 1664-1674.
[13]	王军,司昌馥,王凯鹏,付强. 融合集成学习技术和PSO-GA算法的特征提取技术的入侵检测方法[J]. 吉林大学学报(工学版), 2025, 55(4): 1396-1405.
[14]	徐涛,孔帅迪,刘才华,李时. 异构机密计算综述[J]. 吉林大学学报(工学版), 2025, 55(3): 755-770.
[15]	赵孟雪,车翔玖,徐欢,刘全乐. 基于先验知识优化的医学图像候选区域生成方法[J]. 吉林大学学报(工学版), 2025, 55(2): 722-730.