Journal of Jilin University (Engineering and Technology Edition) ›› 2025, Vol. 55 ›› Issue (10): 3283-3295. doi: 10.13229/j.cnki.jdxbgxb.20240086

• Computer Science and Technology •

Human pose estimation based on graph structure guidance and location information enhancement

Xin GUAN, Zi-jian ZHOU, Qiang LI

  1. School of Microelectronics, Tianjin University, Tianjin 300072, China
  • Received: 2024-01-23 Online: 2025-10-01 Published: 2026-02-03
  • Contact: Qiang LI E-mail: guanxin@tju.edu.cn; liqiang@tju.edu.cn
  • About the author: GUAN Xin (1977-), female, associate professor, Ph.D. Research interests: intelligent information processing. E-mail: guanxin@tju.edu.cn
  • Supported by: National Natural Science Foundation of China (62071323); Open Project of the State Key Laboratory of Ultrasound in Medicine and Engineering (2022KFKT004); Natural Science Foundation of Tianjin (23JCZDJC00020)


Abstract:

Human limbs have a high degree of freedom and often form complex poses in which key points are easily occluded; locating occluded key points is one of the main difficulties in human pose estimation. To this end, this paper proposes a human pose estimation method based on graph structure guidance and enhanced key point location information. First, a location information enhancement module (LIEM) is incorporated into HRNet to improve the representation of the spatial location of visible key points. Then, a visual graph neural module (VGNM) is introduced in a parallel branch of the backbone network to guide the extraction of features relevant to human key points and to mine local and global topological relationships between key points in pixel coordinate space, from which the locations of occluded key points can be inferred. Finally, a heatmap aggregation unit (HAU) combined with a semantic graph convolutional network (SGCN) updates the affinity weights between key points in semantic space, representing their topological dependencies under the constraints of the skeleton structure and further refining the estimates of occluded key points. The proposed model achieves an average precision (AP) of 78.1% on the COCO2017 test set and accurately estimates key points that are easily occluded in complex poses.

Key words: computer vision, human pose estimation, key points, graph convolution

CLC number: 

  • TP391.4
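The abstract describes a three-stage design: an HRNet backbone whose features pass through a location information enhancement module (LIEM) and a parallel visual graph neural module (VGNM), a heatmap head, and a graph-based refinement stage (HAU with SGCN). The page does not include the authors' code, so the following is only a wiring sketch with placeholder sub-modules; the class name, stand-in layers, and shapes are assumptions for illustration (a real HRNet emits quarter-resolution heatmaps, e.g. 64×48 for a 256×192 input).

```python
import torch
import torch.nn as nn

class PoseSketch(nn.Module):
    """Wiring sketch of the described pipeline; every sub-module is a stand-in."""
    def __init__(self, num_joints=17, channels=32):
        super().__init__()
        self.backbone = nn.Conv2d(3, channels, 3, padding=1)  # stand-in for HRNet
        self.liem = nn.Identity()  # location information enhancement module (Fig.2)
        self.vgnm = nn.Identity()  # visual graph neural module, parallel branch (Fig.3)
        self.head = nn.Conv2d(channels, num_joints, 1)

    def forward(self, x):
        feat = self.liem(self.backbone(x))  # sharpen visible-joint location cues
        feat = feat + self.vgnm(feat)       # add pixel-space topology features
        heatmaps = self.head(feat)          # one heatmap per key point
        # The described model then refines occluded joints with HAU + SGCN
        # (Figs. 4 and 6); sketches of those stages follow their captions.
        return heatmaps

print(PoseSketch()(torch.randn(1, 3, 256, 192)).shape)  # torch.Size([1, 17, 256, 192])
```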

Fig.1

Overall architecture of the pose estimation network

Fig.2

Structure of the location information enhancement module (LIEM)
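LIEM's internal layout is shown only in Fig.2, but the reference list includes coordinate attention (Hou et al. [20]), which strengthens spatial location information by pooling features along each axis separately and turning the pooled vectors into per-row and per-column attention. A simplified sketch under the assumption that LIEM follows this mechanism (the original block also uses batch normalization and h-swish):

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate-attention-style location encoding (after Hou et al. [20])."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (B,C,H,1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (B,C,1,W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        _, _, h, w = x.shape
        xh = self.pool_h(x)                                     # (B,C,H,1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)                 # (B,C,W,1)
        y = self.act(self.conv1(torch.cat([xh, xw], dim=2)))    # joint H+W encoding
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                     # row attention (B,C,H,1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2))) # column attention (B,C,1,W)
        return x * ah * aw                                      # position-aware reweighting

print(CoordAttention(32)(torch.randn(1, 32, 64, 48)).shape)  # torch.Size([1, 32, 64, 48])
```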

Fig.3

Structure of the visual graph neural module (VGNM)
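Fig.3 names VGNM only; its cited basis is Vision GNN (Han et al. [21]), which treats every feature-map position as a graph node, links each node to its k nearest neighbours in feature space, and aggregates max-relative differences, which is how local and global topology can be mined in pixel coordinate space. A minimal sketch under that assumption, not the authors' exact module:

```python
import torch
import torch.nn as nn

class GrapherSketch(nn.Module):
    """ViG-style graph layer (after Han et al. [21]): k-NN graph over pixels
    plus max-relative aggregation, with a residual 1x1 update."""
    def __init__(self, channels, k=9):
        super().__init__()
        self.k = k
        self.fc = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        nodes = x.flatten(2).transpose(1, 2)             # (B, N, C), N = H*W
        dist = torch.cdist(nodes, nodes)                 # pairwise feature distances
        idx = dist.topk(self.k, largest=False).indices   # (B, N, k) nearest neighbours
        nbrs = torch.gather(
            nodes.unsqueeze(1).expand(b, nodes.size(1), -1, -1),
            2, idx.unsqueeze(-1).expand(-1, -1, -1, c))  # (B, N, k, C)
        rel = (nbrs - nodes.unsqueeze(2)).amax(dim=2)    # max-relative aggregation
        out = torch.cat([nodes, rel], dim=-1)            # (B, N, 2C)
        out = out.transpose(1, 2).reshape(b, 2 * c, h, w)
        return x + self.fc(out)                          # residual node update

print(GrapherSketch(32)(torch.randn(1, 32, 16, 12)).shape)  # torch.Size([1, 32, 16, 12])
```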

Fig.4

Structure of the heatmap aggregation unit (HAU)
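The abstract states that HAU turns key-point heatmaps into per-joint inputs for the SGCN but does not describe its structure; the function below is therefore a hypothetical stand-in showing one common recipe: softmax-normalise each joint's heatmap and use it both as spatial weights for pooling a per-joint feature vector and for reading out a soft-argmax coordinate.

```python
import torch

def aggregate_keypoint_nodes(heatmaps, features):
    """Hypothetical heatmap aggregation: per-joint feature + soft-argmax coordinate."""
    b, j, h, w = heatmaps.shape
    attn = heatmaps.flatten(2).softmax(dim=-1)              # (B, J, H*W) spatial weights
    feats = attn @ features.flatten(2).transpose(1, 2)      # (B, J, C) pooled features
    ys = torch.linspace(0, 1, h)
    xs = torch.linspace(0, 1, w)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([gx, gy], dim=-1).reshape(-1, 2)   # (H*W, 2) normalised grid
    xy = attn @ coords                                      # (B, J, 2) soft-argmax
    return torch.cat([feats, xy], dim=-1)                   # (B, J, C+2) node features

nodes = aggregate_keypoint_nodes(torch.randn(2, 17, 64, 48), torch.randn(2, 32, 64, 48))
print(nodes.shape)  # torch.Size([2, 17, 34])
```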

Fig.5

Asymmetric convolution structure
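The asymmetric convolution of Fig.5 presumably follows ACNet (Ding et al. [23]), which runs a square d×d kernel in parallel with 1×d and d×1 kernels and sums the three outputs, strengthening the kernel's horizontal and vertical skeleton. A bare-bones sketch (ACNet additionally attaches batch normalization to each branch and fuses the branches into a single kernel at inference time):

```python
import torch
import torch.nn as nn

class AsymmetricConv(nn.Module):
    """ACNet-style block (after Ding et al. [23]): square + horizontal + vertical."""
    def __init__(self, in_ch, out_ch, d=3):
        super().__init__()
        p = d // 2
        self.square = nn.Conv2d(in_ch, out_ch, (d, d), padding=(p, p))
        self.hor = nn.Conv2d(in_ch, out_ch, (1, d), padding=(0, p))
        self.ver = nn.Conv2d(in_ch, out_ch, (d, 1), padding=(p, 0))

    def forward(self, x):
        # All three branches preserve the spatial size, so their responses add up.
        return self.square(x) + self.hor(x) + self.ver(x)

print(AsymmetricConv(32, 32)(torch.randn(1, 32, 64, 48)).shape)  # torch.Size([1, 32, 64, 48])
```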

Fig.6

Structure of the semantic graph convolutional network (SGCN)
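The SGCN updates affinity weights between key points in semantic space, in the spirit of semantic graph convolution (Zhao et al. [24]): the skeleton adjacency fixes which joints may exchange information, while a learnable weight per edge decides how strongly, via a masked softmax. A simplified single layer with an assumed COCO-style 17-joint skeleton (the node dimension 34 matches the C+2 features of the HAU sketch above):

```python
import torch
import torch.nn as nn

class SemGraphConv(nn.Module):
    """Simplified semantic graph convolution (after Zhao et al. [24])."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.register_buffer("adj", adjacency.float())       # (J, J) 0/1 skeleton
        self.edge_logits = nn.Parameter(torch.zeros_like(self.adj))
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                    # x: (B, J, in_dim)
        logits = self.edge_logits.masked_fill(self.adj == 0, float("-inf"))
        affinity = torch.softmax(logits, dim=-1)             # learned edge affinities
        return affinity @ self.proj(x)                       # (B, J, out_dim)

# Assumed COCO key-point skeleton: 17 joints with self-loops.
J = 17
adj = torch.eye(J)
for a, b in [(0, 1), (0, 2), (1, 3), (2, 4), (5, 6), (5, 7), (7, 9), (6, 8),
             (8, 10), (5, 11), (6, 12), (11, 12), (11, 13), (13, 15), (12, 14),
             (14, 16)]:
    adj[a, b] = adj[b, a] = 1.0
print(SemGraphConv(34, 64, adj)(torch.randn(2, J, 34)).shape)  # torch.Size([2, 17, 64])
```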

Table 1

Results of different network configurations on the MPII validation set at PCKh threshold 0.5 (%)

| Configuration | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean |
|---|---|---|---|---|---|---|---|---|
| Baseline | 97.1 | 95.9 | 90.3 | 86.4 | 89.1 | 87.1 | 83.3 | 90.3 |
| Baseline+LIEM | 97.3 | 96.0 | 90.3 | 86.4 | 89.3 | 87.1 | 83.2 | 90.3 |
| Baseline+LIEM+VGNM | 97.4 | 96.1 | 90.7 | 86.5 | 89.4 | 87.4 | 83.4 | 90.4 |
| Baseline+LIEM+VGNM+SGCN | 97.6 | 96.1 | 90.8 | 86.5 | 89.4 | 87.4 | 83.5 | 90.5 |
| Baseline+LIEM+VGNM+HAU&SGCN | 97.6 | 96.3 | 90.8 | 86.7 | 89.5 | 87.4 | 83.6 | 90.6 |

Table 2

Results of different network configurations on the COCO2017 validation set

| Configuration | Params/M | FLOPs/G | AP/% | AP0.5/% | AP0.75/% | APM/% | APL/% | AR/% |
|---|---|---|---|---|---|---|---|---|
| Baseline | 28.5 | 7.1 | 74.4 | 90.5 | 81.9 | 70.8 | 81.0 | 79.8 |
| Baseline+LIEM | 28.7 | 7.2 | 74.9 | 91.1 | 82.0 | 71.6 | 81.4 | 79.9 |
| Baseline+LIEM+VGNM | 29.4 | 7.5 | 75.6 | 91.8 | 82.3 | 72.2 | 81.8 | 80.2 |
| Baseline+LIEM+VGNM+SGCN | 30.1 | 7.8 | 76.2 | 92.5 | 82.7 | 72.4 | 82.0 | 80.6 |
| Baseline+LIEM+VGNM+HAU&SGCN | 30.3 | 7.9 | 76.5 | 92.7 | 82.8 | 72.5 | 82.3 | 80.8 |

Table 3

Comparison with other pose estimation networks on the COCO2017 validation set

| Method | Input size | Params/M | FLOPs/G | AP/% | AP0.5/% | AP0.75/% | APM/% | APL/% | AR/% |
|---|---|---|---|---|---|---|---|---|---|
| 8-stage Hourglass[6] | 256×192 | 25.1 | 14.3 | 66.9 | | | | | |
| CPN50[7] | 256×192 | 27.0 | 6.2 | 68.6 | | | | | |
| CPN50+OHKM[7] | 256×192 | 27.0 | 6.2 | 69.4 | | | | | |
| Simple Baseline152[8] | 256×192 | 68.6 | 15.7 | 72.0 | 89.3 | 79.8 | 68.7 | 78.9 | 77.8 |
| HRNet-W32[9] | 256×192 | 28.5 | 7.1 | 74.4 | 90.5 | 81.9 | 70.8 | 81.0 | 79.8 |
| HRNet-W48[9] | 256×192 | 63.6 | 14.6 | 75.1 | 90.6 | 82.2 | 71.5 | 81.8 | 80.4 |
| TokenPose-L/D24[12] | 256×192 | 27.5 | 11.0 | 75.8 | 90.3 | 82.5 | 72.3 | 82.7 | 80.9 |
| HRFormer-B[13] | 256×192 | 43.2 | 12.2 | 75.6 | 90.8 | 82.8 | 71.7 | 82.6 | 80.8 |
| RAM-GPRNet(W32)[30] | 256×192 | 31.4 | 7.7 | 76.0 | | | | | |
| RAM-GPRNet(W48)[30] | 256×192 | 70.0 | 15.8 | 76.5 | | | | | |
| EMF-HRNet[31] | 256×192 | 28.8 | 9.5 | 75.6 | 90.4 | 82.6 | 72.0 | 82.4 | 80.8 |
| AMHRNet(W32)[32] | 256×192 | 36.4 | | 76.1 | 91.0 | 82.7 | 71.5 | 82.9 | 81.2 |
| AMHRNet(W48)[32] | 256×192 | 71.8 | | 76.4 | 91.1 | 83.1 | 72.2 | 83.3 | 81.4 |
| SCC-Net[33] | 256×192 | 58.9 | 10.5 | 73.4 | 92.6 | 81.5 | 70.4 | 77.5 | 76.2 |
| Ours(W32) | 256×192 | 30.3 | 7.9 | 76.5 | 92.7 | 82.8 | 72.5 | 82.3 | 80.8 |
| Ours(W48) | 256×192 | 66.2 | 17.6 | 77.2 | 93.0 | 83.3 | 72.9 | 82.7 | 81.3 |
| CPN50[7] | 384×288 | | 13.9 | 70.6 | | | | | |
| CPN50+OHKM[7] | 384×288 | | 13.9 | 71.6 | | | | | |
| Simple Baseline152[8] | 384×288 | 68.6 | 35.6 | 74.3 | 89.6 | 81.1 | 70.5 | 79.7 | 79.7 |
| HRNet-W32[9] | 384×288 | 28.5 | 16.0 | 75.8 | 90.6 | 82.7 | 71.9 | 82.8 | 81.0 |
| HRNet-W48[9] | 384×288 | 63.6 | 32.9 | 76.3 | 90.8 | 82.9 | 72.3 | 83.4 | 81.2 |
| HRFormer-B[13] | 384×288 | 43.2 | 26.8 | 77.2 | 91.0 | 83.6 | 73.2 | 84.2 | 82.0 |
| RAM-GPRNet(W32)[30] | 384×288 | 31.4 | 17.2 | 77.3 | | | | | |
| RAM-GPRNet(W48)[30] | 384×288 | 70.0 | 35.6 | 77.7 | | | | | |
| EMF-HRNet[31] | 384×288 | 28.8 | | 76.5 | 90.7 | 83.1 | 72.7 | 83.6 | 81.5 |
| Ours(W32) | 384×288 | 30.3 | 18.6 | 78.0 | 93.1 | 83.5 | 73.1 | 82.9 | 81.4 |
| Ours(W48) | 384×288 | 66.2 | 37.4 | 78.4 | 93.3 | 83.6 | 73.4 | 83.7 | 81.7 |

Table 4

Comparison with other pose estimation models on the COCO2017 test set

| Method | Input size | Params/M | FLOPs/G | AP/% | AP0.5/% | AP0.75/% | APM/% | APL/% | AR/% |
|---|---|---|---|---|---|---|---|---|---|
| CPN50[7] | 384×288 | | | 72.6 | 86.1 | 69.7 | 78.3 | 64.1 | |
| Simple Baseline152[8] | 384×288 | 68.6 | 35.6 | 73.7 | 91.9 | 81.1 | 70.3 | 80.0 | 79.0 |
| HRNet(W32)[9] | 384×288 | 28.5 | 16.0 | 74.9 | 92.5 | 82.8 | 71.3 | 80.9 | 80.1 |
| HRNet(W48)[9] | 384×288 | 63.6 | 32.9 | 75.5 | 92.5 | 83.3 | 71.9 | 81.5 | 80.5 |
| TokenPose-L/D24[12] | 384×288 | 29.8 | 22.1 | 75.9 | 92.3 | 83.4 | 72.2 | 82.1 | 80.8 |
| HRFormer-B[13] | 384×288 | 43.2 | 26.8 | 76.2 | 92.7 | 83.8 | 72.5 | 82.3 | 81.2 |
| RAM-GPRNet(W32)[30] | 384×288 | 31.4 | 17.2 | 76.5 | | | | | |
| RAM-GPRNet(W48)[30] | 384×288 | 70.0 | 35.6 | 77.0 | | | | | |
| Ours(W32) | 384×288 | 30.3 | 18.6 | 77.6 | 92.9 | 83.4 | 72.8 | 82.5 | 81.2 |
| Ours(W48) | 384×288 | 66.2 | 37.4 | 78.1 | 93.0 | 83.6 | 73.2 | 83.1 | 81.5 |

Fig.7

Comparison of visualization results between the baseline model and the proposed method

Fig.8

Comparison of skeleton visualization results between the baseline model and the proposed method

[1] Eduardo RDS, Adams LS, Stoffel R A, et al. Monocular multi-person pose estimation: a survey[J]. Pattern Recognition, 2021, 118: No.108046.
[2] Tian Hao-yu, Ma Xin, Li Yi-bin. Abnormal gait recognition method based on skeleton information[J]. Journal of Jilin University (Engineering and Technology Edition), 2022, 52(4): 725-737.
[3] Lecun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[4] Toshev A, Szegedy C. DeepPose: Human pose estimation via deep neural networks[C]∥IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, 2014: 1653-1660.
[5] Tompson J, Jain A, Lecun Y, et al. Joint training of a convolutional network and a graphical model for human pose estimation[C]∥Neural Information Processing Systems, Montreal, Canada, 2014: 1799-1807.
[6] Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation[C]∥European Conference on Computer Vision, Amsterdam, Netherlands, 2016: 483-499.
[7] Chen Y L, Wang Z C, Peng Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7103-7112.
[8] Xiao B, Wu H P, Wei Y C. Simple baselines for human pose estimation and tracking[C]∥European Conference on Computer Vision, Munich, Germany, 2018: 472-487.
[9] Sun K, Xiao B, Liu D, et al. Deep high-resolution representation learning for human pose estimation[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 5686-5696.
[10] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]∥Neural Information Processing Systems (NeurIPS), Long Beach, USA, 2017: 5998-6008.
[11] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: transformers for image recognition at scale[C]∥International Conference on Learning Representations, Online, 2021.
[12] Li Y J, Zhang S K, Wang Z C, et al. TokenPose: learning keypoint tokens for human pose estimation[C]∥Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, Canada, 2021: 11293-11302.
[13] Yuan Y H, Fu R, Huang L, et al. HRFormer: high-resolution transformer for dense prediction[J]. Advances in Neural Information Processing Systems, 2021, 34: 7281-7293.
[14] Yang S, Quan Z B, Nie M, et al. TransPose: keypoint localization via transformer[C]∥Proceedings of the IEEE International Conference on Computer Vision, Montreal, Canada, 2021: 11782-11792.
[15] Li G H, Müller M, Thabet A, et al. DeepGCNs: Can GCNs Go As Deep As CNNs?[C]∥IEEE International Conference on Computer Vision, Seoul, South Korea, 2019: 9266-9275.
[16] Qiu L T, Zhang X Y, Li Y R, et al. Peeking into occluded joints: a novel framework for crowd pose estimation[C]∥European Conference on Computer Vision, Glasgow, UK, 2020: 488-504.
[17] Bin Y R, Chen Z M, Wei X S, et al. Structure-aware human pose estimation with graph convolutional networks[J]. Pattern Recognition, 2020, 106: No.107410.
[18] Wang J, Long X, Gao Y, et al. Graph-PCNN: Two stage human pose estimation with graph pose refinement[C]∥European Conference on Computer Vision, Glasgow, UK, 2020: 492-508.
[19] Banik S, García A M, Knoll A. 3D human pose regression using graph convolutional network[C]∥IEEE International Conference on Image Processing (ICIP), Anchorage, USA, 2021: 924-928.
[20] Hou Q B, Zhou D Q, Feng J S. Coordinate attention for efficient mobile network design[C]∥IEEE Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13708-13717.
[21] Han K, Wang Y H, Guo J Y, et al. Vision GNN: an image is worth graph of nodes[J]. Advances in Neural Information Processing Systems, 2022, 35: 8291-8303.
[22] Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017: 4700-4708.
[23] Ding X H, Guo Y C, Ding G G, et al. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks[C]∥International Conference on Computer Vision, Seoul, South Korea, 2019: 1911-1920.
[24] Zhao L, Peng X, Tian Y, et al. Semantic graph convolutional networks for 3D human pose regression[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3420-3430.
[25] Wang X L, Girshick R, Gupta A, et al. Non-local neural networks[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), Salt Lake City, USA, 2018: 7794-7803.
[26] Yang J W, Lu J S, Lee S, et al. Graph R-CNN for scene graph generation[C]∥European Conference on Computer Vision, Munich, Germany, 2018: 690-706.
[27] Veličković P, Cucurull G, Casanova A, et al. Graph attention networks[C]∥International Conference on Learning Representations, Vancouver, Canada, 2018.
[28] Andriluka M, Pishchulin L, Gehler P, et al. 2D human pose estimation: New benchmark and state of the art analysis[C]∥IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, 2014: 3686-3693.
[29] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: common objects in context[C]∥Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 2014: 740-755.
[30] Zhang K, He P, Yao P, et al. Learning enhanced resolution-wise features for human pose estimation[C]∥IEEE International Conference on Image Processing, Abu Dhabi, United Arab Emirates, 2020: 2256-2260.
[31] Wang R, Wu W Y, Wang X Y. Enhancing multi-scale information exchange and feature fusion for human pose estimation[J]. The Visual Computer, 2023, 39(10): 4751-4765.
[32] Tran T D, Vo X T, Nguyen D L, et al. High-resolution network with attention module for human pose estimation[C]∥Asian Control Conference, Jeju Island, South Korea, 2022: 459-464.
[33] Dong K W, Sun Y J, Cheng X Z, et al. Combining detailed appearance and multi-scale representation: A structure-context complementary network for human pose estimation[J]. Applied Intelligence, 2023, 53(7): 8097-8113.
[34] Soomro K, Zamir A R, Shah M. UCF101: a dataset of 101 human actions classes from videos in the wild[J/OL]. [2023-08-16].