Human pose estimation based on graph structure guidance and location information enhancement

doi:10.13229/j.cnki.jdxbgxb.20240086

Abstract

The high degree of freedom of human limbs often produces complex poses in which key points are prone to occlusion, and locating occluded key points is one of the main difficulties in human pose estimation. To address this, this paper proposes a method that combines graph structure guidance with enhanced key point location information. The method incorporates a location information enhancement module into HRNet to improve the representation of the spatial location information of visible key points. A visual graph neural module is integrated into the backbone network to extract features related to the key points and to exploit the local and global topological connectivity between key points in pixel coordinate space, from which the locations of occluded key points are inferred. Finally, a heatmap aggregation unit and a semantic graph convolutional network update the affinity weights between key points in the semantic space; these weights represent the topological dependencies between key points under the constraints of the skeleton structure and further refine the estimates of occluded key points. The proposed model achieves an average precision (AP) of 78.1% on the COCO2017 test set and accurately estimates key points that are prone to occlusion in complex poses.
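
The skeleton-constrained affinity update described in the abstract can be illustrated with a minimal NumPy sketch. It follows the general form of a semantic graph convolution [24]: learnable affinity logits are masked by the skeleton adjacency and normalized per node before features are aggregated. The 17-keypoint COCO edge list, the feature sizes, the random weights, and the function names are all assumptions made for illustration; the paper's actual layer, heatmap aggregation unit, and graph definition may differ.

```python
import numpy as np

# Commonly used COCO 17-keypoint skeleton, 0-indexed.
# Assumption: the paper's exact graph may differ.
COCO_EDGES = [
    (15, 13), (13, 11), (16, 14), (14, 12), (11, 12), (5, 11), (6, 12),
    (5, 6), (5, 7), (6, 8), (7, 9), (8, 10), (1, 2), (0, 1), (0, 2),
    (1, 3), (2, 4), (3, 5), (4, 6),
]


def skeleton_adjacency(num_nodes, edges):
    """Symmetric boolean adjacency matrix with self-loops."""
    adj = np.eye(num_nodes, dtype=bool)
    for i, j in edges:
        adj[i, j] = adj[j, i] = True
    return adj


def masked_affinity(logits, adj):
    """Row-wise softmax over affinity logits, restricted to skeleton neighbours."""
    masked = np.where(adj, logits, -np.inf)
    # Each row contains its self-loop, so the row maximum is always finite.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)  # non-neighbours become exactly 0
    return weights / weights.sum(axis=1, keepdims=True)


def sem_graph_conv(x, weight, logits, adj):
    """One layer: aggregate neighbour features with the affinity, project, ReLU."""
    affinity = masked_affinity(logits, adj)
    return np.maximum(affinity @ x @ weight, 0.0)


# Hypothetical sizes: 17 key points, 64 input and 32 output channels.
rng = np.random.default_rng(0)
adj = skeleton_adjacency(17, COCO_EDGES)
x = rng.normal(size=(17, 64))            # per-key-point feature vectors
weight = rng.normal(size=(64, 32)) * 0.1  # stands in for a learned projection
logits = rng.normal(size=(17, 17))        # stands in for learned affinity logits
affinity = masked_affinity(logits, adj)
out = sem_graph_conv(x, weight, logits, adj)
```

Because the mask zeroes every non-skeleton pair, an occluded key point in this sketch can only draw information from itself and its skeleton neighbours, which is one simple way to encode the skeleton constraint.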

Key words: computer vision, human pose estimation, key points, graph convolution

CLC Number: TP391.4

Xin GUAN, Zi-jian ZHOU, Qiang LI. Human pose estimation based on graph structure guidance and location information enhancement[J]. Journal of Jilin University(Engineering and Technology Edition), 2025, 55(10): 3283-3295.

Figures/Tables 12: Fig.1 to Fig.8, Table 1 to Table 4

References 34

[1]	Eduardo RDS, Adams LS, Stoffel R A, et al. Monocular multi-person pose estimation: a survey[J]. Pattern Recognition, 2021, 118: No.108046.
[2]	Tian Hao-yu, Ma Xin, Li Yi-bin. Abnormal gait recognition method based on skeleton information[J]. Journal of Jilin University(Engineering and Technology Edition), 2022, 52(4): 725-737.
[3]	LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[4]	Toshev A, Szegedy C. DeepPose: Human pose estimation via deep neural networks[C]∥IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, 2014: 1653-1660.
[5]	Tompson J, Jain A, LeCun Y, et al. Joint training of a convolutional network and a graphical model for human pose estimation[C]∥Neural Information Processing Systems, Montreal, Canada, 2014: 1799-1807.
[6]	Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation[C]∥European Conference on Computer Vision, Amsterdam, Netherlands, 2016: 483-499.
[7]	Chen Y L, Wang Z C, Peng Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7103-7112.
[8]	Xiao B, Wu H P, Wei Y C. Simple baselines for human pose estimation and tracking[C]∥European Conference on Computer Vision, Munich, Germany, 2018: 472-487.
[9]	Sun K, Xiao B, Liu D, et al. Deep high-resolution representation learning for human pose estimation[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 5693-5703.
[10]	Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]∥Neural Information Processing Systems(NeurIPS), Long Beach, USA, 2017: 5998-6008.
[11]	Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: transformers for image recognition at scale[C]∥International Conference on Learning Representations, Online, 2021.
[12]	Li Y J, Zhang S K, Wang Z C, et al. TokenPose: Learning keypoint tokens for human pose estimation[C]∥Proceedings of the IEEE International Conference on Computer Vision(ICCV), Montreal, Canada, 2021: 11293-11302.
[13]	Yuan Y H, Fu R, Huang L, et al. HRFormer: High-resolution transformer for dense prediction[J]. Advances in Neural Information Processing Systems, 2021, 34: 7281-7293.
[14]	Yang S, Quan Z B, Nie M, et al. TransPose: Keypoint localization via transformer[C]∥Proceedings of the IEEE International Conference on Computer Vision, Montreal, Canada, 2021: 11782-11792.
[15]	Li G H, Müller M, Thabet A, et al. DeepGCNs: Can GCNs Go As Deep As CNNs?[C]∥IEEE International Conference on Computer Vision, Seoul, South Korea, 2019: 9266-9275.
[16]	Qiu L T, Zhang X Y Y, Li Y R, et al. Peeking into occluded joints: A novel framework for crowd pose estimation[C]∥European Conference on Computer Vision, Glasgow, UK, 2020: 488-504.
[17]	Bin Y R, Chen Z M, Wei X S, et al. Structure-aware human pose estimation with graph convolutional networks[J]. Pattern Recognition, 2020, 106: No.107410.
[18]	Wang J, Long X, Gao Y, et al. Graph-PCNN: Two stage human pose estimation with graph pose refinement[C]∥European Conference on Computer Vision, Glasgow, UK, 2020: 492-508.
[19]	Banik S, García A M, Knoll A. 3D human pose regression using graph convolutional network[C]∥IEEE International Conference on Image Processing(ICIP), Anchorage, USA, 2021: 924-928.
[20]	Hou Q B, Zhou D Q, Feng J S. Coordinate attention for efficient mobile network design[C]∥IEEE Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 13708-13717.
[21]	Han K, Wang Y H, Guo J Y, et al. Vision GNN: An image is worth graph of nodes[J]. Advances in Neural Information Processing Systems, 2022, 35: 8291-8303.
[22]	Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), Honolulu, USA, 2017: 4700-4708.
[23]	Ding X H, Guo Y C, Ding G G, et al. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks[C]∥International Conference on Computer Vision, Seoul, South Korea, 2019: 1911-1920.
[24]	Zhao L, Peng X, Tian Y, et al. Semantic graph convolutional networks for 3D Human Pose Regression[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3420-3430.
[25]	Wang X L, Girshick R, Gupta A, et al. Non-local neural networks[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), Salt Lake City, USA, 2018: 7794-7803.
[26]	Yang J W, Lu J S, Lee S, et al. Graph R-CNN for scene graph generation[C]∥European Conference on Computer Vision, Munich, Germany, 2018: 690-706.
[27]	Veličković P, Cucurull G, Casanova A, et al. Graph attention networks[J]. Stat, 2017, 1050(20): No.10-48550.
[28]	Andriluka M, Pishchulin L, Gehler P, et al. 2D human pose estimation: New benchmark and state of the art analysis[C]∥IEEE Conference on Computer Vision and Pattern Recognition, Columbus, USA, 2014: 3686-3693.
[29]	Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context[C]∥Proceedings of the European Conference on Computer Vision(ECCV), Zurich, Switzerland, 2014: 740-755.
[30]	Zhang K, He P, Yao P, et al. Learning enhanced resolution-wise features for human pose estimation[C]∥IEEE International Conference on Image Processing, Abu Dhabi, United Arab Emirates, 2020: 2256-2260.
[31]	Wang R, Wu W Y, Wang X Y. Enhancing multi-scale information exchange and feature fusion for human pose estimation[J]. The Visual Computer, 2023, 39(10): 4751-4765.
[32]	Tran T D, Vo X T, Nguyen D L, et al. High-resolution network with attention module for human pose estimation[C]∥Asian Control Conference, Jeju Island, South Korea, 2022: 459-464.
[33]	Dong K W, Sun Y J, Cheng X Z, et al. Combining detailed appearance and multi-scale representation: A structure-context complementary network for human pose estimation[J]. Applied Intelligence, 2023, 53(7): 8097-8113.
[34]	Soomro K, Zamir A R, Shah M. UCF101: a dataset of 101 human actions classes from videos in the wild[J/OL].[2023-08-16]. .


Baseline	LIEM	VGNM	SGCN	HAU&SGCN	Head	Shoulder	Elbow	Wrist	Hip	Knee	Ankle	Mean
√					97.1	95.9	90.3	86.4	89.1	87.1	83.3	90.3
√	√				97.3	96.0	90.3	86.4	89.3	87.1	83.2	90.3
√	√	√			97.4	96.1	90.7	86.5	89.4	87.4	83.4	90.4
√	√	√	√		97.6	96.1	90.8	86.5	89.4	87.4	83.5	90.5
√	√	√		√	97.6	96.3	90.8	86.7	89.5	87.4	83.6	90.6

Baseline	LIEM	VGNM	SGCN	HAU&SGCN	Params/M	FLOPs/G	AP/%	AP^0.5/%	AP^0.75/%	AP^M/%	AP^L/%	AR/%
√					28.5	7.1	74.4	90.5	81.9	70.8	81.0	79.8
√	√				28.7	7.2	74.9	91.1	82.0	71.6	81.4	79.9
√	√	√			29.4	7.5	75.6	91.8	82.3	72.2	81.8	80.2
√	√	√	√		30.1	7.8	76.2	92.5	82.7	72.4	82.0	80.6
√	√	√		√	30.3	7.9	76.5	92.7	82.8	72.5	82.3	80.8
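
The contribution of each module can be read off the AP column of the ablation table above by subtracting consecutive rows. The short sketch below does that arithmetic; the AP values are copied from the table, while the configuration labels and the helper name `ap_gain` are shorthand introduced here for illustration. The last row does not tick the SGCN column, so its gain is measured against the LIEM+VGNM row, on the assumption that HAU&SGCN replaces the plain SGCN rather than stacking on it.

```python
# AP (%) per configuration, copied from the AP column of the ablation table.
# The configuration labels are shorthand introduced for this example.
ABLATION_AP = {
    "Baseline": 74.4,
    "+LIEM": 74.9,
    "+LIEM+VGNM": 75.6,
    "+LIEM+VGNM+SGCN": 76.2,
    "+LIEM+VGNM+HAU&SGCN": 76.5,
}


def ap_gain(table, name, reference):
    """AP difference in percentage points, rounded to the table's precision."""
    return round(table[name] - table[reference], 1)


gains = {
    "LIEM": ap_gain(ABLATION_AP, "+LIEM", "Baseline"),
    "VGNM": ap_gain(ABLATION_AP, "+LIEM+VGNM", "+LIEM"),
    "SGCN": ap_gain(ABLATION_AP, "+LIEM+VGNM+SGCN", "+LIEM+VGNM"),
    "HAU&SGCN": ap_gain(ABLATION_AP, "+LIEM+VGNM+HAU&SGCN", "+LIEM+VGNM"),
}
total_gain = ap_gain(ABLATION_AP, "+LIEM+VGNM+HAU&SGCN", "Baseline")

for module, gain in gains.items():
    print(f"{module}: +{gain} AP")
print(f"Full model vs. baseline: +{total_gain} AP")
```

Under these assumptions the gains are +0.5 (LIEM), +0.7 (VGNM), +0.6 (SGCN) and +0.9 (HAU&SGCN), for a total of +2.1 AP over the baseline.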

Method	Input size	Params/M	FLOPs/G	AP/%	AP^0.5/%	AP^0.75/%	AP^M/%	AP^L/%	AR/%
8-stage Hourglass [6]	256×192	25.1	14.3	66.9	—	—	—	—	—
CPN50 [7]	256×192	27.0	6.2	68.6	—	—	—	—	—
CPN50+OHKM [7]	256×192	27.0	6.2	69.4	—	—	—	—	—
Simple Baseline152 [8]	256×192	68.6	15.7	72.0	89.3	79.8	68.7	78.9	77.8
HRNetW32 [9]	256×192	28.5	7.1	74.4	90.5	81.9	70.8	81.0	79.8
HRNetW48 [9]	256×192	63.6	14.6	75.1	90.6	82.2	71.5	81.8	80.4
TokenPose-L/D24 [12]	256×192	27.5	11.0	75.8	90.3	82.5	72.3	82.7	80.9
HRFormer-B [13]	256×192	43.2	12.2	75.6	90.8	82.8	71.7	82.6	80.8
RAM-GPRNet (W32) [30]	256×192	31.4	7.7	76.0	—	—	—	—	—
RAM-GPRNet (W48) [30]	256×192	70.0	15.8	76.5	—	—	—	—	—
EMF-HRNet [31]	256×192	28.8	9.5	75.6	90.4	82.6	72.0	82.4	80.8
AMHRNet (W32) [32]	256×192	36.4	—	76.1	91.0	82.7	71.5	82.9	81.2
AMHRNet (W48) [32]	256×192	71.8	—	76.4	91.1	83.1	72.2	83.3	81.4
SCC-Net [33]	256×192	58.9	10.5	73.4	92.6	81.5	70.4	77.5	76.2
Ours (W32)	256×192	30.3	7.9	76.5	92.7	82.8	72.5	82.3	80.8
Ours (W48)	256×192	66.2	17.6	77.2	93.0	83.3	72.9	82.7	81.3
CPN50 [7]	384×288	—	13.9	70.6	—	—	—	—	—
CPN50+OHKM [7]	384×288	—	13.9	71.6	—	—	—	—	—
Simple Baseline152 [8]	384×288	68.6	35.6	74.3	89.6	81.1	70.5	79.7	79.7
HRNetW32 [9]	384×288	28.5	16.0	75.8	90.6	82.7	71.9	82.8	81.0
HRNetW48 [9]	384×288	63.6	32.9	76.3	90.8	82.9	72.3	83.4	81.2
HRFormer-B [13]	384×288	43.2	26.8	77.2	91.0	83.6	73.2	84.2	82.0
RAM-GPRNet (W32) [30]	384×288	31.4	17.2	77.3	—	—	—	—	—
RAM-GPRNet (W48) [30]	384×288	70.0	35.6	77.7	—	—	—	—	—
EMF-HRNet [31]	384×288	28.8	—	76.5	90.7	83.1	72.7	83.6	81.5
Ours (W32)	384×288	30.3	18.6	78.0	93.1	83.5	73.1	82.9	81.4
Ours (W48)	384×288	66.2	37.4	78.4	93.3	83.6	73.4	83.7	81.7

Method	Input size	Params/M	FLOPs/G	AP/%	AP^0.5/%	AP^0.75/%	AP^M/%	AP^L/%	AR/%
CPN50 [7]	384×288	—	—	72.6	86.1	69.7	78.3	64.1	—
Simple Baseline152 [8]	384×288	68.6	35.6	73.7	91.9	81.1	70.3	80.0	79.0
HRNet (W32) [9]	384×288	28.5	16.0	74.9	92.5	82.8	71.3	80.9	80.1
HRNet (W48) [9]	384×288	63.6	32.9	75.5	92.5	83.3	71.9	81.5	80.5
TokenPose-L/D24 [12]	384×288	29.8	22.1	75.9	92.3	83.4	72.2	82.1	80.8
HRFormer-B [13]	384×288	43.2	26.8	76.2	92.7	83.8	72.5	82.3	81.2
RAM-GPRNet (W32) [30]	384×288	31.4	17.2	76.5	—	—	—	—	—
RAM-GPRNet (W48) [30]	384×288	70.0	35.6	77.0	—	—	—	—	—
Ours (W32)	384×288	30.3	18.6	77.6	92.9	83.4	72.8	82.5	81.2
Ours (W48)	384×288	66.2	37.4	78.1	93.0	83.6	73.2	83.1	81.5
