Journal of Jilin University(Engineering and Technology Edition) ›› 2025, Vol. 55 ›› Issue (4): 1384-1395.doi: 10.13229/j.cnki.jdxbgxb.20230740


Semantic segmentation network based on attention mechanism and feature fusion

Hua CAI1, Yu-yao WANG1, Qiang FU2, Zhi-yong MA3, Wei-gang WANG3, Chen-jie ZHANG1

  1. School of Electronic Information Engineering, Changchun University of Science and Technology, Changchun 130022, China
    2. School of Opto-Electronic Engineering, Changchun University of Science and Technology, Changchun 130022, China
    3. No. 2 Department of Urology, The First Hospital of Jilin University, Changchun 130061, China
  Received: 2023-07-15    Online: 2025-04-01    Published: 2025-06-19

Abstract:

To address multi-scale object segmentation errors and the weak correlation between multi-scale feature maps and feature maps at different stages in the DeepLabv3+ network, three modules are proposed: a global context attention module, a cascade adaptive scale awareness module, and an attention optimized fusion module. The global context attention module is embedded in the initial stage of the backbone network, allowing it to capture rich contextual information during feature extraction. The cascade adaptive scale awareness module models the dependencies between multi-scale features, strengthening the focus on target-relevant features. The attention optimized fusion module merges features from multiple layers through multiple pathways to enhance pixel continuity during decoding. The improved network is validated on the Cityscapes dataset and the PASCAL VOC2012 augmented dataset; the experimental results demonstrate that it overcomes the limitations of DeepLabv3+, with the mean intersection over union reaching 76.2% and 78.7%, respectively.
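The paper defines the global context attention module only at the level of Fig. 2. As a generic illustration of global-context channel attention in the squeeze-excite spirit (cf. ref. [23]), the following NumPy sketch shows the squeeze-gate-rescale pattern; the function name and the weight matrices `w1`/`w2` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def global_context_attention(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Illustrative global-context channel attention on a (C, H, W) feature map.

    Squeeze: global average pooling summarizes each channel's context.
    Excite: two illustrative linear maps (w1, w2) produce per-channel
    gates in (0, 1) that rescale the input features.
    """
    squeeze = x.mean(axis=(1, 2))                    # (C,) global context vector
    hidden = np.maximum(0.0, w1 @ squeeze)           # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gate, one per channel
    return x * gates[:, None, None]                  # reweight each channel

# Toy usage: 8 channels, reduction ratio 2 (sizes chosen for illustration).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((4, 8)) * 0.1
w2 = rng.standard_normal((8, 4)) * 0.1
y = global_context_attention(x, w1, w2)
```

Because the gates lie in (0, 1), the module can only attenuate channels; in a trained network the gating weights learn which channels carry useful global context.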

Key words: semantic segmentation, multi-scale features, contextual information, attention mechanism, feature fusion

CLC Number: TP391.4

Fig.1  Overall architecture of the improved network

Fig.2  Global context attention module

Fig.3  Cascade adaptive scale awareness module

Fig.4  Attention optimized fusion module

Fig.5  Visualization results of attention in the improved network

Fig.6  Visualization results of feature maps in the improved network

Table 1  Comparison of different backbone networks on the Cityscapes validation set

Model               | Backbone    | MIoU/%
Baseline            | MobileNetV2 | 72.1
+GCAM+CASAM+AOFM1,2 | MobileNetV2 | 72.5
Baseline            | ResNet18    | 73.2
+GCAM+CASAM+AOFM1,2 | ResNet18    | 74.3
Baseline            | ResNet50    | 73.8
+GCAM+CASAM+AOFM1,2 | ResNet50    | 75.6
Baseline            | ResNet101   | 73.9
+GCAM+CASAM+AOFM1,2 | ResNet101   | 76.2
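The MIoU/% figures reported throughout the tables are the mean intersection over union across classes. A minimal sketch of the metric (function name and toy maps are illustrative):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection over union over classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2-class example: 3 of 4 pixels agree.
gt   = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
print(round(mean_iou(pred, gt, 2) * 100, 1))  # prints 58.3
```

Per-class IoU (as in Table 4) is the same intersection/union ratio before averaging.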

Table 2  Ablation experiments of the improved network on the Cityscapes validation set

DeepLabv3+ | GCAM | CASAM | AOFM1 | AOFM2 | Params/M | MIoU/%
✓          |      |       |       |       | 52.7     | 73.9
✓          | ✓    |       |       |       | 58.2     | 74.2
✓          | ✓    | ✓     |       |       | 64.3     | 75.8
✓          | ✓    | ✓     | ✓     |       | 66.2     | 76.1
✓          | ✓    | ✓     | ✓     | ✓     | 67.5     | 76.2

Fig.7  Comparison of CASAM and ASPP on the Cityscapes test set
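Fig. 7 contrasts the proposed scale-awareness module with the ASPP module of DeepLabv3+. As background, the core operation of ASPP is atrous (dilated) convolution applied in parallel at several rates; the following single-channel NumPy sketch is a generic illustration, not the paper's code, and the function name and toy sizes are assumptions.

```python
import numpy as np

def dilated_conv2d(x: np.ndarray, kernel: np.ndarray, rate: int) -> np.ndarray:
    """Single-channel atrous (dilated) convolution with 'same' padding.

    Sampling the input every `rate` pixels under each kernel tap enlarges
    the receptive field without adding parameters -- the idea behind the
    parallel ASPP branches in DeepLabv3+ (rates such as 6, 12, 18).
    """
    k = kernel.shape[0]
    pad = rate * (k // 2)
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            patch = xp[i:i + rate * k:rate, j:j + rate * k:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.ones((5, 5))
k = np.ones((3, 3)) / 9.0
y1 = dilated_conv2d(x, k, rate=1)  # standard 3x3 convolution
y2 = dilated_conv2d(x, k, rate=2)  # same 9 taps, 5x5 receptive field
```

Both calls use the same nine weights; only the sampling stride under the kernel changes, which is why stacking several rates captures multi-scale context cheaply.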

Table 3  Comparison of different algorithms on the Cityscapes validation set

Algorithm  | Backbone  | Params/M | MIoU/%
SegNet     | VGG16     | 50.85    | 66.7
CCNet      | ResNet50  | 60.8     | 71.4
UperNet    | ResNet101 | 86.4     | 72.5
OCRNet     | HRNet     | 72.8     | 73.4
DANet      | ResNet50  | 58.9     | 73.9
Segformer  | MiT       | 101.2    | 75.9
SETR-MLA   | T-Small   | 311.8    | 76.1
DeepLabv3+ | ResNet101 | 52.7     | 74.2
Ours       | ResNet101 | 67.5     | 76.2

Table 4  Prediction results for each category of different algorithms on the Cityscapes validation set (IoU/%)

Algorithm  | Road | Sidewalk | Building | Wall | Fence | Traffic light | Traffic sign | Pole | Vegetation | Terrain | Sky  | Person | Rider | Car  | Truck | Bus  | Train | Motorcycle | Bicycle | MIoU/%
SegNet     | 89.8 | 75.4     | 81.7     | 38.7 | 58.7  | 60.3          | 57.9         | 59.8 | 70.1       | 56.7    | 80.6 | 72.4   | 58.3  | 89.6 | 62.3  | 68.4 | 67.6  | 51.3       | 67.9    | 66.7
CCNet      | 91.1 | 78.8     | 86.3     | 50.9 | 60.2  | 61.5          | 66.8         | 61.4 | 73.2       | 58.0    | 84.4 | 75.2   | 59.1  | 79.2 | 73.7  | 83.1 | 76.3  | 66.2       | 71.2    | 71.4
UperNet    | 92.4 | 79.6     | 87.4     | 51.2 | 58.1  | 62.9          | 67.4         | 63.5 | 73.4       | 59.1    | 86.6 | 76.3   | 60.3  | 85.6 | 77.1  | 83.1 | 77.5  | 64.6       | 71.4    | 72.5
OCRNet     | 93.3 | 79.4     | 88.1     | 51.8 | 59.6  | 63.7          | 68.3         | 63.7 | 74.2       | 58.9    | 87.3 | 77.8   | 61.4  | 86.9 | 78.4  | 84.2 | 78.6  | 66.7       | 72.3    | 73.4
DANet      | 93.6 | 80.1     | 88.3     | 47.4 | 58.9  | 64.5          | 69.1         | 64.5 | 74.1       | 60.7    | 87.6 | 78.3   | 60.6  | 87.9 | 79.9  | 86.7 | 79.2  | 68.6       | 73.3    | 73.9
Segformer  | 93.8 | 80.5     | 88.9     | 55.4 | 61.9  | 66.1          | 72.6         | 66.0 | 78.4       | 61.3    | 87.9 | 79.4   | 62.7  | 90.6 | 82.2  | 86.9 | 81.6  | 69.5       | 76.4    | 75.9
SETR-MLA   | 94.2 | 80.3     | 89.6     | 55.9 | 62.3  | 66.4          | 71.8         | 64.3 | 77.5       | 62.8    | 88.7 | 80.2   | 62.9  | 90.4 | 82.6  | 86.7 | 81.9  | 69.4       | 77.1    | 76.1
DeepLabv3+ | 93.8 | 80.2     | 88.6     | 53.2 | 59.6  | 64.5          | 69.5         | 63.4 | 76.1       | 60.4    | 87.3 | 78.5   | 60.3  | 89.1 | 79.3  | 85.6 | 79.9  | 67.6       | 73.7    | 74.2
Ours       | 94.3 | 81.1     | 89.4     | 56.6 | 62.6  | 66.7          | 71.2         | 66.7 | 77.2       | 62.7    | 88.2 | 80.7   | 62.8  | 90.8 | 82.7  | 87.2 | 82.3  | 69.5       | 75.2    | 76.2

Fig.8  Visualization of prediction results of different networks on the Cityscapes validation set

Table 5  Comparison of different algorithms on the PASCAL VOC2012 validation set

Algorithm  | Backbone  | Params/M | Flops/G | MIoU/%
SegNet     | VGG16     | 20.32    | 22.3    | 70.1
CCNet      | ResNet50  | 47.2     | 60.1    | 71.5
UperNet    | ResNet101 | 53.4     | 64.5    | 72.4
OCRNet     | ResNet101 | 69.4     | 68.4    | 73.7
DANet      | ResNet50  | 10.34    | 51.4    | 76.3
SETR-MLA   | T-Small   | 180.62   | 65.2    | 78.3
Segformer  | MiT       | 100.4    | 63.9    | 78.6
DeepLabv3+ | ResNet101 | 32.7     | 56.6    | 72.9
Ours       | ResNet101 | 40.2     | 62.3    | 78.7

Fig.9  Visualization of prediction results of different networks on the PASCAL VOC2012 validation set

[1] Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation[C]∥Medical Image Computing and Computer-Assisted Intervention-MICCAI: The 18th International Conference, Munich, Germany, 2015: 234-241.
[2] Chen J, Lu Y, Yu Q, et al. TransUNet: transformers make strong encoders for medical image segmentation[J/OL]. [2023-07-02]. arXiv preprint arXiv:2102.04306.
[3] Zhao T Y, Xu J D, Chen R, et al. Remote sensing image segmentation based on the fuzzy deep convolutional neural network[J]. International Journal of Remote Sensing, 2021, 42(16): 6264-6283.
[4] Yuan X H, Shi J F, Gu L C. A review of deep learning methods for semantic segmentation of remote sensing imagery[J]. Expert Systems with Applications, 2021, 169: No.114417.
[5] Xu Z Y, Zhang W, Zhang T X, et al. Efficient transformer for remote sensing image segmentation[J]. Remote Sensing, 2021, 13(18): No.3585.
[6] Badrinarayanan V, Kendall A, Cipolla R. Segnet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495.
[7] Yu C, Gao C, Wang J, et al. Bisenet v2: bilateral network with guided aggregation for real-time semantic segmentation[J]. International Journal of Computer Vision, 2021, 129: 3051-3068.
[8] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA,2015: 3431-3440.
[9] Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(4): 834-848.
[10] Chen L C, Papandreou G, Schroff F, et al. Rethinking atrous convolution for semantic image segmentation[J/OL]. [2023-07-03]. arXiv preprint arXiv:1706.05587.
[11] Chen L C, Zhu Y, Papandreou G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]∥Proceedings of the European conference on computer vision (ECCV),Munich, Germany,2018: 833-851.
[12] Wang J, Sun K, Cheng T, et al. Deep high-resolution representation learning for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 43(10): 3349-3364.
[13] Liu Z, Lin Y, Cao Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision,Montreal, Canada, 2021: 10012-10022.
[14] Wang W, Xie E, Li X, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision,Montreal, Canada, 2021: 568-578.
[15] Zheng S, Lu J, Zhao H, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA,2021: 6881-6890.
[16] Xie E, Wang W, Yu Z, et al. SegFormer: simple and efficient design for semantic segmentation with transformers[J]. Advances in Neural Information Processing Systems, 2021, 34: 12077-12090.
[17] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: transformers for image recognition at scale[J/OL]. [2023-07-04]. arXiv preprint arXiv:2010.11929.
[18] Zhao H, Shi J, Qi X, et al. Pyramid scene parsing network[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 2881-2890.
[19] Hou Q, Zhang L, Cheng M M, et al. Strip pooling: rethinking spatial pooling for scene parsing[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,Seattle, USA, 2020: 4003-4012.
[20] Peng C, Zhang X, Yu G, et al. Large kernel matters-improve semantic segmentation by global convolutional network[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 4353-4361.
[21] Ding X, Zhang X, Han J, et al. Scaling up your kernels to 31×31: revisiting large kernel design in CNNs[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,New Orleans, USA, 2022: 11963-11975.
[22] Guo M H, Lu C Z, Liu Z N, et al. Visual attention network[J/OL]. [2023-07-04]. arXiv preprint.
[23] Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7132-7141.
[24] Wang Q, Wu B, Zhu P, et al. ECA-Net: efficient channel attention for deep convolutional neural networks[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,Seattle, USA, 2020: 11534-11542.
[25] Fu J, Liu J, Tian H, et al. Dual attention network for scene segmentation[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019: 3146-3154.
[26] Cordts M, Omran M, Ramos S, et al. The cityscapes dataset for semantic urban scene understanding[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,Las Vegas, USA, 2016: 3213-3223.
[27] Everingham M, Eslami S M A, Van Gool L, et al. The pascal visual object classes challenge: a retrospective[J]. International Journal of Computer Vision, 2015, 111: 98-136.
[28] Wang Xue, Li Zhan-shan, Lyu Ying-da. Medical image segmentation algorithm based on multi-scale perception and semantic adaptation[J]. Journal of Jilin University (Engineering and Technology Edition), 2022, 52(3): 640-647. (in Chinese)