Journal of Jilin University (Engineering and Technology Edition) ›› 2025, Vol. 55 ›› Issue (4): 1384-1395. doi: 10.13229/j.cnki.jdxbgxb.20230740

• Computer Science and Technology •

Semantic segmentation network based on attention mechanism and feature fusion

Hua CAI1, Yu-yao WANG1, Qiang FU2, Zhi-yong MA3, Wei-gang WANG3, Chen-jie ZHANG1

  1. School of Electronic Information Engineering, Changchun University of Science and Technology, Changchun 130022, China
    2. Institute of Space Optoelectronic Technology, Changchun University of Science and Technology, Changchun 130022, China
    3. No. 2 Department of Urology, The First Hospital of Jilin University, Changchun 130061, China
  • Received: 2023-07-15  Online: 2025-04-01  Published: 2025-06-19
  • About the first author: Hua Cai (1977-), male, associate professor, Ph.D. Research interests: image processing, machine vision, and artificial intelligence algorithms. E-mail: caihua@cust.edu.cn
  • Funding:
    Major Program of the National Natural Science Foundation of China (61890963); Science and Technology Development Plan of Jilin Province (20210204099YY); Science and Technology Development Plan of Jilin Province (20240302089GX)

Abstract:

To address multi-scale object segmentation errors and the weak correlation between multi-scale feature maps, as well as between feature maps at different stages, in the DeepLabv3+ network, three modules are introduced on top of DeepLabv3+: a global context attention module (GCAM), a cascade adaptive scale awareness module (CASAM), and an attention optimized fusion module (AOFM). The global context attention module is embedded in the initial stage of the backbone's feature extraction to capture rich contextual information. The cascade adaptive scale awareness module models the dependencies between multi-scale features, enabling a stronger focus on target-relevant features. The attention optimized fusion module fuses multi-level features through multiple branches to enhance pixel continuity during decoding. The improved network is validated on the Cityscapes dataset and the PASCAL VOC2012 augmented dataset; the experimental results show that it remedies the shortcomings of DeepLabv3+, reaching a mean intersection over union of 76.2% and 78.7%, respectively.

Key words: semantic segmentation, multi-scale features, contextual information, attention mechanism, feature fusion

CLC number: TP391.4

Fig. 1  Overall structure of the improved segmentation network

Fig. 2  Global context attention module

Fig. 3  Cascade adaptive scale awareness module

Fig. 4  Attention optimized fusion module

Fig. 5  Attention visualization results of the improved network

Fig. 6  Feature visualization results of the improved network
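The structure of the global context attention module (Fig. 2) is only shown graphically in the original article. As a rough illustration of how channel-attention modules of this family typically work — pool a global descriptor, pass it through a small bottleneck, and gate each channel — here is a minimal NumPy sketch; the function name, weight shapes, and sigmoid gating are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def global_context_attention(x, w1, w2):
    """Recalibrate the channels of a feature map with a pooled global context.

    x  : feature map, shape (C, H, W)
    w1 : bottleneck reduction weights, shape (C//r, C)
    w2 : bottleneck expansion weights, shape (C, C//r)
    """
    context = x.mean(axis=(1, 2))                 # global average pool -> (C,)
    hidden = np.maximum(w1 @ context, 0.0)        # bottleneck + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid channel gate -> (C,)
    return x * gate[:, None, None]                # reweight each channel map

# Toy usage: 4 channels, an 8x8 spatial grid, reduction ratio r = 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w1 = rng.standard_normal((2, 4)) * 0.1
w2 = rng.standard_normal((4, 2)) * 0.1
y = global_context_attention(x, w1, w2)
print(y.shape)  # (4, 8, 8)
```

In practice such a block is trained end to end inside the backbone; the sketch only shows the data flow, not the learned weights.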

Table 1  Comparison of different backbone networks on the Cityscapes validation set

Model                  Backbone     MIoU/%
Baseline               MobileNetV2  72.1
+GCAM+CASAM+AOFM1,2    MobileNetV2  72.5
Baseline               ResNet18     73.2
+GCAM+CASAM+AOFM1,2    ResNet18     74.3
Baseline               ResNet50     73.8
+GCAM+CASAM+AOFM1,2    ResNet50     75.6
Baseline               ResNet101    73.9
+GCAM+CASAM+AOFM1,2    ResNet101    76.2
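All accuracy figures in these tables are mean intersection over union (MIoU), computed from a per-class confusion matrix. As a reference for the metric, a minimal NumPy sketch (the `mean_iou` helper and the toy labels are illustrative, not from the paper):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection over union from flat integer label arrays.

    Builds a confusion matrix, then averages per-class
    IoU = TP / (TP + FP + FN) over classes present in either map.
    """
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, t in zip(pred.ravel(), target.ravel()):
        cm[t, p] += 1
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp          # labeled class c but missed
    denom = tp + fp + fn
    valid = denom > 0                 # skip classes absent from both maps
    return float((tp[valid] / denom[valid]).mean())

# Toy 2x3 "image" with 3 classes.
pred   = np.array([[0, 1, 1], [2, 2, 0]])
target = np.array([[0, 1, 2], [2, 2, 0]])
print(round(mean_iou(pred, target, 3), 3))  # 0.722
```

Cityscapes evaluates this metric over its 19 training classes, which is why Table 4 below lists 19 per-class columns.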

Table 2  Ablation experiments of the improved network on the Cityscapes validation set

DeepLabv3+  GCAM  CASAM  AOFM1  AOFM2  Params/M  MIoU/%
✓                                      52.7      73.9
✓           ✓                          58.2      74.2
✓           ✓     ✓                    64.3      75.8
✓           ✓     ✓      ✓             66.2      76.1
✓           ✓     ✓      ✓      ✓      67.5      76.2

Fig. 7  Comparison of CASAM and ASPP on the Cityscapes test set

Table 3  Comparison of different algorithms on the Cityscapes validation set

Algorithm    Backbone   Params/M  MIoU/%
SegNet       VGG16      50.85     66.7
CCNet        ResNet50   60.8      71.4
UperNet      ResNet101  86.4      72.5
OCRNet       HRNet      72.8      73.4
DANet        ResNet50   58.9      73.9
Segformer    MiT        101.2     75.9
SETR-MLA     T-Small    311.8     76.1
DeepLabv3+   ResNet101  52.7      74.2
Ours         ResNet101  67.5      76.2

Table 4  Per-class prediction results of different algorithms on the Cityscapes validation set

Algorithm   Road Side Bldg Wall Fnce TLgt TSgn Pole Veg  Terr Sky  Pers Ridr Car  Trck Bus  Trn  Moto Bike MIoU/%
SegNet      89.8 75.4 81.7 38.7 58.7 60.3 57.9 59.8 70.1 56.7 80.6 72.4 58.3 89.6 62.3 68.4 67.6 51.3 67.9 66.7
CCNet       91.1 78.8 86.3 50.9 60.2 61.5 66.8 61.4 73.2 58.0 84.4 75.2 59.1 79.2 73.7 83.1 76.3 66.2 71.2 71.4
UperNet     92.4 79.6 87.4 51.2 58.1 62.9 67.4 63.5 73.4 59.1 86.6 76.3 60.3 85.6 77.1 83.1 77.5 64.6 71.4 72.5
OCRNet      93.3 79.4 88.1 51.8 59.6 63.7 68.3 63.7 74.2 58.9 87.3 77.8 61.4 86.9 78.4 84.2 78.6 66.7 72.3 73.4
DANet       93.6 80.1 88.3 47.4 58.9 64.5 69.1 64.5 74.1 60.7 87.6 78.3 60.6 87.9 79.9 86.7 79.2 68.6 73.3 73.9
Segformer   93.8 80.5 88.9 55.4 61.9 66.1 72.6 66.0 78.4 61.3 87.9 79.4 62.7 90.6 82.2 86.9 81.6 69.5 76.4 75.9
SETR-MLA    94.2 80.3 89.6 55.9 62.3 66.4 71.8 64.3 77.5 62.8 88.7 80.2 62.9 90.4 82.6 86.7 81.9 69.4 77.1 76.1
DeepLabv3+  93.8 80.2 88.6 53.2 59.6 64.5 69.5 63.4 76.1 60.4 87.3 78.5 60.3 89.1 79.3 85.6 79.9 67.6 73.7 74.2
Ours        94.3 81.1 89.4 56.6 62.6 66.7 71.2 66.7 77.2 62.7 88.2 80.7 62.8 90.8 82.7 87.2 82.3 69.5 75.2 76.2
(Side = sidewalk, Bldg = building, Fnce = fence, TLgt = traffic light, TSgn = traffic sign, Veg = vegetation, Terr = terrain, Pers = person, Ridr = rider, Trck = truck, Trn = train, Moto = motorcycle, Bike = bicycle)

Fig. 8  Visualization of prediction results of different algorithms on the Cityscapes validation set

Table 5  Comparison of different algorithms on the PASCAL VOC2012 validation set

Algorithm    Backbone   Params/M  Flops/G  MIoU/%
SegNet       VGG16      20.32     22.3     70.1
CCNet        ResNet50   47.2      60.1     71.5
UperNet      ResNet101  53.4      64.5     72.4
OCRNet       ResNet101  69.4      68.4     73.7
DANet        ResNet50   10.34     51.4     76.3
SETR-MLA     T-Small    180.62    65.2     78.3
Segformer    MiT        100.4     63.9     78.6
DeepLabv3+   ResNet101  32.7      56.6     72.9
Ours         ResNet101  40.2      62.3     78.7

Fig. 9  Visualization of predictions of different networks on the PASCAL VOC2012 validation set
