Lightweight frequency and spatial feature fused multi-scale remote sensing scene classification network

doi:10.13229/j.cnki.jdxbgxb.20240054

Abstract

Remote sensing scene classification is complicated by land-cover objects of widely varying sizes and spatial arrangements, high inter-class similarity, and large intra-class variability. To address these problems through effective feature extraction and thorough multi-scale feature fusion, a lightweight multi-scale remote sensing scene classification network that fuses frequency and spatial features (FS-LMFFNet) is proposed. First, to combine the strengths of convolutional neural networks (CNNs) and Transformers and extract both local and global features, a frequency and spatial multilayer perceptron module (FS-MLP) is proposed; by introducing frequency-domain analysis, it compensates for the weakness of conventional spatial operations in extracting global high-frequency texture features. Second, to handle the multi-scale nature of remote sensing scene images, a lightweight multi-layer feature fusion module (LMFF) is proposed, which uses lightweight convolutional blocks to fuse the multi-scale features of the first three stages. Finally, FS-LMFFNet, built from the FS-MLP and LMFF modules, is evaluated on three public datasets, UC_Merced, RSSCN7, and AID, reaching accuracies of 99.10%, 96.60%, and 95.48%, respectively. The experimental results show that FS-LMFFNet extracts and fuses multi-scale features more effectively and outperforms other state-of-the-art models.

Key words: remote sensing images, deep learning, convolutional neural network (CNN), fast Fourier transform (FFT), multi-scale feature fusion
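The abstract attributes the global receptive field of FS-MLP to frequency-domain analysis. A minimal sketch of why a frequency-domain operation is global, under stated assumptions: by the convolution theorem, multiplying a signal's spectrum element-wise by a filter's spectrum equals a circular convolution whose kernel can span the whole signal. The helper names (`dft`, `idft`, `circular_conv`, `frequency_filter`) and the toy 1-D data are hypothetical and are not the FS-MLP module itself.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a 1-D sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circular_conv(x, h):
    """Spatial-domain circular convolution; every output mixes every input."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

def frequency_filter(x, h):
    """Same operation in the frequency domain: one element-wise product."""
    product = [a * b for a, b in zip(dft(x), dft(h))]
    return [v.real for v in idft(product)]

# Toy data (assumed values, for illustration only).
signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]

spatial = circular_conv(signal, kernel)
spectral = frequency_filter(signal, kernel)
max_diff = max(abs(a - b) for a, b in zip(spatial, spectral))
print(f"largest difference between the two routes: {max_diff:.2e}")
```

In practice a fast Fourier transform (FFT), as named in the key words, would replace the naive transforms used here for readability.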

CLC number: TP391.4

WANG Wei, SUN Yu-jie, WANG Xin. Lightweight frequency and spatial feature fused multi-scale remote sensing scene classification network[J]. Journal of Jilin University (Engineering and Technology Edition), 2025, 55(10): 3361-3371.

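Table 3 Ablation experiment on the downsampling operation in LMFF

Downsampling method	Params/M	FLOPs/G	Accuracy/%
(-, -, downsampling block)	2.67	1.14	98.16±0.30
$\mathrm{MaxP}_{2}^{8}, \mathrm{MaxP}_{2}^{4}, \mathrm{MaxP}_{2}^{2}$	2.44	1.12	98.70±0.26
$\mathrm{MaxP}_{8}^{8}, \mathrm{MaxP}_{4}^{4}, \mathrm{MaxP}_{2}^{2}$	2.43	1.12	99.10±0.22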
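The best-performing row pairs the values 8, 4, and 2 with the three stages, which matches the factors needed to bring feature maps at 1/4, 1/8, and 1/16 resolution down to a common 1/32 resolution. The sketch below illustrates that alignment with non-overlapping max pooling (window equal to stride). Reading the table's notation as window and stride is an assumption, as are the helper names `max_pool2d` and `make_map` and the 64-pixel input; channels and the subsequent convolutional fusion are omitted.

```python
def max_pool2d(fmap, k):
    """Non-overlapping k x k max pooling (window = stride = k) on a 2-D list."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj] for di in range(k) for dj in range(k))
             for j in range(0, w - k + 1, k)]
            for i in range(0, h - k + 1, k)]

def make_map(size, offset):
    """Hypothetical single-channel feature map filled with increasing values."""
    return [[offset + r * size + c for c in range(size)] for r in range(size)]

H = 64          # assumed input height and width, for illustration only
TARGET = 32     # common reduction factor after alignment

# Keys are the reduction factors of stages 1 to 3 (H/4, H/8, H/16).
stage_maps = {
    4: make_map(H // 4, 0),
    8: make_map(H // 8, 1000),
    16: make_map(H // 16, 2000),
}

# Pool each stage by TARGET / reduction, i.e. by 8, 4, and 2.
aligned = [max_pool2d(fmap, TARGET // reduction)
           for reduction, fmap in stage_maps.items()]
shapes = [(len(m), len(m[0])) for m in aligned]
print(shapes)
```

Once all three maps share one spatial size they can be concatenated along the channel axis and fused, which is the role the abstract assigns to the lightweight convolutional blocks in LMFF.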


References

[1]	Xu Cong-an, Lyu Ya-fei, Zhang Xiao-han, et al. A discriminative feature representation method based on dual attention mechanism for remote sensing image scene classification[J]. Journal of Electronics & Information Technology, 2021, 43(3): 683-691.
[2]	Morell-Monzó S, Sebastiá-Frasquet M T, Estornell J. Land use classification of VHR images for mapping small-sized abandoned citrus plots by using spectral and textural information[J]. Remote Sensing, 2021, 13(4): No. 681.
[3]	Liang S, Cheng J, Zhang J. Maximum likelihood classification of soil remote sensing image based on deep learning[J]. Earth Sciences Research Journal, 2020, 24(3): 357-365.
[4]	Fatemighomi H S, Golalizadeh M, Amani M. Object-based hyperspectral image classification using a new latent block model based on hidden Markov random fields[J]. Pattern Analysis and Applications, 2022, 25: 467-481.
[5]	Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[6]	Dai J, Qi H, Xiong Y, et al. Deformable convolutional networks[C]∥Proceedings of the IEEE International Conference on Computer Vision(ICCV), Venice, Italy, 2017: 764-773.
[7]	Ding X, Zhang X, Han J, et al. Scaling up your kernels to 31×31: revisiting large kernel design in CNNs[C]∥Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition(CVPR), New Orleans, Louisiana, USA, 2022: 11953-11965.
[8]	Guo M H, Lu C Z, Liu Z N, et al. Visual attention network[J]. Computational Visual Media, 2022, 9(4): 733-752.
[9]	Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]∥Proceedings of 31st Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000-6010.
[10]	Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: transformers for image recognition at scale[EB/OL]. [2022-10-18].
[11]	Bazi Y, Bashmal L, Rahhal M M A, et al. Vision transformers for remote sensing image classification[J]. Remote Sensing, 2021, 13(3): No. 516.
[12]	Yu W H, Luo M, Zhou P, et al. MetaFormer is actually what you need for vision[C]∥Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10809-10819.
[13]	Wang Wei, Li Xi-jie, Wang Xin. ADC-CPANet: a remote sensing image classification method based on local-global feature fusion[J]. National Remote Sensing Bulletin, 2024, 28(10): 2661-2672.
[14]	Wang W, Hu T, Wang X, et al. BFRNet: bidimensional feature representation network for remote sensing images classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-13.
[15]	Huang Z, Zhang Z, Lan C, et al. Adaptive frequency filters as efficient global token mixers[EB/OL]. [2023-03-22].
[16]	Cao R, Fang L, Lu T, et al. Self-attention-based deep feature fusion for remote sensing scene classification[J]. IEEE Geoscience and Remote Sensing Letters, 2021, 18(1): 43-47.
[17]	Wang Wei, Deng Ji-wei, Wang Xin, et al. GLFFNet model for remote sensing image scene classification[J]. Acta Geodaetica et Cartographica Sinica, 2023, 52(10): 1693-1702.
[18]	Hendrycks D, Gimpel K. Gaussian error linear units (GELUs)[EB/OL]. [2024-01-10].
[19]	Sandler M, Howard A, Zhu M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]∥IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 4510-4520.
[20]	Zhou D, Hou Q, Chen Y, et al. Rethinking bottleneck structure for efficient mobile network design[J]. Computer Vision-ECCV 2020, Lecture Notes in Computer Science, 2020, 12348: 680-697.
[21]	Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]∥Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015: 448-456.
[22]	Hou Q, Zhou D, Feng J. Coordinate attention for efficient mobile network design[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), Montreal, Canada, 2021: 13713-13722.
[23]	Yang Y, Newsam S. Bag-of-visual-words and spatial extensions for land-use classification[C]∥Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, California, USA, 2010: 270-279.
[24]	Zou Q, Ni L H, Zhang T, et al. Deep learning based feature selection for remote sensing scene classification[J]. IEEE Geoscience and Remote Sensing Letters, 2015, 12(11): 2321-2325.
[25]	Xia G S, Hu J, Hu F, et al. AID: a benchmark data set for performance evaluation of aerial scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(7): 3965-3981.
[26]	Liu Z, Lin Y, Cao Y, et al. Swin transformer: hierarchical vision transformer using shifted windows[C]∥ IEEE/CVF International Conference on Computer Vision(ICCV), Montreal, Canada, 2021: 10012-10022.
[27]	Cao G, Luo S, Huang W, et al. Strip-MLP: efficient token interaction for vision MLP[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023: 1494-1504.
[28]	Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2023-03-18].
[29]	He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]∥Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770-778.
[30]	Qin Z, Zhang P, Wu F, et al. FcaNet: frequency channel attention networks[C]∥Proceedings of the IEEE International Conference on Computer Vision, Xi'an, China, 2020: 763-772.
[31]	Rao Y, Zhao W, Zhu Z, et al. Global filter networks for image classification[J]. Advances in Neural Information Processing Systems, 2021, 2: 980-993.
[32]	Tang Y, Han K, Guo J, et al. An image patch is a wave: phase-aware vision MLP[EB/OL]. [2023-03-18].
[33]	Li J, Hassani A, Walton S, et al. ConvMLP: hierarchical convolutional MLPs for vision[C]∥IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, Canada, 2023: 6307-6316.
[34]	Wang X, Duan L, Ning C, et al. Relation-attention networks for remote sensing scene classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2022, 15: 422-439.
[35]	Tang X, Li M, Ma J, et al. EMTCAL: efficient multiscale transformer and cross-level attention learning for remote sensing scene classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-15.
[36]	Selvaraju R R, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]∥Proceedings of 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017: 618-626.


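Stage	Layer	Output size
Stage 1	Tokenizer layer	(H/4)×(W/4)
Stage 1	FS-MLP module ×2	(H/4)×(W/4)
Stage 2	Downsampling module	(H/8)×(W/8)
Stage 2	FS-MLP module ×2	(H/8)×(W/8)
Stage 3	Downsampling module	(H/16)×(W/16)
Stage 3	FS-MLP module ×8	(H/16)×(W/16)
Stage 4	LMFF module	(H/32)×(W/32)
Classifier	Normalization, global pooling, fully connected layer	C = number of predicted classes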
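The output sizes in the table above follow directly from the cumulative reduction factors 4, 8, 16, and 32. The short sketch below tabulates them for a concrete input. The 224×224 resolution is an assumed value chosen only for illustration, and floor division is assumed where the input is not a multiple of 32; `LAYERS` and `output_sizes` are hypothetical names.

```python
# (stage, layer, cumulative reduction factor), transcribed from the table above.
LAYERS = [
    ("Stage 1", "Tokenizer layer", 4),
    ("Stage 1", "FS-MLP module x2", 4),
    ("Stage 2", "Downsampling module", 8),
    ("Stage 2", "FS-MLP module x2", 8),
    ("Stage 3", "Downsampling module", 16),
    ("Stage 3", "FS-MLP module x8", 16),
    ("Stage 4", "LMFF module", 32),
]

def output_sizes(height, width):
    """Return (stage, layer, out_height, out_width) for every listed layer."""
    return [(stage, layer, height // factor, width // factor)
            for stage, layer, factor in LAYERS]

# Assumed input resolution; the source does not state one here.
sizes = output_sizes(224, 224)
for stage, layer, h, w in sizes:
    print(f"{stage:8s} {layer:22s} {h:3d} x {w:3d}")
```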

Frequency branch	Spatial branch	Original branch	Weight computation	Params/M	FLOPs/G	Accuracy/%
×	Kernel size=7	√	√	1.99	1.09	98.36±0.26
√	×	√	√	2.31	1.07	98.62±0.20
√	Kernel size=3	√	√	2.41	1.11	98.72±0.11
√	Kernel size=7	×	√	2.34	1.09	98.86±0.26
√	Kernel size=7	√	×	2.36	1.12	98.50±0.35
√	Kernel size=7	√	√	2.43	1.12	99.10±0.22
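To read the ablation rows above at a glance, the sketch below computes how far each variant's mean accuracy falls below the full configuration (99.10%, the same figure reported for UC_Merced in the abstract). The row labels are my own shorthand for the table's check marks and crosses, and only the mean values are compared; the reported ± spreads are ignored, so small gaps should be treated with caution.

```python
FULL_ACCURACY = 99.10  # last row: all branches, kernel size 7, weight computation

# Mean accuracy of each ablated variant, transcribed from the table above.
ABLATIONS = {
    "without frequency branch": 98.36,
    "without spatial branch": 98.62,
    "spatial kernel size 3 instead of 7": 98.72,
    "without original branch": 98.86,
    "without weight computation": 98.50,
}

# Drop in percentage points relative to the full configuration.
drops = {name: round(FULL_ACCURACY - acc, 2) for name, acc in ABLATIONS.items()}
largest = max(drops, key=drops.get)

for name, drop in sorted(drops.items(), key=lambda item: -item[1]):
    print(f"{name:36s} -{drop:.2f}")
```

On these numbers, removing the frequency branch costs the most accuracy, while removing the original branch produces a gap comparable in size to the reported spread.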

Model	Params/M	FLOPs/G	Accuracy/% (UC_Merced)	Accuracy/% (RSSCN7)	Accuracy/% (AID)
ResNet18 [29]	11.15	1.82	98.28±0.25	94.96±0.42	94.50±0.19
FcaNet18 [30]	11.23	1.82	97.96±0.47	94.96±0.52	94.34±0.34
MobileNeXt [20]	2.86	0.29	97.14±0.46	94.64±0.73	95.08±0.16
MobileNeXt_CA [22]	3.26	0.29	97.36±0.37	95.10±0.39	95.26±0.11
SwinTransformer_Tiny [26]	26.96	4.2	95.48±0.45	93.40±0.52	90.20±0.20
VAN_b0 [8]	3.91	0.88	97.62±0.18	94.28±0.41	93.88±0.23
GFNet_PyramidTi [31]	12.18	1.90	95.18±0.68	91.54±0.44	90.90±0.57
WaveMLP_T [32]	16.40	2.48	96.86±0.72	92.48±0.41	92.54±0.40
ConvMLP_S [33]	8.60	2.30	95.24±0.86	94.62±0.28	93.38±0.22
Strip-MLP-T* [27]	18.24	2.54	98.72±0.16	95.32±0.51	95.12±0.23
SAFF [16]	14.76	15.38	95.58±0.39	93.78±0.63	94.18±0.17
RaNet [34]	21.47	3.85	98.38±0.39	95.24±0.15	95.38±0.22
EMTCAL [35]	27.30	4.23	98.78±0.27	95.32±0.37	94.96±0.19
FS-LMFFNet	2.43	1.12	99.10±0.22	96.60±0.24	95.48±0.13
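The "lightweight" claim can be checked against the comparison table with a little arithmetic. The sketch below, using only the parameter counts, computational costs, and mean UC_Merced accuracies transcribed from the table, finds the models that no other model beats on both cost and accuracy (a Pareto front). The function name `pareto_front` and the choice of UC_Merced as the accuracy axis are my own; the ± spreads are ignored.

```python
# (model, params in M, computational cost in G, mean UC_Merced accuracy in %)
MODELS = [
    ("ResNet18", 11.15, 1.82, 98.28),
    ("FcaNet18", 11.23, 1.82, 97.96),
    ("MobileNeXt", 2.86, 0.29, 97.14),
    ("MobileNeXt_CA", 3.26, 0.29, 97.36),
    ("SwinTransformer_Tiny", 26.96, 4.2, 95.48),
    ("VAN_b0", 3.91, 0.88, 97.62),
    ("GFNet_PyramidTi", 12.18, 1.90, 95.18),
    ("WaveMLP_T", 16.40, 2.48, 96.86),
    ("ConvMLP_S", 8.60, 2.30, 95.24),
    ("Strip-MLP-T*", 18.24, 2.54, 98.72),
    ("SAFF", 14.76, 15.38, 95.58),
    ("RaNet", 21.47, 3.85, 98.38),
    ("EMTCAL", 27.30, 4.23, 98.78),
    ("FS-LMFFNet", 2.43, 1.12, 99.10),
]

def pareto_front(models):
    """Models not dominated on (lower cost, higher accuracy)."""
    front = []
    for name, _, cost, acc in models:
        dominated = any(
            c2 <= cost and a2 >= acc and (c2 < cost or a2 > acc)
            for _, _, c2, a2 in models
        )
        if not dominated:
            front.append(name)
    return front

front = pareto_front(MODELS)
smallest = min(MODELS, key=lambda m: m[1])[0]
param_ratio = 2.43 / 11.15  # FS-LMFFNet parameters relative to ResNet18

print("cost/accuracy Pareto front:", front)
print("fewest parameters:", smallest)
print(f"parameters vs ResNet18: {param_ratio:.1%}")
```

On these figures FS-LMFFNet has the fewest parameters of all listed models and the highest mean accuracy, while the two cheaper models on the front (MobileNeXt_CA and VAN_b0) trade accuracy for lower computational cost.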