Future instance segmentation prediction based on bird's eye view of multi-source spatiotemporal information fusion

doi:10.13229/j.cnki.jdxbgxb.20231460

Abstract

Existing instance segmentation methods struggle to identify occluded objects and are not sufficiently robust to noise and viewpoint changes. To address these problems, this paper proposes a fine-grained scene bird's-eye-view (BEV) generation method that fuses multi-source spatiotemporal information (MSTFB). Starting from a rasterized scene BEV, the method uses a self-attention mechanism to fuse temporal BEV features, and a spatiotemporal cross-domain convolutional network to capture the relative positions of instances and aggregate multi-scale features, yielding a fine-grained scene BEV. On this basis, a BEV instance segmentation prediction method that fuses temporal encoding and sample features (ESF-BISP) is proposed. ConvGRU encodes the temporal semantics of the historical frames to obtain temporal features. A conditional variational autoencoder (CVAE) generates the state feature distribution of the current-frame fine-grained BEV, from which BEV sample features are drawn. A Gaussian mixture model (GMM) then fuses the BEV temporal features and sample features, and decoding yields the fine-grained scene BEV of future frames. Experiments on the public nuScenes dataset show that, compared with the baseline LSS, MSTFB improves the vehicle segmentation IoU by 7.09 percentage points (from 32.07% to 39.16%) and effectively segments distant and occluded vehicles. ESF-BISP better captures the changes of dynamic instances in the scene and significantly outperforms the baseline in both instance segmentation and future instance segmentation prediction.

Key words: computer application technology, instance segmentation prediction, bird's-eye-view temporal encoding, multi-view images, spatiotemporal cross-domain convolutional network
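The sampling step described in the abstract can be illustrated with a minimal sketch. It assumes the CVAE outputs a diagonal Gaussian (a mean and a log-variance per latent dimension) and draws a sample with the usual reparameterization. The function name `sample_latent` and the toy inputs are hypothetical, and the paper's actual network is not reproduced here.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps from a diagonal Gaussian (reparameterization).

    mean, log_var: per-dimension lists of equal length (assumed CVAE outputs).
    rng: a random.Random instance, so that sampling is reproducible.
    """
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


# Toy usage with a 32-dimensional latent (the latent_dim listed in the settings).
rng = random.Random(0)
z = sample_latent([0.0] * 32, [0.0] * 32, rng)
```

When the log-variance is very negative the sample collapses onto the mean, which is a quick sanity check for this kind of sampler.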

CLC number: TP391.41

Xia FENG, Shuang CHEN, Min LU, Hai-chao ZUO. Future instance segmentation prediction based on bird's eye view of multi-source spatiotemporal information fusion[J]. Journal of Jilin University (Engineering and Technology Edition), 2025, 55(10): 3372-3383.

Table 5 Performance of ESF-BISP with different modules

LSS	CE	T-attention	ST	FD	TD	IoU/%	PQ/%	SQ/%	RQ/%	Params/M	GFLOPs
✓						32.07	-	-	-	6.7	14.20
✓	✓					33.54	-	-	-	6.8	14.39
✓	✓	✓				35.83	-	-	-	7.1	15.76
✓	✓	✓	✓			39.16	28.53	70.20	41.25	9.1	19.53
✓	✓	✓	✓	✓		39.65	30.62	70.95	43.09	9.8	20.32
✓	✓	✓	✓	✓	✓	40.10	31.89	71.63	43.97	10.5	21.67
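To read the ablation rows more easily, the IoU gain contributed by each added module can be computed from the table. The sketch below copies the IoU column as given. The row labels (`+CE`, `+ST`, and so on) are shorthand introduced here for "previous row plus this module", not names from the paper.

```python
# (row label, vehicle IoU in %) for each ablation row, top to bottom.
ABLATION_IOU = [
    ("LSS", 32.07),
    ("+CE", 33.54),
    ("+T-attention", 35.83),
    ("+ST", 39.16),
    ("+FD", 39.65),
    ("+TD", 40.10),
]


def incremental_gains(rows):
    """Return (label, gain) pairs: the IoU gain of each row over the previous one."""
    return [(label, round(iou - prev_iou, 2))
            for (_, prev_iou), (label, iou) in zip(rows, rows[1:])]


gains = incremental_gains(ABLATION_IOU)
total_gain = round(ABLATION_IOU[-1][1] - ABLATION_IOU[0][1], 2)
largest_step = max(gains, key=lambda pair: pair[1])[0]
```

On these numbers the ST column accounts for the largest single step, and the full model gains about 8 IoU points over plain LSS.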


References

[1]	Wang X, Girdhar R, Yu S X, et al. Cut and learn for unsupervised object detection and instance segmentation[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2023: 3124-3134.
[2]	Hurtik P, Molek V, Hula J, et al. Poly-YOLO: higher speed, more precise detection and instance segmentation for YOLOv3[J]. Neural Computing and Applications, 2022, 34(10): 8275-8290.
[3]	Mao Lin, Ren Feng-zhi, Yang Da-wei, et al. Two-way feature pyramid network for panoptic segmentation[J]. Journal of Jilin University (Engineering and Technology Edition), 2022, 52(3): 657-665.
[4]	Ke L, Danelljan M, Li X, et al. Mask transfiner for high-quality instance segmentation[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2022: 4412-4421.
[5]	Cheng T H, Wang X G, Chen S Y, et al. Boxteacher: Exploring high-quality pseudo labels for weakly supervised instance segmentation[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2023: 3145-3154.
[6]	Huo Guang, Lin Da-wei, Liu Yuan-ning, et al. Lightweight iris segmentation model based on multiscale feature and attention mechanism[J]. Journal of Jilin University (Engineering and Technology Edition), 2023, 53(9): 2591-2600.
[7]	Deng L Y, Yang M, Li H, et al. Restricted deformable convolution-based road scene semantic segmentation using surround view cameras[J]. IEEE Transactions on Intelligent Transportation Systems, 2019, 21(10): 4350-4362.
[8]	Lu C Y, van de Molengraft M J G, Dubbelman G. Monocular semantic occupancy grid mapping with convolutional variational encoder-decoder networks[J]. IEEE Robotics and Automation Letters, 2019, 4(2): 445-452.
[9]	Pan B, Sun J, Leung H Y T, et al. Cross-view semantic segmentation for sensing surroundings[J]. IEEE Robotics and Automation Letters, 2020, 5(3): 4867-4873.
[10]	Philion J, Fidler S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D[C]∥The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 194-210.
[11]	Khalil Y H, Mouftah H T. End-to-end multi-view fusion for enhanced perception and motion prediction[C]∥IEEE 94th Vehicular Technology Conference, Piscataway, USA, 2021: 1-6.
[12]	Hendy N, Sloan C, Tian F, et al. FISHING net: Future inference of semantic heatmaps in grids[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2020.
[13]	Ma Y, Wang T, Bai X, et al. Vision-centric BEV perception: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 1-20.
[14]	Akan A K, Güney F. Stretchbev: Stretching future instance prediction spatially and temporally[C]∥European Conference on Computer Vision, Tel Aviv, Israel, 2022: 444-460.
[15]	Li P I, Ding S X, Chen X Y L, et al. PowerBEV: a powerful yet lightweight framework for instance prediction in bird's-eye view[DB/OL]. [2023-10-22]..
[16]	Hu Y H, Yang J Z, Chen L, et al. Planning-oriented autonomous driving[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2023: 17853-17862.
[17]	Hu A, Murez Z, Mohan N, et al. FIERY: Future instance prediction in bird's-eye view from surround monocular cameras[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision, Piscataway, USA, 2021: 15273-15282.
[18]	Yuan F N, Zhang L, Xia X, et al. A gated recurrent network with dual classification assistance for smoke semantic segmentation[J]. IEEE Transactions on Image Processing, 2021, 30: 4409-4422.
[19]	Mao Y X, Zhang J, Xiang M C, et al. Multimodal variational auto-encoder based audio-visual segmentation[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision, Piscataway, USA, 2023: 954-965.
[20]	Tan M X, Le Q V. Efficientnet: Rethinking model scaling for convolutional neural networks[C]∥International Conference on Machine Learning, Long Beach, USA, 2019: 6105-6114.
[21]	Riaz F, Rehman S, Ajmal M, et al. Gaussian mixture model based probabilistic modeling of images for medical image segmentation[J]. IEEE Access, 2020, 8: 16846-16856.
[22]	Lyu S W, Fan Y B, Ying Y M, et al. Average top-k aggregate loss for supervised learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 44(1): 76-86.
[23]	Caesar H, Bankiti V, Lang A H, et al. nuscenes: A multimodal dataset for autonomous driving[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2020: 11621-11631.
[24]	Mandal S, Biswas S, Balas V E, et al. Lyft 3D object detection for autonomous vehicles[M]∥Rabindra Shaw. Artificial Intelligence for Future Generation Robotics. Amsterdam: Elsevier, 2021: 119-136.
[25]	Gong S, Ye X, Tan X Q, et al. GitNet: Geometric prior-based transformation for birds-eye-view segmentation[C]∥European Conference on Computer Vision, Tel Aviv, Israel, 2022: 396-411.

Parameter	Value
Number of epochs (epochs)	40
Batch size (batch_size)	4
Learning rate (learning_rate)	0.0003
Optimizer (optimizer)	Adam
Dropout (dropout)	Yes
Precision	32 bit
Latent sampling dimension (latent_dim)	32
Image channels (channels)	64
Image depth (depth)	48
Loss function Top-k	25%
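The "Top-k 25%" entry can be read as a loss that averages only the hardest quarter of the per-pixel losses. The sketch below illustrates that reading only. The function name `top_k_mean_loss` and the toy values are hypothetical, and the exact loss used in the paper is not specified in this section.

```python
import math


def top_k_mean_loss(pixel_losses, top_k_ratio=0.25):
    """Average the largest `top_k_ratio` fraction of per-pixel losses."""
    if not pixel_losses:
        raise ValueError("pixel_losses must not be empty")
    if not 0.0 < top_k_ratio <= 1.0:
        raise ValueError("top_k_ratio must be in (0, 1]")
    # Keep at least one element so that tiny inputs still produce a loss.
    k = max(1, math.ceil(top_k_ratio * len(pixel_losses)))
    hardest = sorted(pixel_losses, reverse=True)[:k]
    return sum(hardest) / k


# Eight toy per-pixel losses: the top 25% are the two largest values (0.9 and 0.7).
loss = top_k_mean_loss([0.1, 0.9, 0.2, 0.05, 0.7, 0.3, 0.0, 0.4])
```

Setting `top_k_ratio=1.0` recovers the ordinary mean, so the ratio acts as a knob between uniform averaging and focusing on the hardest pixels.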

Method	Vehicle instance segmentation IoU/%	Drivable area segmentation IoU/%	Lane line segmentation IoU/%
IPM [7]	10.30	40.10	14.10
VED [8]	23.28	60.82	16.74
VPN [9]	28.17	65.97	17.05
Fishing Camera [12]	30.06	68.30	24.60
Fishing Lidar [12]	40.30	71.20	31.70
GitNet [25]	34.20	65.10	31.90
LSS [10]	32.07	72.23	19.96
MSTFB	39.16	76.82	38.20
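The IoU values in this comparison are intersection over union between predicted and ground-truth BEV masks. As a minimal sketch of the metric on a single pair of binary grids (the helper name `binary_iou` and the toy grids are hypothetical, and the treatment of two empty masks is an assumed convention):

```python
def binary_iou(pred, target):
    """Intersection over union of two equally sized binary grids (lists of rows)."""
    if len(pred) != len(target) or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("pred and target must have the same shape")
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [1, 1, 0]]
iou = binary_iou(pred, target)  # 2 shared cells / 4 occupied cells = 0.5
```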

Dataset	Method	PQ	SQ	RQ
nuScenes	StretchBEV [14]	29.30	69.8	43.7
	FIERY Static [17]	27.64	70.05	39.08
	FIERY [17]	29.90	70.20	42.50
	PowerBEV [15]	30.98	70.97	42.86
	UniAD [16]	31.80	72.30	43.10
	ESF-BISP	31.89	71.63	43.97
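PQ, SQ and RQ are the panoptic, segmentation and recognition quality metrics. The sketch below uses the standard single-frame definition, in which matched segments are true positives and PQ = SQ × RQ. The function name and toy inputs are hypothetical, and the paper may aggregate over frames differently, which this section does not specify.

```python
def panoptic_quality(matched_ious, num_false_pos, num_false_neg):
    """Return (PQ, SQ, RQ) from matched-segment IoUs and unmatched counts.

    matched_ious: IoU of each true-positive match (conventionally IoU > 0.5).
    num_false_pos / num_false_neg: unmatched predictions / ground-truth segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_false_pos + 0.5 * num_false_neg
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0  # mean IoU over matches
    rq = tp / denom                             # F1-style detection score
    return sq * rq, sq, rq


# Two matches plus one spurious prediction and one missed instance.
pq, sq, rq = panoptic_quality([0.8, 0.6], num_false_pos=1, num_false_neg=1)
```

Separating SQ from RQ shows whether a gain in PQ comes from tighter masks or from fewer missed and spurious instances.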

Dataset	Method	IoU	PQ
nuScenes	VPN [9]	28.17	-
	Fishing Camera [12]	30.00	-
	LSS [10]	32.07	-
	MSTFB	39.16	-
	FIERY Static [17]	33.20	27.64
	FIERY [17]	39.70	29.90
	StretchBEV [14]	37.10	29.30
	UniAD [16]	39.91	30.50
	PowerBEV [15]	39.89	30.98
	ESF-BISP	40.10	31.89
Lyft	LSS [10]	44.64	-
	Fishing Lidar [12]	56.00	-
	FIERY [17]	59.40	36.70
	ESF-BISP	60.10	38.80

Method	Perception runtime	Prediction runtime	Total runtime
StretchBEV [14]	504	136	640
FIERY [17]	504	118	622
PowerBEV [15]	503	63	566
ESF-BISP	512	121	633
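The total runtime column is the sum of the perception and prediction columns, which a short consistency check can confirm. The values are copied from the table, and the time unit is not stated in this section, so the numbers are treated as unit-less. The helper name `totals_are_consistent` is hypothetical.

```python
# (method, perception time, prediction time, reported total), copied from the table.
RUNTIMES = [
    ("StretchBEV", 504, 136, 640),
    ("FIERY", 504, 118, 622),
    ("PowerBEV", 503, 63, 566),
    ("ESF-BISP", 512, 121, 633),
]


def totals_are_consistent(rows):
    """True if perception + prediction equals the reported total in every row."""
    return all(perc + pred == total for _, perc, pred, total in rows)


consistent = totals_are_consistent(RUNTIMES)
fastest = min(RUNTIMES, key=lambda row: row[3])[0]
```

On these rows the perception stage dominates the total for every method, so the differences between methods come mostly from the prediction stage.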