Future instance segmentation prediction based on bird's eye view of multi-source spatiotemporal information fusion

doi:10.13229/j.cnki.jdxbgxb.20231460

Abstract

To address the difficulty of identifying occluded objects and the limited robustness to noise and viewpoint changes in existing instance segmentation methods, this paper proposes a fine-grained bird's-eye view generation method based on multi-source spatiotemporal information (MSTFB). Starting from a rasterized bird's-eye view of the scene, the method uses a self-attention mechanism to fuse temporal bird's-eye view features into a fine-grained bird's-eye view of the scene, and employs a spatiotemporal cross-domain convolutional network to capture the relative position information between instances and to fuse multi-scale features. On this basis, a bird's-eye view instance segmentation prediction method based on encoding and sample fusion (ESF-BISP) is proposed. A convolutional gated recurrent unit (ConvGRU) encodes the temporal semantics of the historical frames to obtain temporal features; a conditional variational autoencoder (CVAE) models the state feature distribution of the fine-grained bird's-eye view of the current frame and samples bird's-eye view features from it; a Gaussian mixture model (GMM) fuses the temporal features with the sampled features; and the fused result is decoded into the fine-grained bird's-eye view of the future-frame scene. Experimental results on the public nuScenes dataset show that, compared with the baseline algorithm LSS, MSTFB improves the vehicle segmentation IoU by 7.09 percentage points (from 32.07% to 39.16%) and can effectively segment distant and occluded vehicles. ESF-BISP better captures the changes of dynamic instances in the scene and significantly outperforms the baseline algorithm in both instance segmentation and future instance segmentation prediction.
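The temporal fusion step can be illustrated for a single bird's-eye view grid cell. The following is a minimal sketch of scaled dot-product self-attention in plain Python, assuming identity query/key/value projections and the most recent frame as the query. The names `softmax` and `fuse_temporal_features` are hypothetical, and the sketch does not reproduce the network used in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_temporal_features(frames):
    """Fuse the feature vectors of one BEV cell across T frames (oldest first).

    Assumption: the latest frame acts as the query and no learned
    projections are applied (identity Q/K/V), purely for illustration.
    """
    query = frames[-1]
    scale = math.sqrt(len(query))
    # Similarity of the current frame to every frame, itself included.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
    weights = softmax(scores)
    # Attention-weighted sum of the per-frame features.
    fused = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(len(query))
    ]
    return fused, weights


if __name__ == "__main__":
    history = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    fused, weights = fuse_temporal_features(history)
    print(fused, weights)
```

Frames whose features resemble the current frame receive larger weights, so the fused feature leans on consistent observations across time.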

Key words: computer application technology, instance segmentation prediction, bird's eye view temporal encoding, multi-view images, spatiotemporal cross-domain convolutional networks

CLC Number: TP391.41

Xia FENG, Shuang CHEN, Min LU, Hai-chao ZUO. Future instance segmentation prediction based on bird's eye view of multi-source spatiotemporal information fusion[J]. Journal of Jilin University (Engineering and Technology Edition), 2025, 55(10): 3372-3383.

Table 5 Performance results of different modules for ESF-BISP

LSS	CE	T-attention	ST	FD	TD	IoU/%	PQ/%	SQ/%	RQ/%	Params/M	GFLOPs
√						32.07	-	-	-	6.7	14.20
√	√					33.54	-	-	-	6.8	14.39
√	√	√				35.83	-	-	-	7.1	15.76
√	√	√	√			39.16	28.53	70.20	41.25	9.1	19.53
√	√	√	√	√		39.65	30.62	70.95	43.09	9.8	20.32
√	√	√	√	√	√	40.10	31.89	71.63	43.97	10.5	21.67
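The contribution of each module in the ablation can be read off by differencing consecutive rows. The sketch below does this for the IoU and parameter columns. The row labels such as `"+CE"` and the function name `incremental_gains` are illustrative choices, while the numbers are copied from the table.

```python
# (configuration label, IoU in %, parameters in M, GFLOPs), copied from the table.
ABLATION = [
    ("LSS", 32.07, 6.7, 14.20),
    ("+CE", 33.54, 6.8, 14.39),
    ("+T-attention", 35.83, 7.1, 15.76),
    ("+ST", 39.16, 9.1, 19.53),
    ("+FD", 39.65, 9.8, 20.32),
    ("+TD", 40.10, 10.5, 21.67),
]


def incremental_gains(rows):
    """Return (label, IoU gain, parameter increase) relative to the previous row."""
    gains = []
    for previous, current in zip(rows, rows[1:]):
        iou_gain = round(current[1] - previous[1], 2)
        param_increase = round(current[2] - previous[2], 1)
        gains.append((current[0], iou_gain, param_increase))
    return gains


if __name__ == "__main__":
    for label, iou_gain, param_increase in incremental_gains(ABLATION):
        print(f"{label}: +{iou_gain} IoU points, +{param_increase} M parameters")
```

With the table values, the ST step gives the largest IoU gain (3.33 points) and also the largest parameter increase (2.0 M); the gains over all five steps sum to 8.03 points.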

References 25

[1]	Wang X, Girdhar R, Yu S X, et al. Cut and learn for unsupervised object detection and instance segmentation[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2023: 3124-3134.
[2]	Hurtik P, Molek V, Hula J, et al. Poly-YOLO: higher speed, more precise detection and instance segmentation for YOLOv3[J]. Neural Computing and Applications, 2022, 34(10): 8275-8290.
[3]	Mao Lin, Ren Feng-zhi, Yang Da-wei, et al. Two-way feature pyramid network for panoptic segmentation[J]. Journal of Jilin University (Engineering and Technology Edition), 2022, 52(3): 657-665.
[4]	Ke L, Danelljan M, Li X, et al. Mask transfiner for high-quality instance segmentation[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2022: 4412-4421.
[5]	Cheng T H, Wang X G, Chen S Y, et al. Boxteacher: Exploring high-quality pseudo labels for weakly supervised instance segmentation[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2023: 3145-3154.
[6]	Huo Guang, Lin Da-wei, Liu Yuan-ning, et al. Lightweight iris segmentation model based on multiscale feature and attention mechanism[J]. Journal of Jilin University (Engineering and Technology Edition), 2023, 53(9): 2591-2600.
[7]	Deng L Y, Yang M, Li H, et al. Restricted deformable convolution-based road scene semantic segmentation using surround view cameras[J]. IEEE Transactions on Intelligent Transportation Systems, 2019, 21(10): 4350-4362.
[8]	Lu C Y, van de Molengraft M J G, Dubbelman G. Monocular semantic occupancy grid mapping with convolutional variational encoder-decoder networks[J]. IEEE Robotics and Automation Letters, 2019, 4(2): 445-452.
[9]	Pan B, Sun J, Leung H Y T, et al. Cross-view semantic segmentation for sensing surroundings[J]. IEEE Robotics and Automation Letters, 2020, 5(3): 4867-4873.
[10]	Philion J, Fidler S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D[C]∥The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 194-210.
[11]	Khalil Y H, Mouftah H T. End-to-end multi-view fusion for enhanced perception and motion prediction[C]∥IEEE 94th Vehicular Technology Conference, Piscataway, USA, 2021: 1-6.
[12]	Hendy N, Sloan C, Tian F, et al. FISHING net: Future inference of semantic heatmaps in grids[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2020.
[13]	Ma Y, Wang T, Bai X, et al. Vision-centric BEV perception: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 1-20.
[14]	Akan A K, Güney F. Stretchbev: Stretching future instance prediction spatially and temporally[C]∥European Conference on Computer Vision, Tel Aviv, Israel, 2022: 444-460.
[15]	Li P I, Ding S X, Chen X Y L, et al. PowerBEV: a powerful yet lightweight framework for instance prediction in bird's-eye view[DB/OL]. [2023-10-22]..
[16]	Hu Y H, Yang J Z, Chen L, et al. Planning-oriented autonomous driving[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2023: 17853-17862.
[17]	Hu A, Murez Z, Mohan N, et al. FIERY: Future instance prediction in bird's-eye view from surround monocular cameras[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision, Piscataway, USA, 2021: 15273-15282.
[18]	Yuan F N, Zhang L, Xia X, et al. A gated recurrent network with dual classification assistance for smoke semantic segmentation[J]. IEEE Transactions on Image Processing, 2021, 30: 4409-4422.
[19]	Mao Y X, Zhang J, Xiang M C, et al. Multimodal variational auto-encoder based audio-visual segmentation[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision, Piscataway, USA, 2023: 954-965.
[20]	Tan M X, Le Q V. Efficientnet: Rethinking model scaling for convolutional neural networks[C]∥International Conference on Machine Learning, Long Beach, USA, 2019: 6105-6114.
[21]	Riaz F, Rehman S, Ajmal M, et al. Gaussian mixture model based probabilistic modeling of images for medical image segmentation[J]. IEEE Access, 2020, 8: 16846-16856.
[22]	Lyu S W, Fan Y B, Ying Y M, et al. Average top-k aggregate loss for supervised learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 44(1): 76-86.
[23]	Caesar H, Bankiti V, Lang A H, et al. nuScenes: A multimodal dataset for autonomous driving[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2020: 11621-11631.
[24]	Mandal S, Biswas S, Balas V E, et al. Lyft 3D object detection for autonomous vehicles[M]∥Rabindra Shaw. Artificial Intelligence for Future Generation Robotics. Amsterdam: Elsevier, 2021: 119-136.
[25]	Gong S, Ye X, Tan X Q, et al. GitNet: Geometric prior-based transformation for birds-eye-view segmentation[C]∥European Conference on Computer Vision, Tel Aviv, Israel, 2022: 396-411.

Parameter	Value
Epochs (epochs)	40
Batch size (batch_size)	4
Learning rate (learning_rate)	0.0003
Optimizer (optimizer)	Adam
Dropout (dropout)	Yes
Precision	32 bit
Latent sampling dimension (latent_dim)	32
Image depth (depth)	48
Loss function Top-k	25%
Image channels (channels)	64
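The Top-k entry in the parameter table can be illustrated with a small sketch. The table gives only the ratio (25%), so the reading below is an assumption in the spirit of the average top-k loss of reference [22]: only the hardest fraction of per-pixel losses is averaged. The function name `top_k_mean_loss` and the rule of keeping at least one pixel are hypothetical.

```python
def top_k_mean_loss(pixel_losses, top_k_ratio=0.25):
    """Average only the largest per-pixel losses.

    Assumption: "Top-k 25%" means the hardest 25% of pixels contribute
    to the loss; the source table does not spell this out.
    """
    if not pixel_losses:
        raise ValueError("pixel_losses must not be empty")
    if not 0.0 < top_k_ratio <= 1.0:
        raise ValueError("top_k_ratio must be in (0, 1]")
    # Keep at least one pixel so the mean is always defined.
    k = max(1, int(len(pixel_losses) * top_k_ratio))
    hardest = sorted(pixel_losses, reverse=True)[:k]
    return sum(hardest) / k


if __name__ == "__main__":
    losses = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.4, 0.6]
    print(top_k_mean_loss(losses))       # mean of the two largest losses
    print(top_k_mean_loss(losses, 1.0))  # plain mean over all pixels
```

Focusing the loss on the hardest pixels is a common way to counter the dominance of easy background cells in a bird's-eye view grid.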

Method	Vehicle instance segmentation IoU/%	Drivable area segmentation IoU/%	Lane line segmentation IoU/%
IPM [7]	10.30	40.10	14.10
VED [8]	23.28	60.82	16.74
VPN [9]	28.17	65.97	17.05
Fishing Camera [12]	30.06	68.30	24.60
Fishing Lidar [12]	40.30	71.20	31.70
GitNet [25]	34.20	65.10	31.90
LSS [10]	32.07	72.23	19.96
MSTFB	39.16	76.82	38.20
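The IoU values in this comparison measure the overlap between predicted and ground-truth occupancy on the bird's-eye view grid. The sketch below computes IoU for two small binary grids. The function name `bev_iou` is hypothetical, and returning 0.0 when both grids are empty is an assumed convention.

```python
def bev_iou(prediction, target):
    """Intersection over union of two binary BEV grids (lists of rows)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(prediction, target):
        for pred_cell, target_cell in zip(pred_row, target_row):
            if pred_cell and target_cell:
                intersection += 1
            if pred_cell or target_cell:
                union += 1
    # Assumed convention: an empty union scores 0.0.
    return intersection / union if union else 0.0


if __name__ == "__main__":
    predicted = [[1, 1, 0],
                 [0, 1, 0]]
    ground_truth = [[1, 0, 0],
                    [0, 1, 1]]
    print(bev_iou(predicted, ground_truth))  # 2 shared cells out of 4 occupied
```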

Dataset	Method	PQ/%	SQ/%	RQ/%
nuScenes	StretchBEV [14]	29.30	69.8	43.7
	FIERY Static [17]	27.64	70.05	39.08
	FIERY [17]	29.90	70.20	42.50
	PowerBEV [15]	30.98	70.97	42.86
	UniAD [16]	31.80	72.30	43.10
	ESF-BISP	31.89	71.63	43.97
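The three metrics in this table are related: under the standard panoptic quality definition, PQ is the product of segmentation quality (SQ) and recognition quality (RQ) within a single evaluation. The sketch below assumes instances have already been matched (conventionally at IoU above 0.5); the function name `panoptic_quality` is hypothetical, and the paper's exact evaluation protocol is not reproduced.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Return (PQ, SQ, RQ) from the IoUs of matched (true positive) instances."""
    true_positives = len(matched_ious)
    denominator = (
        true_positives + 0.5 * num_false_positives + 0.5 * num_false_negatives
    )
    if denominator == 0:
        return 0.0, 0.0, 0.0
    # SQ: mean IoU over matched instances.
    sq = sum(matched_ious) / true_positives if true_positives else 0.0
    # RQ: F1-style detection score.
    rq = true_positives / denominator
    return sq * rq, sq, rq


if __name__ == "__main__":
    pq, sq, rq = panoptic_quality([0.8, 0.6], num_false_positives=1,
                                  num_false_negatives=1)
    print(pq, sq, rq)
```

Values averaged over frames or scenes need not satisfy the product exactly, which is consistent with the reported PQ differing slightly from SQ × RQ in the table.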

Dataset	Method	IoU/%	PQ/%
nuScenes	VPN [9]	28.17	-
	Fishing Camera [12]	30.00	-
	LSS [10]	32.07	-
	MSTFB	39.16	-
	FIERY Static [17]	33.20	27.64
	FIERY [17]	39.70	29.90
	StretchBEV [14]	37.10	29.30
	UniAD [16]	39.91	30.50
	PowerBEV [15]	39.89	30.98
	ESF-BISP	40.10	31.89
Lyft	LSS [10]	44.64	-
	Fishing Lidar [12]	56.00	-
	FIERY [17]	59.40	36.70
	ESF-BISP	60.10	38.80

Method	Perception runtime	Prediction runtime	Total runtime
StretchBEV [14]	504	136	640
FIERY [17]	504	118	622
PowerBEV [15]	503	63	566
ESF-BISP	512	121	633
