Journal of Jilin University(Engineering and Technology Edition) ›› 2025, Vol. 55 ›› Issue (10): 3372-3383.doi: 10.13229/j.cnki.jdxbgxb.20231460


Future instance segmentation prediction based on bird’s eye view of multi-source spatiotemporal information fusion

Xia FENG1,2,3, Shuang CHEN2,3, Min LU2,3, Hai-chao ZUO2,3

  1. Institute of Science and Technology Innovation, Civil Aviation University of China, Tianjin 300300, China
  2. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
  3. Key Laboratory of Civil Aviation Smart Airport Theory and System, Civil Aviation University of China, Tianjin 300300, China
  • Received: 2023-12-30 Online: 2025-10-01 Published: 2026-02-03

Abstract:

To address the difficulty of identifying occluded objects and the limited robustness to noise and viewpoint changes in existing instance segmentation methods, this paper proposes a fine-grained bird's-eye view generation method based on multi-source spatiotemporal information (MSTFB). Starting from a rasterized bird's-eye view of the scene, the method uses a self-attention mechanism to fuse temporal bird's-eye view features into a fine-grained bird's-eye view of the scene, and employs a spatiotemporal cross-domain convolutional network to capture the relative position information between instances and to fuse multi-scale features. On this basis, a bird's-eye view instance segmentation prediction method based on encoding and sample fusion (ESF-BISP) is proposed. ConvGRU encodes the temporal semantics of the historical frames into time-series features; CVAE models the state feature distribution of the fine-grained bird's-eye view of the current frame and samples bird's-eye view sample features; GMM fuses the time-series features with the sampled features, and the fused result is decoded into the fine-grained bird's-eye view of the future frames. Experimental results on the public nuScenes dataset show that, compared with the baseline algorithm LSS, MSTFB improves the vehicle segmentation IoU by 7.09 percentage points (from 32.07 to 39.16) and effectively segments distant and occluded vehicles. ESF-BISP better captures the changes of dynamic instances in the scene and significantly outperforms the baseline in both instance segmentation and future instance segmentation prediction.
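The sampling step named in the abstract can be illustrated with a minimal sketch. It assumes the CVAE parameterizes a diagonal Gaussian by a mean and a log-variance and draws samples with the reparameterization trick, which is the common VAE formulation; the abstract does not confirm these details. The function name `sample_latent` and the vector size are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    Assumption (not from the paper): the latent distribution is a
    diagonal Gaussian described by a mean and a log-variance vector.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]


rng = random.Random(0)            # seeded for repeatability
latent_dim = 32                   # illustrative size only
mu = [0.5] * latent_dim           # hypothetical predicted means
logvar = [-2.0] * latent_dim      # hypothetical predicted log-variances
z = sample_latent(mu, logvar, rng)
print(len(z))
```

Each call with a fresh random draw yields a different sample, which is how a model of this kind could produce several plausible future states from one current-frame distribution.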

Key words: computer application technology, instance segmentation prediction, bird's eye view temporal encoding, multi-view images, spatiotemporal cross-domain convolutional networks

CLC Number: 

  • TP391.41

Fig.1

Bird's-eye view generation

Fig.2

Multi-scale feature fusion based on spatiotemporal cross-domain convolutional network

Fig.3

Bird's-eye view instance segmentation prediction

Fig.4

Time-series prediction process for the bird's-eye view

Table 1

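Main parameter settings in ESF-BISP

| Parameter | Value |
| --- | --- |
| Number of epochs (epochs) | 40 |
| Batch size (batch_size) | 4 |
| Learning rate (learning_rate) | 0.0003 |
| Optimizer (optimizer) | Adam |
| Dropout (dropout) | Yes |
| Precision | 32 bit |
| Distribution sampling dimension (latent_dim) | 32 |
| Number of image channels (channels) | 64 |
| Image depth (depth) | 48 |
| Loss function (Top-k) | 25% |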
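Table 1 lists a Top-k loss with a ratio of 25%. One common reading of such a setting is that only the hardest 25% of per-pixel losses contribute to the training objective. The sketch below illustrates that reading under stated assumptions: the function name `top_k_mean_loss` is hypothetical, the per-pixel losses are made-up numbers, and the paper's exact loss formulation may differ.

```python
import math


def top_k_mean_loss(pixel_losses, top_k_ratio=0.25):
    """Average only the largest `top_k_ratio` fraction of per-pixel losses.

    Assumption (not from the paper): "Top-k 25%" means keeping the
    hardest quarter of pixels and averaging their losses.
    """
    if not pixel_losses:
        raise ValueError("pixel_losses must not be empty")
    k = max(1, math.ceil(top_k_ratio * len(pixel_losses)))
    hardest = sorted(pixel_losses, reverse=True)[:k]
    return sum(hardest) / k


# Eight hypothetical per-pixel losses; with a 25% ratio, k = 2,
# so only the two largest values (0.9 and 0.7) are averaged.
losses = [0.1, 0.9, 0.3, 0.05, 0.7, 0.2, 0.4, 0.6]
result = top_k_mean_loss(losses, top_k_ratio=0.25)
print(round(result, 3))
```

Restricting the average to the hardest pixels keeps the large number of easy background cells in a bird's-eye view grid from dominating the gradient.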

Table 2

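Comparison of IoU results for bird's-eye view instance segmentation

| Method | Vehicle instance segmentation IoU | Drivable area segmentation IoU | Lane line segmentation IoU |
| --- | --- | --- | --- |
| IPM[7] | 10.30 | 40.10 | 14.10 |
| VED[8] | 23.28 | 60.82 | 16.74 |
| VPN[9] | 28.17 | 65.97 | 17.05 |
| Fishing Camera[12] | 30.06 | 68.30 | 24.60 |
| Fishing Lidar[12] | 40.30 | 71.20 | 31.70 |
| GitNet[25] | 34.20 | 65.10 | 31.90 |
| LSS[10] | 32.07 | 72.23 | 19.96 |
| MSTFB | 39.16 | 76.82 | 38.20 |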
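The IoU metric reported in Table 2 can be sketched on a toy bird's-eye view grid. The helper `binary_iou` and the two small masks are hypothetical; only the LSS and MSTFB vehicle IoU values (32.07 and 39.16) come from the table. The second part of the sketch separates the absolute gain in percentage points from the relative gain.

```python
def binary_iou(pred, target):
    """IoU between two equally sized binary grids given as lists of rows."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention chosen here: two empty masks score 0.0.
    return intersection / union if union else 0.0


# Toy 3x4 occupancy grids: 3 cells overlap, 5 cells are occupied in total.
pred = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
target = [[0, 0, 1, 1], [0, 1, 1, 0], [0, 0, 0, 0]]
toy_iou = binary_iou(pred, target)

# Vehicle IoU values taken from Table 2.
lss_iou, mstfb_iou = 32.07, 39.16
absolute_gain = mstfb_iou - lss_iou               # percentage points
relative_gain = 100.0 * absolute_gain / lss_iou   # percent of the LSS score
print(round(toy_iou, 2), round(absolute_gain, 2), round(relative_gain, 1))
```

The 7.09 figure quoted for MSTFB corresponds to the absolute difference between the two table entries; expressed relative to the LSS score, the gain is roughly 22%.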

Table 3

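Comparison of future instance segmentation prediction results

| Dataset | Method | PQ | SQ | RQ |
| --- | --- | --- | --- | --- |
| nuScenes | StretchBEV[14] | 29.30 | 69.8 | 43.7 |
| nuScenes | FIERY Static[17] | 27.64 | 70.05 | 39.08 |
| nuScenes | FIERY[17] | 29.90 | 70.20 | 42.50 |
| nuScenes | PowerBEV[15] | 30.98 | 70.97 | 42.86 |
| nuScenes | UniAD[16] | 31.80 | 72.30 | 43.10 |
| nuScenes | ESF-BISP | 31.89 | 71.63 | 43.97 |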
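The PQ, SQ and RQ columns in Table 3 follow the panoptic quality family of metrics. The sketch below uses the standard single-frame definition, in which a prediction and a ground-truth instance are matched when their IoU exceeds 0.5. The function name and the example numbers are hypothetical, and the values in the table may be aggregated over frames or sequences, so they need not satisfy PQ = SQ × RQ exactly.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) from matched-pair IoUs and error counts.

    matched_ious: IoU of each true-positive match (IoU > 0.5 by convention)
    num_fp: predicted instances without a match
    num_fn: ground-truth instances without a match
    """
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / num_tp if num_tp else 0.0  # segmentation quality
    rq = num_tp / denom                                 # recognition quality
    return sq * rq, sq, rq


# Hypothetical frame: three matched vehicles, one spurious, two missed.
pq, sq, rq = panoptic_quality([0.8, 0.7, 0.6], num_fp=1, num_fn=2)
print(round(pq, 3), round(sq, 3), round(rq, 3))
```

SQ rewards accurate masks for the instances that were found, while RQ behaves like an F1 score over instances, so a method can lead on one and trail on the other, as the UniAD and ESF-BISP rows show.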

Table 4

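Comparison of instance segmentation IoU results and prediction PQ results

| Dataset | Method | IoU | PQ |
| --- | --- | --- | --- |
| nuScenes | VPN[9] | 28.17 | - |
| nuScenes | Fishing Camera[12] | 30.00 | - |
| nuScenes | LSS[10] | 32.07 | - |
| nuScenes | MSTFB | 39.16 | - |
| nuScenes | FIERY Static[17] | 33.20 | 27.64 |
| nuScenes | FIERY[17] | 39.70 | 29.90 |
| nuScenes | StretchBEV[14] | 37.10 | 29.30 |
| nuScenes | UniAD[16] | 39.91 | 30.50 |
| nuScenes | PowerBEV[15] | 39.89 | 30.98 |
| nuScenes | ESF-BISP | 40.10 | 31.89 |
| Lyft | LSS[10] | 44.64 | - |
| Lyft | Fishing Lidar[12] | 56.00 | - |
| Lyft | FIERY[17] | 59.40 | 36.70 |
| Lyft | ESF-BISP | 60.10 | 38.80 |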

Table 5

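Performance results of different modules for ESF-BISP

| LSS / CE / T-attention / STFD / TD | IoU/% | PQ/% | SQ/% | RQ/% | Params/M | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
|  | 32.07 | - | - | - | 6.7 | 14.20 |
|  | 33.54 | - | - | - | 6.8 | 14.39 |
|  | 35.83 | - | - | - | 7.1 | 15.76 |
|  | 39.16 | 28.53 | 70.20 | 41.25 | 9.1 | 19.53 |
|  | 39.65 | 30.62 | 70.95 | 43.09 | 9.8 | 20.32 |
|  | 40.10 | 31.89 | 71.63 | 43.97 | 10.5 | 21.67 |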

Table 6

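Comparison of runtime analysis

| Method | Perception runtime | Prediction runtime | Total runtime |
| --- | --- | --- | --- |
| StretchBEV[14] | 504 | 136 | 640 |
| FIERY[17] | 504 | 118 | 622 |
| PowerBEV[15] | 503 | 63 | 566 |
| ESF-BISP | 512 | 121 | 633 |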

Fig.5

Comparison of visualization results on nuScenes dataset

[1] Wang X, Girdhar R, Yu S X, et al. Cut and learn for unsupervised object detection and instance segmentation[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2023: 3124-3134.
[2] Hurtik P, Molek V, Hula J, et al. Poly-YOLO: higher speed, more precise detection and instance segmentation for YOLOv3[J]. Neural Computing and Applications, 2022, 34(10): 8275-8290.
[3] Mao Lin, Ren Feng-zhi, Yang Da-wei, et al. Two-way feature pyramid network for panoptic segmentation[J]. Journal of Jilin University(Engineering and Technology Edition), 2022, 52(3): 657-665.
[4] Ke L, Danelljan M, Li X, et al. Mask transfiner for high-quality instance segmentation[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2022: 4412-4421.
[5] Cheng T H, Wang X G, Chen S Y, et al. Boxteacher: Exploring high-quality pseudo labels for weakly supervised instance segmentation[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2023: 3145-3154.
[6] Huo Guang, Lin Da-wei, Liu Yuan-ning, et al. Lightweight iris segmentation model based on multiscale feature and attention mechanism[J]. Journal of Jilin University(Engineering and Technology Edition), 2023, 53(9): 2591-2600.
[7] Deng L Y, Yang M, Li H, et al. Restricted deformable convolution-based road scene semantic segmentation using surround view cameras[J]. IEEE Transactions on Intelligent Transportation Systems, 2019, 21(10): 4350-4362.
[8] Lu C Y, van de Molengraft M J G, Dubbelman G. Monocular semantic occupancy grid mapping with convolutional variational encoder-decoder networks[J]. IEEE Robotics and Automation Letters, 2019, 4(2): 445-452.
[9] Pan B, Sun J, Leung H Y T, et al. Cross-view semantic segmentation for sensing surroundings[J]. IEEE Robotics and Automation Letters, 2020, 5(3): 4867-4873.
[10] Philion J, Fidler S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D[C]∥The 16th European Conference on Computer Vision, Glasgow, UK, 2020: 194-210.
[11] Khalil Y H, Mouftah H T. End-to-end multi-view fusion for enhanced perception and motion prediction[C]∥IEEE 94th Vehicular Technology Conference, Piscataway, USA, 2021: 1-6.
[12] Hendy N, Sloan C, Tian F, et al. FISHING net: Future inference of semantic heatmaps in grids[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2020.
[13] Ma Y, Wang T, Bai X, et al. Vision-centric BEV perception: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 1-20.
[14] Akan A K, Güney F. Stretchbev: Stretching future instance prediction spatially and temporally[C]∥European Conference on Computer Vision, Tel Aviv, Israel, 2022: 444-460.
[15] Li P I, Ding S X, Chen X Y L, et al. PowerBEV: a powerful yet lightweight framework for instance prediction in bird's-eye view[DB/OL]. [2023-10-22]..
[16] Hu Y H, Yang J Z, Chen L, et al. Planning-oriented autonomous driving[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2023: 17853-17862.
[17] Hu A, Murez Z, Mohan N, et al. FIERY: Future instance prediction in bird's-eye view from surround monocular cameras[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision, Piscataway, USA, 2021: 15273-15282.
[18] Yuan F N, Zhang L, Xia X, et al. A gated recurrent network with dual classification assistance for smoke semantic segmentation[J]. IEEE Transactions on Image Processing, 2021, 30: 4409-4422.
[19] Mao Y X, Zhang J, Xiang M C, et al. Multimodal variational auto-encoder based audio-visual segmentation[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision, Piscataway, USA, 2023: 954-965.
[20] Tan M X, Le Q V. Efficientnet: Rethinking model scaling for convolutional neural networks[C]∥International Conference on Machine Learning, Long Beach, USA, 2019: 6105-6114.
[21] Riaz F, Rehman S, Ajmal M, et al. Gaussian mixture model based probabilistic modeling of images for medical image segmentation[J]. IEEE Access, 2020, 8: 16846-16856.
[22] Lyu S W, Fan Y B, Ying Y M, et al. Average top-k aggregate loss for supervised learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 44(1): 76-86.
[23] Caesar H, Bankiti V, Lang A H, et al. nuscenes: A multimodal dataset for autonomous driving[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Piscataway, USA, 2020: 11621-11631.
[24] Mandal S, Biswas S, Balas V E, et al. Lyft 3D object detection for autonomous vehicles[M]∥Rabindra Shaw,Artificial Intelligence for Future Generation Robotics: Amsterdam: Elsevier, 2021: 119-136.
[25] Gong S, Ye X, Tan X Q, et al. GitNet: Geometric prior-based transformation for birds-eye-view segmentation[C]∥European Conference on Computer Vision, Tel Aviv, Israel, 2022: 396-411.