Journal of Jilin University(Engineering and Technology Edition) ›› 2025, Vol. 55 ›› Issue (5): 1682-1691.doi: 10.13229/j.cnki.jdxbgxb.20230820


Self-supervised monocular depth estimation based on improved densenet and wavelet decomposition

De-qiang CHENG1, Wei-chen WANG1, Cheng-gong HAN1, Chen LYU1, Qi-qi KOU2

  1. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
  2. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China

  • Received: 2023-08-04  Online: 2025-05-01  Published: 2025-07-18
  • Contact: Qi-qi KOU  E-mail: chengdq@cumt.edu.cn; kouqiqi@cumt.edu.cn

Abstract:

Traditional self-supervised monocular depth estimation models have limited ability to extract and fuse shallow features, which leads to missed detection of small objects and blurred object edges. To address these problems, this paper proposes a self-supervised monocular depth estimation model based on an improved dense network and wavelet decomposition. The overall framework follows the U-Net structure: the encoder adopts an improved DenseNet to strengthen feature extraction and fusion; a detail enhancement module (DEM) is introduced in the skip connections to further refine and integrate the multi-scale features produced by the encoder; and the decoder incorporates wavelet decomposition, enabling it to focus on high-frequency information during decoding and achieve precise edge refinement. Experimental results demonstrate that the proposed method captures the depth of small objects more reliably and produces depth maps with clearer, more accurate edges.

Key words: signal and information processing, depth estimation, self-supervision, densenet, wavelet decomposition, detail enhancement

CLC Number: TP391.41

Fig.1

The depth estimation network architecture in this paper

Table 1

Encoder network structure parameters

| Layer | Scale | DenseNet121-D | DenseNet169-D |
|---|---|---|---|
| Convolution | S/2 | 7×7 conv, 64, stride 2 | 7×7 conv, 64, stride 2 |
| Dense block (1) | S/2 | [1×1 conv, 3×3 conv] ×6, stride 1 | [1×1 conv, 3×3 conv] ×6, stride 1 |
| Transition layer (1) | S/4 | 1×1 conv, 128, stride 1; 2×2 average pooling, stride 2 | 1×1 conv, 128, stride 1; 2×2 average pooling, stride 2 |
| Dense block (2) | S/4 | [1×1 conv, 3×3 conv] ×12, stride 1 | [1×1 conv, 3×3 conv] ×12, stride 1 |
| Transition layer (2) | S/8 | 1×1 conv, 256, stride 1; 2×2 average pooling, stride 2 | 1×1 conv, 256, stride 1; 2×2 average pooling, stride 2 |
| Dense block (3) | S/8 | [1×1 conv, 3×3 conv] ×24, stride 1 | [1×1 conv, 3×3 conv] ×32, stride 1 |
| Transition layer (3) | S/16 | 1×1 conv, 512, stride 1; 2×2 average pooling, stride 2 | 1×1 conv, 640, stride 1; 2×2 average pooling, stride 2 |
| Dense block (4) | S/16 | [1×1 conv, 3×3 conv] ×16, stride 1, 1024 | [1×1 conv, 3×3 conv] ×32, stride 1, 1664 |
| Downsample layer | S/32 | Batch norm, ReLU, 2×2 average pooling, stride 2 | Batch norm, ReLU, 2×2 average pooling, stride 2 |
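The channel widths in Table 1 follow directly from standard DenseNet bookkeeping: each dense-block layer appends a fixed number of feature maps (growth rate 32), and each transition layer halves the channel count with a 1×1 convolution. A minimal sketch (the function name and trace format are illustrative, not from the paper's code):

```python
# Channel bookkeeping for DenseNet-style encoders, reproducing the
# widths in Table 1. Growth rate 32 and half-width transitions are the
# standard DenseNet settings; block sizes follow DenseNet121/169.

def densenet_channels(block_layers, init=64, growth=32, reduction=0.5):
    """Return (stage name, channel count) after each dense block / transition."""
    ch = init
    trace = []
    for i, n in enumerate(block_layers):
        ch += n * growth                     # each layer adds `growth` feature maps
        trace.append((f"dense_block_{i+1}", ch))
        if i < len(block_layers) - 1:        # no transition after the last block
            ch = int(ch * reduction)         # 1x1 conv halves the channels
            trace.append((f"transition_{i+1}", ch))
    return trace

print(densenet_channels([6, 12, 24, 16]))  # DenseNet121: 128, 256, 512 transitions, 1024 final
print(densenet_channels([6, 12, 32, 32]))  # DenseNet169: 128, 256, 640 transitions, 1664 final
```

Running it reproduces the table's per-stage widths, including the 640-channel transition and 1664-channel final block that distinguish DenseNet169-D.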

Fig.2

DEM structure

Fig.3

Inverse discrete wavelet transform
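The decoder's wavelet step (Fig.3) merges a low-frequency sub-band with three high-frequency detail sub-bands to double the spatial resolution. A toy single-level 2-D Haar transform makes the mechanics concrete; this is an illustrative sketch with one common normalization convention, not the paper's implementation:

```python
# Toy single-level 2-D Haar DWT and its inverse on nested Python lists,
# illustrating how the decoder recombines one low-frequency sub-band (cA)
# with three high-frequency sub-bands (cH, cV, cD) to upsample by 2x.

def haar_dwt2(img):
    """Split an even-sized 2-D array into (cA, cH, cV, cD) sub-bands."""
    h, w = len(img) // 2, len(img[0]) // 2
    cA = [[0.0] * w for _ in range(h)]
    cH = [[0.0] * w for _ in range(h)]
    cV = [[0.0] * w for _ in range(h)]
    cD = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            a, b = img[2*i][2*j], img[2*i][2*j+1]
            c, d = img[2*i+1][2*j], img[2*i+1][2*j+1]
            cA[i][j] = (a + b + c + d) / 4   # low-frequency average
            cH[i][j] = (a + b - c - d) / 4   # horizontal detail
            cV[i][j] = (a - b + c - d) / 4   # vertical detail
            cD[i][j] = (a - b - c + d) / 4   # diagonal detail
    return cA, cH, cV, cD

def haar_idwt2(cA, cH, cV, cD):
    """Inverse transform: merge the four sub-bands back into the full image."""
    h, w = len(cA), len(cA[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, dh, dv, dd = cA[i][j], cH[i][j], cV[i][j], cD[i][j]
            img[2*i][2*j]     = s + dh + dv + dd
            img[2*i][2*j+1]   = s + dh - dv - dd
            img[2*i+1][2*j]   = s - dh + dv - dd
            img[2*i+1][2*j+1] = s - dh - dv + dd
    return img
```

The round trip `haar_idwt2(*haar_dwt2(img))` recovers the input exactly, which is why predicting the high-frequency sub-bands lets the decoder sharpen edges without losing the coarse depth estimate.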

Fig.4

Self-supervised depth estimation framework

Table 2

Test results on the KITTI dataset

| Method | Supervision | AbsRel | SqRel | RMSE | RMSElog | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|---|---|
| Garg[6] | S | 0.152 | 1.226 | 5.849 | 0.246 | 0.784 | 0.921 | 0.967 |
| Monodepth R50[13] | S | 0.133 | 1.142 | 5.533 | 0.230 | 0.830 | 0.936 | 0.970 |
| StrAT[28] | S | 0.128 | 1.019 | 5.403 | 0.227 | 0.827 | 0.935 | 0.971 |
| 3Net R50[29] | S | 0.129 | 0.996 | 5.281 | 0.223 | 0.831 | 0.939 | 0.974 |
| 3Net VGG[29] | S | 0.119 | 1.201 | 5.888 | 0.208 | 0.844 | 0.941 | 0.978 |
| SuperDepth[30] | S | 0.112 | 0.875 | 4.958 | 0.207 | 0.852 | 0.947 | 0.977 |
| VA-Depth[16] | M | 0.112 | 0.864 | 4.804 | 0.190 | 0.877 | 0.959 | 0.982 |
| Zeeshan[17] | M | 0.113 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
| STDepthFormer[18] | M | 0.110 | 0.805 | 4.678 | 0.187 | 0.878 | 0.961 | 0.983 |
| Monodepth2[8] | S | 0.109 | 0.873 | 4.960 | 0.209 | 0.864 | 0.948 | 0.975 |
| WaveletMonodepth[21] (baseline) | S | 0.110 | 0.876 | 4.916 | 0.206 | 0.864 | 0.950 | 0.976 |
| Ours | S | 0.103 | 0.801 | 4.727 | 0.201 | 0.877 | 0.953 | 0.976 |
| Monodepth2 (HD)[8] | S | 0.107 | 0.849 | 4.764 | 0.201 | 0.874 | 0.953 | 0.977 |
| WaveletMonodepth (HD)[21] | S | 0.105 | 0.797 | 4.732 | 0.203 | 0.869 | 0.952 | 0.977 |
| Ours (HD) | S | 0.097 | 0.726 | 4.531 | 0.195 | 0.884 | 0.955 | 0.978 |
| Monodepth2[8] | MS | 0.106 | 0.818 | 4.750 | 0.196 | 0.874 | 0.957 | 0.979 |
| WaveletMonodepth[21] | MS | 0.109 | 0.814 | 4.808 | 0.198 | 0.868 | 0.955 | 0.980 |
| Ours | MS | 0.100 | 0.731 | 4.536 | 0.190 | 0.882 | 0.959 | 0.980 |
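The error and accuracy columns in Tables 2–5 are the standard monocular-depth metrics: AbsRel, SqRel, RMSE, RMSElog, and the three δ-threshold accuracies. A plain-Python sketch of their textbook definitions (real evaluations apply them over masked, capped depth maps; the function name is illustrative):

```python
import math

# Standard monocular depth estimation metrics, as reported in Table 2.
# pred and gt are flat lists of positive depth values (metres).

def depth_metrics(pred, gt):
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(g)) ** 2
                             for p, g in zip(pred, gt)) / n)
    # delta accuracy: fraction of pixels where max(p/g, g/p) < 1.25^k
    deltas = [sum(max(p / g, g / p) < 1.25 ** k for p, g in zip(pred, gt)) / n
              for k in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, deltas
```

Lower is better for the four error columns; higher is better for the three δ columns, which is why a perfect prediction scores 0 on every error and 1.0 on every δ.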

Table 3

Migration test results on the CityScapes dataset

| Method | AbsRel | SqRel | RMSE | RMSElog |
|---|---|---|---|---|
| Monodepth R50[13] | 0.210 | 2.230 | 9.430 | 0.311 |
| Monodepth2[8] | 0.182 | 1.880 | 8.870 | 0.253 |
| WaveletMonodepth[21] | 0.185 | 1.950 | 8.659 | 0.248 |
| Ours | 0.175 | 1.831 | 8.590 | 0.242 |

Fig.5

Comparison of visualization results on the KITTI dataset

Fig.6

Comparison of visualization results on the CityScapes dataset

Table 4

Ablation experiment of modules

| Experiment | DenseNet-D | DEM | Wavelet | AbsRel | SqRel | RMSE | RMSElog | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | × | × | × | 0.108 | 0.842 | 4.891 | 0.207 | 0.865 | 0.949 | 0.976 |
| 2 | × | × | √ | 0.110 | 0.876 | 4.916 | 0.206 | 0.864 | 0.950 | 0.976 |
| 3 | √ | × | √ | 0.107 | 0.865 | 4.891 | 0.204 | 0.867 | 0.950 | 0.976 |
| 4 | × | √ | √ | 0.105 | 0.810 | 4.739 | 0.202 | 0.876 | 0.952 | 0.976 |
| 5 | √ | √ | √ | 0.103 | 0.801 | 4.727 | 0.201 | 0.877 | 0.953 | 0.976 |

Table 5

Ablation experiment of encoder design

| Experiment | Encoder | AbsRel | SqRel | RMSE | RMSElog | δ<1.25 | δ<1.25² | δ<1.25³ |
|---|---|---|---|---|---|---|---|---|
| 1 | DenseNet121 | 0.106 | 0.862 | 4.882 | 0.203 | 0.872 | 0.951 | 0.976 |
| 2 | DenseNet169 | 0.108 | 0.892 | 4.972 | 0.205 | 0.871 | 0.950 | 0.976 |
| 3 | DenseNet121-D | 0.103 | 0.801 | 4.727 | 0.201 | 0.877 | 0.953 | 0.976 |
| 4 | DenseNet169-D | 0.102 | 0.797 | 4.745 | 0.200 | 0.879 | 0.953 | 0.976 |
[1] Wang Xin-zhu, Li Jun, Li Hong-jian, et al. Obstacle detection based on 3D laser scanner and range image for intelligent vehicle[J]. Journal of Jilin University (Engineering and Technology Edition), 2016, 46(2): 360-365.
[2] Zhang Yu-xiang, Ren Shuang. Overview of the application of location technology in virtual reality[J]. Computer Science, 2021, 48(1): 308-318.
[3] Shi Xiao-gang, Xue Zheng-hui, Li Hui-hui, et al. Overview of augmented reality display technology[J]. China Optics, 2021, 14(5): 1146-1161.
[4] Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture[C]∥2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015: 2650-2658.
[5] Fu H, Gong M, Wang C, et al. Deep ordinal regression network for monocular depth estimation[C]∥2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA, 2018: 2002-2011.
[6] Garg R, Vijay Kumar B G, Carneiro G, et al. Unsupervised CNN for single view depth estimation: geometry to the rescue[C]∥European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016: 740-756.
[7] Zhou T, Brown M, Snavely N, et al. Unsupervised learning of depth and ego-motion from video[C]∥2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017: 1851-1858.
[8] Godard C, Mac Aodha O, Firman M, et al. Digging into self-supervised monocular depth estimation[C]∥2019 IEEE International Conference on Computer Vision (ICCV), Seoul, South Korea, 2019: 3828-3838.
[9] Saxena A, Sun M, Ng A Y. Make3D: learning 3D scene structure from a single still image[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 31(5): 824-840.
[10] Eigen D, Puhrsch C, Fergus R. Depth map prediction from a single image using a multi-scale deep network[C]∥Advances in Neural Information Processing Systems, Montreal, Canada, 2014: 2366-2374.
[11] Teed Z, Deng J. DeepV2D: video to depth with differentiable structure from motion[C]∥International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020: arXiv: 1812.04605.
[12] Ummenhofer B, Zhou H, Uhrig J, et al. DeMoN: depth and motion network for learning monocular stereo[C]∥2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017: 5038-5047.
[13] Godard C, Mac Aodha O, Brostow G J. Unsupervised monocular depth estimation with left-right consistency[C]∥2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017: 270-279.
[14] Bian J W, Li Z C, Wang N, et al. Unsupervised scale-consistent depth and ego-motion learning from monocular video[C]∥33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019: 1-12.
[15] Han C, Cheng D, Kou Q, et al. Self-supervised monocular depth estimation with multi-scale structure similarity loss[J]. Multimedia Tools and Applications, 2022, 31: 3251-3266.
[16] Xiang J, Wang Y, An L, et al. Visual attention-based self-supervised absolute depth estimation using geometric priors in autonomous driving[J/OL]. (2022-10-06)[2023-06-13].
[17] Suri Z K. Pose constraints for consistent self-supervised monocular depth and ego-motion[J/OL]. (2023-04-18)[2023-06-13].
[18] Boulahbal H, Voicila A, Comport A. STDepthFormer: predicting spatio-temporal depth from video with a self-supervised transformer model[C]∥IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, USA, 2023: arXiv: 2303.01196.
[19] Poggi M, Aleotti F, Tosi F, et al. Towards real-time unsupervised monocular depth estimation on CPU[C]∥2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018: 5848-5854.
[20] Wofk D, Ma F, Yang T J, et al. FastDepth: fast monocular depth estimation on embedded systems[C]∥2019 International Conference on Robotics and Automation (ICRA), Montreal, Canada, 2019: 6101-6108.
[21] Ramamonjisoa M, Firman M, Watson J, et al. Single image depth prediction with wavelet decomposition[C]∥2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 2021: 11089-11098.
[22] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation[C]∥International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 2015: 234-241.
[23] Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks[C]∥2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, USA, 2017: 2261-2269.
[24] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]∥2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, 2016: 770-778.
[25] Chen X T, Chen X J, Zha Z J. Structure-aware residual pyramid network for monocular depth estimation[C]∥28th International Joint Conference on Artificial Intelligence (IJCAI), Macau, China, 2019: 694-700.
[26] Geiger A, Lenz P, Stiller C, et al. Vision meets robotics: the KITTI dataset[J]. The International Journal of Robotics Research, 2013, 32(11): 1231-1237.
[27] Pleiss G, Chen D, Huang G, et al. Memory-efficient implementation of DenseNets[J/OL]. (2017-07-21)[2023-06-13].
[28] Mehta I, Sakurikar P, Narayanan P J. Structured adversarial training for unsupervised monocular depth estimation[C]∥2018 International Conference on 3D Vision (3DV), Verona, Italy, 2018: 314-323.
[29] Poggi M, Tosi F, Mattoccia S. Learning monocular depth estimation with unsupervised trinocular assumptions[C]∥2018 International Conference on 3D Vision (3DV), Verona, Italy, 2018: 324-333.
[30] Pillai S, Ambruş R, Gaidon A. SuperDepth: self-supervised, super-resolved monocular depth estimation[C]∥2019 International Conference on Robotics and Automation (ICRA), Montreal, Canada, 2019: 9250-9256.