Journal of Jilin University (Engineering and Technology Edition) ›› 2025, Vol. 55 ›› Issue (8): 2732-2740. doi: 10.13229/j.cnki.jdxbgxb.20240051


Text-guided face image inpainting

Jing LIAN1, Ji-bao ZHANG1, Ji-zhao LIU2, Jia-jun ZHANG1, Zi-long DONG1

1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730030, China
2. School of Information Science and Engineering, Lanzhou University, Lanzhou 730030, China

Received: 2024-01-15  Online: 2025-08-01  Published: 2025-11-14

Abstract:

A text-guided face image inpainting method is proposed to address the structural distortion, texture blurring, and lack of controllability in current face inpainting methods. The method reconstructs the missing regions of an image by fusing image features with the corresponding text features. During network training, a visual-textual modal fusion module is designed to associate image and text features, so that the reconstruction of missing facial regions is driven not only by the visual semantics visible in the image but also by the rich semantics of the guiding text. An attention-aware layer is added between the encoded and decoded features to improve the appearance consistency between visible and generated regions. Experimental results on the CelebA-HQ face dataset show that the proposed method obtains inpainting results that are more natural in texture and structure and more consistent with the textual semantics, and that its visual quality and evaluation metrics surpass those of the comparison algorithms.
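The text branch begins with BERT [11]. Below is a minimal sketch of the text-encoding step, assuming the HuggingFace bert-base-uncased checkpoint; the 768-to-512 projection is a placeholder standing in for whatever mapping produces the 4 x 512 text feature listed in Table 1, which is not specified on this page.

```python
# Minimal sketch of the text-encoding step; checkpoint choice and the
# 768 -> 512 projection are assumptions, not the authors' exact setup.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

caption = "The woman has wavy brown hair and is smiling."
tokens = tokenizer(caption, return_tensors="pt")
with torch.no_grad():
    text_feat = bert(**tokens).last_hidden_state    # 1 x seq_len x 768
text_feat = torch.nn.Linear(768, 512)(text_feat)    # 1 x seq_len x 512
```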

Key words: image inpainting, textual guidance, cross-modal fusion, deep learning

CLC Number: TP391

Fig.1

Overall architecture of proposed method

Table 1

Network structure

Input | Network layer | Output size | Channels | Kernel | Stride | Padding | Normalization | Activation
Text description | BERT | 4 | 512 | None | None | None | None | None
T | T-decoding0 | 8 | 512 | 7 | 2 | 3 | BN | ReLU
T_Dec0 | T-decoding1 | 16 | 512 | 5 | 2 | 2 | BN | ReLU
T_Dec1 | T-decoding2 | 32 | 128 | 3 | 2 | 1 | BN | ReLU
T_Dec2 | T-decoding3 | 64 | 128 | 3 | 2 | 1 | BN | ReLU
T_Dec3 | T-decoding4 | 128 | 64 | 3 | 2 | 1 | BN | ReLU
T_Dec4 | T-decoding5 | 256 | 3 | 3 | 2 | 1 | BN | ReLU
I_mask, I_m | I-encoding0 | 128 | 64 | 3 | 2 | 1 | BN | ReLU
I_Enc0 | I-encoding1 | 64 | 128 | 3 | 2 | 1 | BN | ReLU
I_Enc1 | I-encoding2 | 32 | 128 | 3 | 2 | 1 | BN | ReLU
I_Enc2 | I-encoding3 | 16 | 256 | 3 | 2 | 1 | BN | ReLU
I_Enc3 | I-encoding4 | 8 | 512 | 3 | 2 | 1 | BN | ReLU
I_Enc4 | I-encoding5 | 4 | 512 | 3 | 2 | 1 | BN | ReLU
I_Enc5 | I-decoding0 | 8 | 512 | 3 | 2 | 1 | BN | ReLU
I_Enc5, I_Dec0, T_Dec0 | V-TMF0 | 8 | 512 | None | None | None | BN | ReLU
I_V-TMF0 | I-decoding1 | 16 | 512 | 3 | 2 | 1 | BN | ReLU
I_Enc4, I_Dec1, T_Dec1 | V-TMF1 | 16 | 512 | None | None | None | BN | ReLU
I_V-TMF1 | I-decoding2 | 32 | 128 | 3 | 2 | 1 | BN | ReLU
I_Enc3, I_Dec2, T_Dec2 | V-TMF2 | 32 | 128 | None | None | None | BN | ReLU
I_V-TMF2 | I-decoding3 | 64 | 128 | 3 | 2 | 1 | BN | ReLU
I_Enc2, I_Dec3, T_Dec3 | V-TMF3 | 64 | 128 | None | None | None | BN | ReLU
I_V-TMF3 | I-decoding4 | 128 | 64 | 3 | 2 | 1 | BN | ReLU
I_Enc1, I_Dec4, T_Dec4 | V-TMF4 | 128 | 64 | None | None | None | BN | ReLU
I_V-TMF4 | I-decoding5 | 256 | 3 | 3 | 2 | 1 | BN | ReLU
I_Enc0, I_Dec5, T_Dec5 | V-TMF5 | 256 | 3 | None | None | None | None | None
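The convolutional settings in Table 1 fully determine the encoder and decoder stages, so they can be written down directly. The following PyTorch sketch instantiates the image encoder and the text decoder with those shapes; the block and variable names are placeholders, and the 4-channel input (masked RGB image plus binary mask) is an assumption.

```python
# PyTorch sketch of the Table 1 stages; names and the 3+1-channel input
# (masked RGB image plus binary mask) are assumptions.
import torch
import torch.nn as nn

def enc_block(c_in, c_out, k=3, s=2, p=1):
    # Conv + BN + ReLU; stride 2 halves the spatial size
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, stride=s, padding=p),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def dec_block(c_in, c_out, k=3, s=2, p=1):
    # Transposed conv + BN + ReLU; stride 2 doubles the spatial size
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, k, stride=s,
                                            padding=p, output_padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

# I-encoding0..5: 256x256 masked image down to a 4x4x512 code
image_encoder = nn.Sequential(
    enc_block(4, 64), enc_block(64, 128), enc_block(128, 128),
    enc_block(128, 256), enc_block(256, 512), enc_block(512, 512))

# T-decoding0..5: the 4x4x512 text feature upsampled stage by stage so it
# can be fused with the image decoder at every scale
text_decoder = nn.Sequential(
    dec_block(512, 512, k=7, p=3), dec_block(512, 512, k=5, p=2),
    dec_block(512, 128), dec_block(128, 128),
    dec_block(128, 64), dec_block(64, 3))

print(image_encoder(torch.randn(1, 4, 256, 256)).shape)  # [1, 512, 4, 4]
```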

Fig.2

Visual-textual modal fusion module
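Fig. 2's internals are not reproduced on this page; Table 1 constrains only the module's input/output shapes (an encoder skip feature, the current image-decoder feature, and the same-scale text feature in, one fused feature out, followed by BN and ReLU). The sketch below is therefore only one plausible reading, with a hypothetical concatenation-plus-channel-gating scheme, and assumes the three inputs have already been brought to a common resolution.

```python
# Hypothetical fusion module: Table 1 fixes only the input/output shapes of
# V-TMF0..5, so the concatenation + channel gating here is a guess.
import torch
import torch.nn as nn

class VTMF(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(3 * channels, channels, 1),
                                  nn.Sigmoid())
        self.out = nn.Sequential(nn.BatchNorm2d(channels),
                                 nn.ReLU(inplace=True))

    def forward(self, f_enc, f_dec, f_txt):
        # assumes the three features share one H x W resolution
        cat = torch.cat([f_enc, f_dec, f_txt], dim=1)
        return self.out(self.proj(cat) * self.gate(cat))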

Fig.3

Structure of the multi-attention model

Fig.4

Attention-aware layer
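The attention-aware layer sits between the encoded and decoded features so that generated regions can borrow appearance from visible ones. A minimal sketch of that idea, written as standard scaled dot-product attention with decoder queries over encoder keys and values (the exact formulation in Fig. 4 may differ):

```python
# Sketch of attention between decoded (queries) and encoded (keys/values)
# features; the paper's exact layer in Fig. 4 may differ.
import torch
import torch.nn as nn

class AttentionAware(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, f_dec, f_enc):
        b, c, h, w = f_dec.shape
        q = self.q(f_dec).flatten(2).transpose(1, 2)   # B x HW x C//8
        k = self.k(f_enc).flatten(2)                   # B x C//8 x HW
        v = self.v(f_enc).flatten(2).transpose(1, 2)   # B x HW x C
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return f_dec + out                             # residual connection
```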

Fig.5

Comparison of test results of different image inpainting methods

Table 2

Quantitative results of different image inpainting methods under different mask area ratios

Metric | Method | 1%~10% | 10%~20% | 20%~30% | 30%~40% | 40%~50% | 50%~60%
L1 | RFR | 2.124 | 1.458 | 1.335 | 3.968 | 1.726 | 2.241
 | RePaint | 0.325 | 0.412 | 0.358 | 0.474 | 2.215 | 1.564
 | CTSDG | 0.094 | 0.143 | 0.422 | 0.335 | 0.726 | 1.102
 | MIGT | 0.096 | 0.122 | 0.422 | 0.417 | 0.624 | 1.934
 | TDANet | 0.104 | 0.139 | 0.387 | 0.433 | 0.643 | 1.024
 | Ours | 0.085 | 0.104 | 0.315 | 0.214 | 0.534 | 0.987
PSNR↑ | RFR | 14.13 | 13.82 | 17.52 | 13.21 | 16.74 | 22.23
 | RePaint | 29.54 | 29.14 | 25.17 | 25.44 | 23.86 | 22.64
 | CTSDG | 32.51 | 31.24 | 26.24 | 27.14 | 24.47 | 24.58
 | MIGT | 32.45 | 31.78 | 28.53 | 28.36 | 24.89 | 24.84
 | TDANet | 32.38 | 31.27 | 28.84 | 28.17 | 25.01 | 24.76
 | Ours | 33.21 | 32.94 | 29.04 | 29.75 | 25.61 | 25.10
SSIM↑ | RFR | 0.394 | 0.521 | 0.721 | 0.723 | 0.675 | 0.795
 | RePaint | 0.951 | 0.924 | 0.833 | 0.810 | 0.748 | 0.614
 | CTSDG | 0.967 | 0.951 | 0.838 | 0.854 | 0.825 | 0.787
 | MIGT | 0.847 | 0.825 | 0.813 | 0.800 | 0.821 | 0.794
 | TDANet | 0.935 | 0.871 | 0.811 | 0.794 | 0.857 | 0.824
 | Ours | 0.981 | 0.964 | 0.892 | 0.874 | 0.864 | 0.840
LPIPS↓ | RFR | 0.464 | 0.247 | 0.178 | 0.194 | 0.124 | 0.126
 | RePaint | 0.033 | 0.037 | 0.069 | 0.078 | 0.101 | 0.151
 | CTSDG | 0.021 | 0.029 | 0.048 | 0.051 | 0.087 | 0.132
 | MIGT | 0.025 | 0.028 | 0.057 | 0.063 | 0.087 | 0.121
 | TDANet | 0.028 | 0.033 | 0.047 | 0.054 | 0.092 | 0.146
 | Ours | 0.017 | 0.020 | 0.041 | 0.046 | 0.074 | 0.103
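For reference, the four metrics in Table 2 can be computed with standard tooling. The snippet below is a conventional evaluation sketch using scikit-image and the lpips package, not the authors' script; in particular, how the paper scales its L1 values is not stated on this page.

```python
# Conventional computation of the Table 2 metrics (not the authors' script).
import numpy as np
import torch
import lpips                                      # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')                # perceptual distance network

def evaluate(pred, gt):
    """pred, gt: HxWx3 float arrays in [0, 1]; returns (L1, PSNR, SSIM, LPIPS)."""
    l1 = np.abs(pred - gt).mean()                                  # lower is better
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)       # higher is better
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    d = lpips_fn(to_t(pred), to_t(gt)).item()                      # lower is better
    return l1, psnr, ssim, d
```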

Table 3

Quantitative results of the ablation experiments

a | b | L1 | PSNR↑ | SSIM↑ | LPIPS↓
  |   | 0.724 | 28.337 | 0.868 | 0.047
  |   | 0.476 | 29.576 | 0.970 | 0.036
  |   | 0.287 | 32.214 | 1.316 | 0.017

Fig.6

Reconstructing an image with specified text

[1] Zhou Da-ke, Zhang Chao, Yang Xin. Self-supervised 3D face reconstruction based on multi-scale feature fusion and dual attention mechanism[J]. Journal of Jilin University (Engineering and Technology Edition), 2022, 52(10): 2428-2437.
[2] Wang Xiao-yu, Hu Xin-hao, Han Chang-lin. Face pencil drawing algorithms based on generative adversarial network[J]. Journal of Jilin University (Engineering and Technology Edition), 2021, 51(1): 285-292.
[3] Pathak D, Krahenbuhl P, Donahue J, et al. Context encoders: feature learning by inpainting[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 2536-2544.
[4] Iizuka S, Simo-Serra E, Ishikawa H. Globally and locally consistent image completion[J]. ACM Transactions on Graphics (ToG), 2017, 36(4): 1-14.
[5] Yan Z, Li X, Li M, et al. Shift-net: image inpainting via deep feature rearrangement[C]∥Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 2018: 1-17.
[6] Liu H, Wan Z, Huang W, et al. Pd-gan: probabilistic diverse gan for image inpainting[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, 2021: 9371-9381.
[7] Wan Z, Zhang J, Chen D, et al. High-fidelity pluralistic image completion with transformers[C]∥Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 4692-4701.
[8] Li W, Lin Z, Zhou K, et al. Mat: mask-aware transformer for large hole image inpainting[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 10758-10768.
[9] Huang W, Deng Y, Hui S, et al. Sparse self-attention transformer for image inpainting[J]. Pattern Recognition, 2024, 145: 109897.
[10] Lian J, Zhang J, Liu J, et al. Guiding image inpainting via structure and texture features with dual encoder[J]. The Visual Computer, 2024, 40: 4303-4317.
[11] Devlin J, Chang M W, Lee K, et al. Bert: pre-training of deep bidirectional transformers for language understanding[J/OL]. [2023-12-16]. arXiv preprint arXiv:1810.04805.
[12] Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution[C]∥Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 2016: 694-711.
[13] Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115: 211-252.
[14] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J/OL]. [2023-12-17]. arXiv preprint arXiv:1409.1556.
[15] Mao X, Li Q, Xie H, et al. Least squares generative adversarial networks[C]∥Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017: 2794-2802.
[16] Gatys L A, Ecker A S, Bethge M. Image style transfer using convolutional neural networks[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 2414-2423.
[17] Liu G, Reda F A, Shih K J, et al. Image inpainting for irregular holes using partial convolutions[C]∥Proceedings of the European Conference on Computer Vision, Munich, Germany, 2018: 85-100.
[18] Li J, Wang N, Zhang L, et al. Recurrent feature reasoning for image inpainting[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 7760-7768.
[19] Lugmayr A, Danelljan M, Romero A, et al. Repaint: inpainting using denoising diffusion probabilistic models[C]∥Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 11461-11471.
[20] Chen L, Yuan C, Qin X, et al. Contrastive structure and texture fusion for image inpainting[J]. Neurocomputing, 2023, 536: 1-12.
[21] Li A, Zhao L, Zuo Z, et al. MIGT: multi-modal image inpainting guided with text[J]. Neurocomputing, 2023, 520: 376-385.
[22] Zhang L, Chen Q, Hu B, et al. Text-guided neural image inpainting[C]∥Proceedings of the 28th ACM International Conference on Multimedia, New York, USA, 2020: 1302-1310.