Research on a Tibetan-Driven Visual Speech Synthesis Algorithm Based on Audio Matching

Journal of Jilin University (Information Science Edition) ›› 2024, Vol. 42 ›› Issue (3): 509-515.

HAN Xi, LIANG Kai, YUE Yu

Ganzi Prefecture Science and Technology Information Research Institute, Kangding 626000, Sichuan, China

Received: 2023-04-24 Online: 2024-06-18 Published: 2024-06-17
Corresponding author: YUE Yu (born 1970), male (Tibetan), from Kangding, Sichuan; researcher at the Ganzi Prefecture Science and Technology Information Research Institute; research interests: science and technology information and Tibetan-language informatization. Tel: 86-18048092050; E-mail: kd_yueyu@hotmail.com
About the author: HAN Xi (born 1981), male (Tibetan), from Kangding, Sichuan; senior engineer at the Ganzi Prefecture Science and Technology Information Research Institute; research interests: information engineering and Tibetan-language informatization. Tel: 86-18048092050; E-mail: mywest@163.com
Funding:
Sichuan Provincial Science and Technology Program (2021YFG0138)

Abstract: To address low lip contour detection accuracy and poor visual speech synthesis quality, a Tibetan-driven visual speech synthesis algorithm based on audio matching is proposed. The algorithm extracts the short-time energy and short-time zero-crossing rate from the Tibetan-driven visual speech signal and establishes the short-time autocorrelation function of the speech signal. First, feature information is extracted from the speech signal to obtain the pitch trajectory of the Tibetan speech signal, which serves as the audio feature. Second, a spatiotemporal analysis model of the lips is established to analyze how the lip contour changes during pronunciation, and lip contour features are extracted by principal component analysis. Finally, the correlation between the audio features and the lip contour features is obtained through an input-output hidden Markov model, and Tibetan-driven visual speech is synthesized on the basis of audio matching. Experimental results show that the proposed method achieves high lip contour detection accuracy and good visual speech synthesis quality.
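The audio-feature step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' formulas, frame sizes, or thresholds, so everything below is an assumption made for illustration: the function names, the 40 ms frame and 10 ms hop, the 60 to 400 Hz pitch search range, and the simple energy/zero-crossing voicing rule are all hypothetical choices, not the paper's method.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_energy(frames):
    """Sum of squared samples in each frame."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.where(frames >= 0, 1, -1)
    return np.sum(np.abs(np.diff(signs, axis=1)), axis=1) / (2.0 * frames.shape[1])

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch (Hz) from the peak of the short-time autocorrelation."""
    frame = frame - np.mean(frame)
    # Keep non-negative lags only: r[k] = sum_n frame[n] * frame[n + k]
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(r) - 1)
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    return fs / lag

def pitch_track(x, fs, frame_len=640, hop=160, energy_ratio=0.1, zcr_max=0.3):
    """Per-frame pitch in Hz; frames judged unvoiced are reported as 0."""
    frames = frame_signal(x, frame_len, hop)
    energy = short_time_energy(frames)
    zcr = zero_crossing_rate(frames)
    # Assumed voicing rule: relatively high energy and low zero-crossing rate
    voiced = (energy > energy_ratio * energy.max()) & (zcr < zcr_max)
    f0 = np.zeros(len(frames))
    for i in np.flatnonzero(voiced):
        f0[i] = autocorr_pitch(frames[i], fs)
    return f0
```

Calling `pitch_track(x, fs)` on a mono signal `x` sampled at `fs` Hz returns one pitch value per frame. The autocorrelation peak is picked at integer lags, so the estimate is quantized; a practical system would likely add interpolation and smoothing of the trajectory.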

Key words: audio matching, short-time autocorrelation function, spatiotemporal analysis model, principal component analysis, visual speech synthesis
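The principal component analysis step named in the abstract and key words can be sketched as follows. This is only an illustration under stated assumptions: each video frame is assumed to yield a lip contour as a fixed number of (x, y) points flattened into one row, and the helper names `pca_fit`, `pca_transform`, and `pca_reconstruct` are hypothetical. The abstract does not say how the authors parameterize the contour or how many components they keep.

```python
import numpy as np

def pca_fit(contours, n_components):
    """Fit PCA to lip contours.

    contours: array of shape (n_frames, 2 * n_points), one flattened
    contour per row (assumed layout: all x values, then all y values).
    Returns the mean contour, the principal axes, and the fraction of
    variance explained by each kept axis.
    """
    mean = contours.mean(axis=0)
    centered = contours - mean
    # Rows of vt are the principal axes, ordered by decreasing variance
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return mean, vt[:n_components], explained[:n_components]

def pca_transform(contours, mean, components):
    """Project contours onto the principal axes to get low-dimensional features."""
    return (contours - mean) @ components.T

def pca_reconstruct(features, mean, components):
    """Map low-dimensional features back to approximate contours."""
    return features @ components + mean
```

The low-dimensional rows returned by `pca_transform` would play the role of the lip contour features that a sequence model (an input-output hidden Markov model in the paper) relates to the audio features.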

CLC number: TP391.42

HAN Xi, LIANG Kai, YUE Yu. Research on Tibetan Driven Visual Speech Synthesis Algorithm Based on Audio Matching[J]. Journal of Jilin University (Information Science Edition), 2024, 42(3): 509-515.
