Journal of Jilin University (Science Edition), 2026, Vol. 64, Issue 2: 394-402.


Low-Resource Speech Keyword Spotting Based on Joint Knowledge Transfer

HUANG Jinxin1, HE Qianhua1, ZHENG Ruowei1, YANG Mingru1, WANG Wenwu2

  1. School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510641, China;
  2. Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, UK
  • Received: 2024-08-14 Online: 2026-03-26 Published: 2026-03-26
  • Corresponding author: HE Qianhua E-mail: eeqhhe@scut.edu.cn

Abstract: To address the low accuracy of speech keyword spotting under low-resource conditions, we propose a detection method that combines unsupervised feature extraction with supervised model parameter transfer. First, a deep feature extraction network is trained on large-scale unlabeled speech data, and the extracted features are fused with acoustic spectrogram features to make the features more robust to the acoustic environment. Second, the decision network is pre-trained on abundant labeled data from the source domain, and decision knowledge is introduced through parameter transfer, which addresses the convergence difficulty caused by insufficient training data in the target domain. Finally, the entire network is fine-tuned with a very small amount of target-domain data. Experimental results on Hakka and Cantonese datasets show that the method significantly outperforms single transfer strategies. In the Hakka task, the false rejection rate is reduced to 11.77%, and the maximum term-weighted value (MTWV) is improved to 0.7346. These results demonstrate that the proposed method effectively alleviates data scarcity and significantly improves detection performance for low-resource languages.

Key words: speech keyword spotting, deep learning, low-resource, joint knowledge transfer
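
The abstract reports two metrics, the false rejection rate and the maximum term-weighted value. The sketch below shows one common way to compute them. It assumes the NIST STD 2006 definition of term-weighted value with weight 999.9 and one false-alarm trial per second of speech; the abstract does not state which definition the authors used, and all function names and sample data here are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: weight from the NIST STD 2006 evaluation plan, not from the paper.
BETA = 999.9

# keyword -> list of (detection score, whether the detection is correct)
Detections = Dict[str, List[Tuple[float, bool]]]


def false_rejection_rate(n_missed: int, n_true: int) -> float:
    """Fraction of true keyword occurrences that the detector rejected."""
    return n_missed / n_true


def twv(detections: Detections, n_true: Dict[str, int],
        speech_seconds: float, threshold: float, beta: float = BETA) -> float:
    """Term-weighted value at a single decision threshold."""
    total = 0.0
    n_terms = 0
    for keyword, n_ref in n_true.items():
        if n_ref == 0:
            continue  # keywords with no reference occurrence are skipped
        kept = [ok for score, ok in detections.get(keyword, [])
                if score >= threshold]
        n_correct = sum(kept)
        n_false_alarm = len(kept) - n_correct
        p_miss = 1.0 - n_correct / n_ref
        # Assumption: one non-target trial per second of speech.
        p_false_alarm = n_false_alarm / (speech_seconds - n_ref)
        total += p_miss + beta * p_false_alarm
        n_terms += 1
    return 1.0 - total / n_terms


def mtwv(detections: Detections, n_true: Dict[str, int],
         speech_seconds: float) -> float:
    """Maximum TWV over all thresholds that occur as detection scores."""
    thresholds = sorted({s for hits in detections.values() for s, _ in hits})
    return max(twv(detections, n_true, speech_seconds, t) for t in thresholds)


if __name__ == "__main__":
    # Hypothetical toy data: two keywords, one hour of speech.
    sample = {
        "a": [(0.9, True), (0.8, True), (0.3, False)],
        "b": [(0.7, True), (0.2, False)],
    }
    reference_counts = {"a": 2, "b": 2}
    print(false_rejection_rate(1, 4))
    print(mtwv(sample, reference_counts, 3600.0))
```

Because the false-alarm term carries a large weight, lowering the threshold to recover more true occurrences only pays off while false alarms stay rare, which is why the maximum is taken over thresholds.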

CLC number: TP391