Journal of Jilin University (Science Edition), 2023, Vol. 61, Issue (4): 909-914.


Short Text Semantic Similarity Measurement Algorithm Based on Hybrid Machine Learning Model

HAN Kaixu¹, YUAN Shufang²

  1. College of Electronics and Information Engineering, Beibu Gulf University, Qinzhou 535011, Guangxi Zhuang Autonomous Region, China; 2. College of Sciences, Beibu Gulf University, Qinzhou 535011, Guangxi Zhuang Autonomous Region, China
  • Received: 2022-04-15 Online: 2023-07-26 Published: 2023-07-26
  • Corresponding author: YUAN Shufang E-mail: ysf20210605@126.com

Abstract: To improve the accuracy of short text semantic similarity measurement, we designed a measurement algorithm based on a hybrid machine learning model. First, we preprocessed the short texts, built character and word vector models of them with the hybrid machine learning model, and expanded the short-text features. Second, we combined the various measurement features of the short texts and applied dimensionality reduction to them. Finally, we constructed an ensemble learning model to compute the semantic similarity. We tested the method on the “Quora Question Pairs” competition dataset. The results show that the method achieves high accuracy with low logarithmic loss and low mean square error, indicating that its similarity measurements are accurate.
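The abstract gives no implementation details, so the following is only a minimal sketch of two pieces it names: combining several similarity features for a pair of short texts, and scoring predictions with logarithmic loss and mean square error. All function names are hypothetical. Simple token-overlap features stand in for the paper's vector-based features, and the dimensionality reduction step and the ensemble model are not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (assumed preprocessing)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard overlap between two token lists; 0.0 if both are empty."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_features(q1, q2):
    """Combine several illustrative similarity features for one text pair."""
    t1, t2 = tokenize(q1), tokenize(q2)
    return {
        "jaccard": jaccard(t1, t2),
        "cosine": cosine(t1, t2),
        "len_diff": abs(len(t1) - len(t2)),
    }


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary logarithmic loss, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def mean_squared_error(y_true, y_prob):
    """Mean square error between labels and predicted similarity scores."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


if __name__ == "__main__":
    # Toy pairs with made-up labels (1 = same meaning), for illustration only.
    pairs = [
        ("How do I learn Python?", "How can I learn Python?", 1),
        ("What is the capital of France?", "How do I bake bread?", 0),
    ]
    labels = [label for _, _, label in pairs]
    # Placeholder score: the mean of two features, not a trained ensemble.
    scores = []
    for q1, q2, _ in pairs:
        f = pair_features(q1, q2)
        scores.append((f["jaccard"] + f["cosine"]) / 2)
    print("log loss:", log_loss(labels, scores))
    print("MSE:", mean_squared_error(labels, scores))
```

In a fuller pipeline, the feature dictionaries would feed a trained classifier, and the two metric functions would be applied to its predicted probabilities on held-out question pairs.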

Key words: hybrid machine learning model, short text, text segmentation, semantic similarity, Chi-square test, similarity measurement

CLC number: TP391