A Semantic Similarity Computing Approach Based on WordNet and Corpus Statistics

J4 ›› 2010, Vol. 48 ›› Issue (05): 811-816.

ZHANG Dongna, ZHOU Chunguang, LIU Yanbin, GUO Dongwei

College of Computer Science and Technology, Jilin University, Changchun 130012, China

Received: 2009-12-15 Online: 2010-09-26 Published: 2010-09-21
Contact: GUO Dongwei E-mail: guodw@jlu.edu.cn

Abstract

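Abstract (translated from the Chinese):

We propose a new method, based on WordNet and a text corpus, for computing the semantic parameter information content (IC). The method jointly considers a concept's semantic information in WordNet and its probability in the data set, i.e. the concept's self-information, and uses a new parameter to account for the information that a concept pair shares in WordNet. On this basis we design a general-purpose method for computing the semantic similarity of concepts. The method simplifies traditional semantic similarity algorithms and addresses related problems in semantic similarity computation; it can be applied to information extraction, information retrieval, document classification, and ontology learning. Experiments on the domain-general R&B data set show that the method is effective for computing semantic similarity.

Key words: semantic similarity; Brown corpus; IC model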

Abstract:

We first propose a new method for calculating information content (IC), a parameter used in semantic similarity. The new algorithm is based on a concept's semantic information in the WordNet knowledge base and on its probability in a corpus, known as self-information. Then, because existing algorithms are all domain-related and their calculations are complicated, we propose a universal method for calculating semantic similarity, based on corpus statistics and WordNet, that can be used in information extraction, information retrieval, document clustering, and ontology learning. In experiments on the benchmark R&B concept-pair data set, the proposed method yields a substantial improvement.

Key words: semantic similarity of concepts, Brown corpus, information content method
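
The abstract does not state its formulas, so the following is only a minimal sketch of the classic corpus-based IC idea it builds on: self-information IC(c) = -log p(c), combined here with Lin's ratio as one well-known IC-based similarity. It is not the authors' method. The toy taxonomy, the counts, and all function names are hypothetical stand-ins for WordNet and the Brown corpus.

```python
import math

# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet hypernym links.
PARENT = {
    "car": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "entity",
    "fork": "utensil",
    "utensil": "entity",
    "entity": None,
}

# Hypothetical corpus counts of each concept's own occurrences.
COUNTS = {"car": 40, "bicycle": 10, "vehicle": 5,
          "fork": 15, "utensil": 5, "entity": 5}
ROOT = "entity"


def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def cumulative_counts():
    """Propagate counts upward: each occurrence also counts for every ancestor."""
    totals = {c: 0 for c in PARENT}
    for concept, n in COUNTS.items():
        for node in ancestors(concept):
            totals[node] += n
    return totals


def information_content(concept, totals):
    """Self-information: log(N / count(c)), which equals -log p(c)."""
    return math.log(totals[ROOT] / totals[concept])


def lowest_common_subsumer(a, b):
    """First ancestor of a (starting at a itself) that also subsumes b."""
    b_chain = set(ancestors(b))
    return next(node for node in ancestors(a) if node in b_chain)


def lin_similarity(a, b, totals):
    """Lin's measure: 2 * IC(lcs) / (IC(a) + IC(b))."""
    if a == b:
        return 1.0
    denom = information_content(a, totals) + information_content(b, totals)
    if denom == 0:
        return 0.0  # neither concept carries any information
    lcs = lowest_common_subsumer(a, b)
    return 2 * information_content(lcs, totals) / denom
```

With these toy counts, `lin_similarity("car", "bicycle", cumulative_counts())` is positive because the two concepts share the informative subsumer `vehicle`, whereas `car` and `fork` meet only at the root, whose IC is zero.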

CLC number:

TP391.1

ZHANG Dong-Na, ZHOU Chun-Guang, LIU Yan-Bin, GUO Dong-Wei. A Semantic Similarity Computing Approach Based on WordNet and Corpus Statistics[J]. J4, 2010, 48(05): 811-816.
