Rule-Based Method for Unsupervised Part-of-Speech Tagging

Journal of Jilin University (Science Edition)

PENG Tao¹, DAI Yaokang¹, ZHU Fengtong¹, ZHANG Bangzuo², LIU Lu¹, YAN Zhao¹, QIAN Feng¹

1. College of Computer Science and Technology, Jilin University, Changchun 130012, China; 2. School of Computer Science and Information Technology, Northeast Normal University, Changchun 130117, China

Received: 2014-09-24 Online: 2015-09-26 Published: 2015-09-29
Contact: YAN Zhao E-mail: yanzhao@jlu.edu.cn

Abstract:

A rule-based method for unsupervised part-of-speech tagging is proposed. More than 200 English grammar rules are used to build 26 rule functions. An input English sentence is first preprocessed to obtain an initial tag for each word; the rule functions are then applied to each word to produce the final tagged sentence. In experiments on the Brown corpus, the method reaches a tagging accuracy of 93.95%. The results indicate that the rule-based method is feasible and effective, and that it improves both the accuracy and the simplicity of English part-of-speech tagging.

Key words: part-of-speech tagging, rule-based, unsupervised learning, rule function

CLC number: TP181

PENG Tao, DAI Yaokang, ZHU Fengtong, ZHANG Bangzuo, LIU Lu, YAN Zhao, QIAN Feng. Rule-Based Method for Unsupervised Part-of-Speech Tagging[J]. Journal of Jilin University (Science Edition), 2015, 53(05): 956-962.
