
• Computer Science •

A Dictionary-Free Word Segmentation Method Based on Suffix Arrays

ZHANG Chang-li, HE Feng-ling, ZUO Wan-li

  1. College of Computer Science and Technology, Jilin University, Changchun 130012
  • Received: 2004-02-25 Online: 2004-10-26 Published: 2004-10-26
  • Corresponding author: ZUO Wan-li

An automatic and dictionary-free Chinese word segmentation method based on suffix arrays

ZHANG Chang-li, HE Feng-ling, ZUO Wan-li   

  1. College of Computer Science and Technology, Jilin University, Changchun 130012
  • Received: 2004-02-25 Online: 2004-10-26 Published: 2004-10-26
  • Contact: ZUO Wan-li

Abstract (Chinese): A dictionary-free word segmentation algorithm based on suffix arrays is proposed. The algorithm obtains the co-occurrence patterns of Chinese characters through a suffix array together with a hash table, and filters candidate words by a confidence measure. Experiments show that, without a dictionary or corpus, the algorithm can quickly and accurately extract medium- and high-frequency words from documents. It is well suited to Chinese information processing tasks that are sensitive to term frequency and demand high computational speed.

Key words (Chinese): Chinese information processing, automatic Chinese word segmentation, suffix array, hash table

Abstract: An automatic, dictionary-free Chinese word segmentation method based on suffix arrays is proposed. Using a suffix array together with a hash table, the co-occurrence patterns of Chinese characters are obtained, and Chinese words are filtered by a confidence measure. Experimental results show that the algorithm extracts medium- and high-frequency lexical items effectively and efficiently, without the help of either a dictionary or a corpus. The method is particularly suitable for Chinese information processing applications that are sensitive to lexical frequency and subject to strict time constraints.
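The pipeline the abstract describes — sort all suffixes of the text, read character co-occurrence counts off contiguous runs in the suffix array, then keep frequent n-grams that pass a confidence threshold — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the confidence measure used here, freq(ab) / min(freq(a), freq(b)), is a hypothetical stand-in (the paper's own definition is not reproduced on this page), and the sample text, thresholds, and function names are all illustrative.

```python
def suffix_array(text):
    """Naive suffix-array construction: indices of all suffixes of `text`
    in lexicographic order. O(n^2 log n) worst case; fine for a sketch."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def ngram_counts(text, sa, n):
    """Count n-grams with one pass over the suffix array: suffixes that
    share the same length-n prefix are adjacent in sorted order, so each
    distinct n-gram shows up as one contiguous run."""
    counts = {}           # hash table: n-gram -> frequency
    prev, run = None, 0
    for i in sa:
        if i + n > len(text):     # suffix shorter than n: skip
            continue
        g = text[i:i + n]
        if g == prev:
            run += 1
        else:
            if prev is not None:
                counts[prev] = run
            prev, run = g, 1
    if prev is not None:
        counts[prev] = run
    return counts

def extract_words(text, min_freq=2, min_conf=0.5):
    """Extract two-character candidate words: frequent bigrams whose
    confidence exceeds a threshold. The confidence formula below is a
    stand-in, not the paper's measure."""
    sa = suffix_array(text)
    uni = ngram_counts(text, sa, 1)
    bi = ngram_counts(text, sa, 2)
    words = {}
    for g, f in bi.items():
        if f >= min_freq:
            conf = f / min(uni[g[0]], uni[g[1]])
            if conf >= min_conf:
                words[g] = (f, round(conf, 2))
    return words

sample = "中文分词是中文信息处理的基础, 中文分词方法很多."
print(extract_words(sample))
```

Note that a simple frequency/confidence filter of this kind also admits straddling bigrams such as "文分" here; the paper's actual filtering is what separates genuine words from such fragments.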

Key words: Chinese information processing, automatic Chinese word segmentation, suffix array, hash table

CLC number:

  • TP391.12