Journal of Jilin University (Engineering and Technology Edition), 2025, Vol. 55, Issue (3): 1001-1008. doi: 10.13229/j.cnki.jdxbgxb.20240121

   

Automatic extraction of Chinese text topic sentences based on TextRank algorithm and similarity

Hai-lan DING1, Kun-yu QI2

  1. School of Chinese Language and Literature, Lanzhou Jiaotong University, Lanzhou 730070, China
  2. China Institute of Information Technology for Nationalities, Northwest Minzu University, Lanzhou 730030, China
  • Received:2024-01-30 Online:2025-03-01 Published:2025-05-20

Abstract:

To address the complex semantic rules and latent topic structures of different fields, which lead to poor scalability and portability of topic sentence generation, low similarity between Chinese text topic sentences, and high redundancy in topic sentence extraction from Chinese text, a method for automatically extracting Chinese text topic sentences based on the TextRank algorithm and similarity is proposed. A BiLSTM model first segments continuous Chinese text into individual words. The mutual information method is then used for feature selection: feature values are calculated to extract the features most relevant to and representative of the task, such as keywords and clue words, which serve as the main clues and basis for topic sentence extraction. Finally, topic sentences are extracted automatically on the basis of the TextRank algorithm and similarity, taking several weights and weight coefficients into account. The experimental results show that the proposed method extracts Chinese text topic sentences with low redundancy, that it performs well in document completeness, scalability, and portability, and that its ROUGE-1, ROUGE-2, and ROUGE-L scores are all high. The method therefore ensures effective automatic extraction of Chinese text topic sentences and is highly applicable in practice.
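
The ranking step can be illustrated with a minimal sketch. The abstract does not give the paper's actual weights, weight coefficients, or similarity measure, so the code below is an assumption throughout: it uses plain bag-of-words cosine similarity, standard weighted TextRank with a damping factor of 0.85, and a simple similarity threshold to limit redundancy. The function names (`cosine`, `textrank`, `extract_topic_sentences`) and the parameters `k` and `max_sim` are hypothetical, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def textrank(sim, d=0.85, iters=100, tol=1e-6):
    """Weighted TextRank scores for a symmetric sentence-similarity matrix."""
    n = len(sim)
    out_weight = [sum(row) for row in sim]  # total edge weight leaving each node
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of edge (j, i).
            rank = sum(sim[j][i] / out_weight[j] * scores[j]
                       for j in range(n) if out_weight[j] > 0)
            new.append((1 - d) / n + d * rank)
        converged = max(abs(x - y) for x, y in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores


def extract_topic_sentences(sentences, k=2, max_sim=0.5):
    """Return up to k high-ranked sentences, skipping near-duplicates.

    `sentences` is a list of token lists (already segmented text).
    `max_sim` is an assumed redundancy threshold, not a value from the paper.
    """
    if not sentences:
        return []
    bags = [Counter(tokens) for tokens in sentences]
    n = len(bags)
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    scores = textrank(sim)
    chosen = []
    for i in sorted(range(n), key=lambda idx: -scores[idx]):
        # Keep a candidate only if it is not too similar to anything chosen.
        if all(sim[i][j] <= max_sim for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return [sentences[i] for i in sorted(chosen)]  # restore document order
```

In this sketch the redundancy filter is a greedy pass over the ranked sentences; the keyword and clue-word weighting described in the abstract would enter as extra factors on the similarity matrix or the final scores, which are not modelled here.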

Key words: Chinese text, TextRank algorithm, mutual information, topic sentence extraction, bidirectional long short-term memory network

CLC Number: 

  • TP301

Fig.1

Automatic extraction process of Chinese text topic sentences

Table 1

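Chinese word segmentation effect of the proposed method

Dataset        Document   Functional completeness   Extensibility    Portability
LCSTS          1          Complete                  Extensible       Portable
               2          Complete                  Extensible       Portable
               3          Complete                  Extensible       Portable
               4          Complete                  Extensible       Portable
               5          Complete                  Extensible       Portable
Sub-THUCNews   6          Complete                  Extensible       Portable
               7          Complete                  Extensible       Portable
               8          Complete                  Extensible       Portable
               9          Complete                  Extensible       Portable
               10         Complete                  Extensible       Portable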

Table 2

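Chinese word segmentation effect of the method in reference [3]

Dataset        Document   Functional completeness   Extensibility    Portability
LCSTS          1          Incomplete                Not extensible   Not portable
               2          Complete                  Extensible       Portable
               3          Incomplete                Not extensible   Not portable
               4          Complete                  Extensible       Not portable
               5          Complete                  Not extensible   Portable
Sub-THUCNews   6          Incomplete                Extensible       Portable
               7          Complete                  Extensible       Not portable
               8          Incomplete                Not extensible   Portable
               9          Incomplete                Not extensible   Not portable
               10         Complete                  Not extensible   Portable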

Table 3

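Chinese word segmentation effect of the method in reference [4]

Dataset        Document   Functional completeness   Extensibility    Portability
LCSTS          1          Complete                  Extensible       Portable
               2          Complete                  Extensible       Portable
               3          Incomplete                Not extensible   Not portable
               4          Incomplete                Extensible       Not portable
               5          Complete                  Not extensible   Portable
Sub-THUCNews   6          Complete                  Not extensible   Portable
               7          Complete                  Extensible       Not portable
               8          Incomplete                Extensible       Not portable
               9          Incomplete                Not extensible   Not portable
               10         Incomplete                Not extensible   Portable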

Fig.2

Redundancy comparison

Fig.3

Comparison of ROUGE indicators

Fig.4

Keyword extraction accuracy for different numbers of topics

1 Chen Meng-tong, Gu Xiao-yan, Liu Tian-tian. The method of key sentence extraction based on improved TextRank[J]. Journal of Zhengzhou University (Natural Science Edition), 2023, 55(1): 15-20.
2 Ruan Guang-ce, Huang Yun-ying. Topic recognition of comment text based on Sentence-BERT and LDA[J]. Journal of Modern Information, 2023, 43(5): 46-53.
3 Ge Bin, He Chun-hui, Huang Hong-bin. PGN text topic sentence generation method based on key information[J]. Computer Engineering and Design, 2022, 43(6): 1601-1608.
4 Jing Yu, Wang Ming-yang, Zhou Wen-yuan. Research on text abstract extraction algorithm based on BBCM-TextRank[J]. Journal of Northeast Normal University (Natural Science Edition), 2022, 54(3): 67-75.
5 Sun J, Li Y, Shen Y, et al. Selection gate-based networks for semantic relation extraction[J]. International Journal of Embedded Systems, 2021, 14(3): 211.
6 Xiong A, Liu D, Tian H, et al. News keyword extraction algorithm based on semantic clustering and word graph model[J]. Tsinghua Science and Technology, 2021, 26(6): 886-893.
7 Ning B, Zhao D, Liu X, et al. EAGS: an extracting auxiliary knowledge graph model in multi-turn dialogue generation[J]. World Wide Web, 2023, 26(4): 1545-1566.
8 Taufiq U, Pulungan R, Suyanto Y. Named entity recognition and dependency parsing for better concept extraction in summary obfuscation detection[J]. Expert Systems with Applications, 2023, 217(5): No.119579.
9 Zhang Jun, Lai Zhi-peng, Li Xue, et al. Cross-domain Chinese word segmentation based on new word discovery[J]. Journal of Electronics & Information Technology, 2022, 44(9): 3241-3248.
10 Hao Yong-bin, Zhou Lan-jiang, Liu Chang. An end-to-end multi-task method for Laotian word segmentation via LSTM[J]. Journal of Chinese Information Processing, 2021, 35(9): 75-81.
11 Xu Jiu-cheng, Meng Xiang-ru, Qu Kang-lin, et al. Feature selection method based on fuzzy neighborhood relative dependency mutual information[J]. Fuzzy Systems and Mathematics, 2023, 37(1): 121-135.
12 Sun Lin, Shi En-hui, Si Shan-shan, et al. Weak label feature selection method based on AP clustering and mutual information[J]. Journal of Nanjing Normal University (Natural Science Edition), 2022, 45(3): 108-115.
13 Zhao Zhan-fang, Liu Peng-peng, Li Xue-shan. Keywords extraction algorithm of railway literature based on improved TextRank[J]. Journal of Beijing Jiaotong University, 2021, 45(2): 80-86.
14 Ye Zi-cheng, Yan Gui-ying. Study on keyword extraction algorithm based on graphical model[J]. Journal of Systems Science and Mathematical Sciences, 2021, 41(4): 967-975.
15 Sun Xu, Shen Bin, Yan Xin, et al. Microblog opinion summarization method based on Transformer and TextRank[J]. Journal of Guangxi Normal University (Natural Science Edition), 2023, 41(4): 96-108.