Journal of Jilin University (Engineering and Technology Edition), 2024, Vol. 54, Issue (12): 3577-3588. doi: 10.13229/j.cnki.jdxbgxb.20230098

Tibetan text normalization method

Dondrub LHAKPA1,2, Duoji ZHAXI1, Jie ZHU1,2

  1. School of Information Science and Technology, Tibet University, Lhasa 850000, China
  2. Tibet Informatization Collaborative Innovation Center Jointly Built by the Province and the Ministry, Lhasa 850000, China
  • Received:2023-02-03 Online:2024-12-01 Published:2025-01-24
  • Contact: Jie ZHU E-mail:zangye@163.com;790139756@qq.com

Abstract:

Modern Tibetan text contains complex, nonstandard representations that degrade the performance of speech synthesis systems. To address this, the paper proposes a Tibetan text normalization method that is easy to maintain and extend. First, the different forms that Tibetan marker symbols and non-Tibetan special symbols from other languages take in Tibetan text are analyzed in depth, and the special symbols are classified by their features. Second, for each resulting class, writing rules are established for converting 15 special symbols into Tibetan. Finally, using 13 490 sentences as experimental data, special symbols and Tibetan syllables in the text are identified and validated through a Tibetan grapheme-to-phoneme conversion test, and sentences containing special symbols are normalized by rule matching. The experimental results show that the omission rate of Tibetan phoneme transcription was as high as 4.69% before normalization and fell to 0.01% after normalization, and that the normalization accuracy for Tibetan text reached 99%.

Key words: computer application technology, Tibetan text analysis, text normalization, text-to-speech, special symbols, grapheme-to-phoneme

CLC Number: TP391

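Table 1

Tibetan special symbols (partial)

Function | Marker symbols | Function | Marker symbols
Initial marks | ?? ? ? ? | Chanting cue marks | ? ? ? ?
Sentence-final marks | ? ? ? ? | Ornamental symbols | ? ? ? ?
Calendrical and astrological signs | ? ? ? ? | Punctuation marks | ? ? ? ?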

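Table 2

Comparison table of Tibetan basic digit symbols

Item | Tibetan digit symbols
Symbol | ༠ | ༡ | ༢ | ༣ | ༤ | ༥ | ༦ | ༧ | ༨ | ༩
Chinese meaning |
Tibetan meaning | ཀླད་ཀོར | གཅིག | གཉིས | གསུམ | བཞི | ལྔ | དྲུག | བདུན | བརྒྱད | དགུ
Arabic numeral | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9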

Table 3

Comparison table of Tibetan half-digit symbols

Tibetan digit symbol | ༳ | ༪ | ༫ | ༬ | ༭ | ༮ | ༯ | ༰ | ༱ | ༲
Arabic numeral | -0.5 | 0.5 | 1.5 | 2.5 | 3.5 | 4.5 | 5.5 | 6.5 | 7.5 | 8.5
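
The digit values listed in Tables 2 and 3 coincide with the numeric properties that Unicode assigns to the Tibetan digits (U+0F20 to U+0F29) and half digits (U+0F2A to U+0F33). The following is a minimal sketch of reading those values with Python's standard `unicodedata` module; the function names are illustrative and are not taken from the paper.

```python
import unicodedata

# Tibetan digits zero..nine and the ten half-digit symbols, by code point.
TIBETAN_DIGITS = "".join(chr(cp) for cp in range(0x0F20, 0x0F2A))
TIBETAN_HALF_DIGITS = "".join(chr(cp) for cp in range(0x0F2A, 0x0F34))


def numeric_values(chars):
    """Return the Unicode numeric value of each character as a float."""
    return [unicodedata.numeric(ch) for ch in chars]


def tibetan_digits_to_ascii(text):
    """Replace Tibetan decimal digits with ASCII digits; keep everything else."""
    out = []
    for ch in text:
        if "\u0f20" <= ch <= "\u0f29":
            out.append(str(unicodedata.digit(ch)))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for ch, value in zip(TIBETAN_HALF_DIGITS, numeric_values(TIBETAN_HALF_DIGITS)):
        print(f"U+{ord(ch):04X} -> {value}")
```

In code point order the half-zero symbol (value -0.5) comes last, whereas Table 3 lists it first.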

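Table 4

Writing method of Tibetan place-value words

Item | Place value
Number | 1 | 10 | 100 | 1 000 | 10 000 | 100 000 | 1 000 000 | 10 000 000 | 100 000 000
Meaning | | | | | | hundred thousand | million | ten million | hundred million
Tibetan writing | ????????????????????????? or ???????????? or ??????????? or ?????????????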

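Table 5

Writing method of Tibetan single cardinal words

Item | Single cardinal words
Number | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Meaning |
Tibetan writing | ཀླད་ཀོར | གཅིག | གཉིས | གསུམ | བཞི | ལྔ | དྲུག | བདུན | བརྒྱད | དགུ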

Table 6

Tibetan numeral connectives

Connective | རྩ | སོ | ཞེ | ང | རེ | དོན | གྱ | གོ
Position of use | 21~29 | 31~39 | 41~49 | 51~59 | 61~69 | 71~79 | 81~89 | 91~99
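
Table 6 assigns one connective to each decade from 21~29 to 91~99. As a rough illustration of how such a table can drive a rule, the sketch below spells out two-digit numbers from 20 to 99 as tens word, connective, then unit word. The Tibetan spellings are the standard written forms and are assumed, not confirmed, to match the paper's tables; the function name is hypothetical.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter

# Standard written forms (assumption: these match the paper's Tables 5 and 6).
UNITS = {1: "གཅིག", 2: "གཉིས", 3: "གསུམ", 4: "བཞི", 5: "ལྔ",
         6: "དྲུག", 7: "བདུན", 8: "བརྒྱད", 9: "དགུ"}
TENS = {2: "ཉི་ཤུ", 3: "སུམ་ཅུ", 4: "བཞི་བཅུ", 5: "ལྔ་བཅུ",
        6: "དྲུག་ཅུ", 7: "བདུན་ཅུ", 8: "བརྒྱད་ཅུ", 9: "དགུ་བཅུ"}
# One connective per decade, keyed by the tens digit (21~29 -> 2, ..., 91~99 -> 9).
CONNECTIVES = {2: "རྩ", 3: "སོ", 4: "ཞེ", 5: "ང",
               6: "རེ", 7: "དོན", 8: "གྱ", 9: "གོ"}


def two_digit_words(n):
    """Spell an integer in 20..99 as tens word + connective + unit word."""
    if not 20 <= n <= 99:
        raise ValueError("this sketch only covers 20..99")
    tens, unit = divmod(n, 10)
    if unit == 0:
        # Round tens take no connective.
        return TENS[tens]
    return TSHEG.join([TENS[tens], CONNECTIVES[tens], UNITS[unit]])


if __name__ == "__main__":
    print(two_digit_words(32))
```

For 32 this yields four syllables, which agrees in structure with the phoneme sequence "ssum ccuw ssow njis" given for 32 in Table 18.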

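Table 7

Writing format of Tibetan months

Month | Forms appearing in text | Standard writing
Month 1 | ?????1 ???1 | ??????????
Month 2 | ?????2 ???2 | ???????????
Month 3 | ?????3 ???3 | ???????????
Month 4 | ?????4 ???4 | ??????????
Month 5 | ?????5 ???5 | ?????????
Month 6 | ?????6 ???6 | ???????????
Month 7 | ?????7 ???7 | ???????????
Month 8 | ?????8 ???8 | ????????????
Month 9 | ?????9 ???9 | ??????????
Month 10 | ?????10 ???10 | ??????????
Month 11 | ?????11 ???11 | ???????????????
Month 12 | ?????12 ???12 | ???????????????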

Table 8

Time writing format

No. | Meaning | Standard writing
1 | Hour | ??????
2 | | ?????
3 | | ?????

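Table 9

Writing format of common telephone numbers

No. | Telephone number format | Meaning | Tibetan writing rule
1 | 0 + mobile number | Landline call to a non-local number | Transcribe the digits one by one, from left to right
2 | + country code - area code - subscriber number | International telephone number | Transcribe the digits one by one, from left to right
3 | Area code + ordinary number | Domestic telephone number | Transcribe the digits one by one, from left to right
4 | 3-, 4-, or 5-digit number | Special telephone number | Transcribe the digits one by one, from left to right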
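
The telephone rule in Table 9 is to transcribe a number digit by digit, from left to right. A minimal sketch of that rule follows. The digit words are the standard Tibetan cardinal words, assumed to match Table 5; skipping separators such as "+" and "-" is also an assumption, since this excerpt does not say how they are read. Function names are hypothetical.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter

# Standard Tibetan cardinal words for 0..9 (assumed to match the paper's Table 5).
DIGIT_WORDS = {
    "0": "ཀླད་ཀོར", "1": "གཅིག", "2": "གཉིས", "3": "གསུམ", "4": "བཞི",
    "5": "ལྔ", "6": "དྲུག", "7": "བདུན", "8": "བརྒྱད", "9": "དགུ",
}


def digit_words(number):
    """Return one Tibetan word per ASCII digit, in left-to-right order.

    Characters that are not ASCII digits (for example '+', '-' or spaces)
    are skipped; that choice is an assumption of this sketch.
    """
    return [DIGIT_WORDS[ch] for ch in number if ch in DIGIT_WORDS]


def spell_digits(number):
    """Join the per-digit words with the syllable delimiter."""
    return TSHEG.join(digit_words(number))


if __name__ == "__main__":
    print(spell_digits("110"))
    print(spell_digits("119"))
```

Reading "110" this way gives the words for one, one, zero, which matches the order of the phoneme sequence "ccig ccig llad gvor" shown for "110" in Table 18.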

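Table 10

Standard writing format of common mathematical operators

Symbol | Meaning | Tibetan meaning | Tibetan writing rule
+ | Plus | ?????? | x+??????+y
- | Minus | ???? | x+????+y
× or * | Multiplied by | ????? | x+?????+y
÷ or / | Divided by | ???? | x+????+y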

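Table 11

Standard writing format of common relation symbols

Symbol | Meaning | Tibetan meaning | Example | Tibetan writing rule
< | Less than | ???????? | x<y | x+???+y
> | Greater than | ??????? | x>y | x+??+y
= | Equal to | ????? | x=y | x+?????+y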
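
Tables 10 and 11 share one rule shape: the left operand, then the Tibetan word for the symbol, then the right operand. The sketch below shows that substitution step only. The Tibetan words are not recoverable from this excerpt, so the values in `OPERATOR_WORDS` are hypothetical placeholders, and the numbers are left as digits for a separate number rule to handle.

```python
import re

# Hypothetical placeholders: the actual Tibetan words are lost in this excerpt.
OPERATOR_WORDS = {
    "+": "<PLUS>", "-": "<MINUS>",
    "×": "<TIMES>", "*": "<TIMES>",
    "÷": "<DIVIDED_BY>", "/": "<DIVIDED_BY>",
    "<": "<LESS_THAN>", ">": "<GREATER_THAN>", "=": "<EQUALS>",
}

# A token is either a number (optionally with a decimal part) or one operator.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[+\-×*÷/<>=]")


def rewrite_expression(expr):
    """Rewrite 'x op y' as 'x WORD y', keeping operands in their original order."""
    tokens = TOKEN.findall(expr)
    return " ".join(OPERATOR_WORDS.get(tok, tok) for tok in tokens)


if __name__ == "__main__":
    print(rewrite_expression("1+5=6"))
```

A leading "-" as in "-15 ℃" is a sign rather than a subtraction, so a real rule set would need to tell the two apart before applying this substitution.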

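Table 12

Comparison table of unit abbreviations

No. | Abbreviation | Meaning | Standard Tibetan writing
1 | mm | Millimeter | ????????
2 | mL | Milliliter | ????????
3 | g | | ??
4 | | Degree Celsius | ??????????????
5 | ° | | ????
6 | @ | At | ????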

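Table 13

Abbreviations of special terms

No. | Special symbol | Meaning | Standard format
1 | CBA | Chinese professional basketball league | CBA
2 | NBA | American professional basketball league | NBA
3 | UEFA | European Champions League | UEFA
4 | VIP | Very important person | VIP
5 | CEO | Chief executive officer | CEO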

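Table 14

Table of rules for money

Currency | Standard Tibetan writing | Standard Chinese writing | Tibetan writing rule
USD, $ | ??????????????? or ??????? | 美元 (US dollar) | Currency name first, number after
CNY, ¥ | ????????????????? or ????? | 人民币 (renminbi) | Currency name first, number after
AUD, A$ | ?????????????????????? | 澳大利亚元 (Australian dollar) | Currency name first, number after
 | ????????? | 英镑 (pound sterling) | Currency name first, number after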

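Table 15

Common punctuation symbols in Tibetan text

Punctuation | Meaning | Function | Standard format
( ) | Round brackets | Marks pauses, tone, and the like in text | Not pronounced; not transcribed into Tibetan
[] | Square brackets | Marks pauses, tone, and the like in text | Not pronounced; not transcribed into Tibetan
{ } | Curly brackets (braces) | Marks pauses, tone, and the like in text | Not pronounced; not transcribed into Tibetan
《 》 | Book title marks | Marks pauses, tone, and the like in text | Not pronounced; not transcribed into Tibetan
 | Colon | Marks pauses, tone, and the like in text | Not pronounced; not transcribed into Tibetan
 | Connecting dash | Marks pauses, tone, and the like in text | Not pronounced; not transcribed into Tibetan
“” | Quotation marks | Marks pauses, tone, and the like in text | Not pronounced; not transcribed into Tibetan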

Fig.1

Experimental flow chart
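
The abstract describes a first step in which special symbols are identified in each sentence before rule matching is applied. One simple way to approximate that step, offered here as an assumption rather than as the authors' procedure, is to flag runs of characters outside the Unicode Tibetan block (U+0F00 to U+0FFF) and runs of Tibetan digit symbols (U+0F20 to U+0F33). All names below are illustrative.

```python
import re

# Runs of characters that are neither in the Tibetan block nor whitespace.
NON_TIBETAN = re.compile(r"[^\u0F00-\u0FFF\s]+")
# Runs of Tibetan digits and half digits.
TIBETAN_NUMERIC = re.compile(r"[\u0F20-\u0F33]+")


def detect_special_symbols(sentence):
    """Return (start, kind, text) for every candidate span, sorted by position."""
    hits = []
    for kind, pattern in (("non_tibetan", NON_TIBETAN),
                          ("tibetan_digit", TIBETAN_NUMERIC)):
        for match in pattern.finditer(sentence):
            hits.append((match.start(), kind, match.group()))
    return sorted(hits)


if __name__ == "__main__":
    # Three Tibetan characters, a tsheg, "650", a tsheg, then Tibetan digits 1 and 2.
    sample = "\u0f56\u0f7c\u0f51\u0f0b650\u0f0b\u0f21\u0f22"
    for start, kind, text in detect_special_symbols(sample):
        print(start, kind, text)
```

Each detected span would then be handed to the rule for its class, for example the number, date, telephone, or operator rules tabulated above. Tibetan marker symbols inside the Tibetan block are not covered by this sketch.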

Table 16

Results of the Tibetan phoneme transcription test before normalization

YT | YF | TT | TF | OR/%
6 295 | 178 | 0 | 132 | 4.69

Table 17

Results of the Tibetan phoneme transcription test after normalization

Manual intervention | YT | YF | TT | TF | OR/%
 | 6 902 | 0 | 0 | 1 | 0.01
 | 6 724 | 178 | 0 | 1 | 2.59
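
Reading the counts in Tables 16 and 17 as YT, YF, TT, TF in that order, the reported omission rates (OR) are reproduced by OR = (YF + TF) / (YT + YF + TT + TF). This formula is inferred from the table values and is not stated in this excerpt, so the short check below should be read as a consistency test rather than as the paper's definition.

```python
def omission_rate(yt, yf, tt, tf):
    """Omission rate in percent, under the inferred formula (YF + TF) / total."""
    total = yt + yf + tt + tf
    if total == 0:
        raise ValueError("no items counted")
    return 100.0 * (yf + tf) / total


# (YT, YF, TT, TF, reported OR in %) as read from Tables 16 and 17.
ROWS = [
    (6295, 178, 0, 132, 4.69),  # Table 16, before normalization
    (6902, 0, 0, 1, 0.01),      # Table 17, first row
    (6724, 178, 0, 1, 2.59),    # Table 17, second row
]

if __name__ == "__main__":
    for yt, yf, tt, tf, reported in ROWS:
        computed = round(omission_rate(yt, yf, tt, tf), 2)
        print(f"computed {computed:.2f}%  reported {reported:.2f}%")
```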

Table 18

Example of experimental results

Test sentence | Before normalization: special symbol detection result (phoneme transcription test 1) | After normalization: normalized text | After normalization: standard Tibetan phoneme sequence (phoneme transcription test 2)
???????????650???????????????????????? | mxii ttej cxii 650 lxaa nnee lxen jqed gjuw rred | ????????????????????????????????????????????????????? | mxii ttej cxii chug gjaa ngaa ccuw lxaa nnee lxen jqed gjuw rred
???1????10????????????????????“110??????????????????”???32?????? | ddaa 1 tses 10 nxin nzii gjal yxoj gbii “110 chil zhaf nxin mxow ” ah 32 bvaa yxin | ??????????????????????????????????????????????????????????????????????????????????????????????????? | ddaa dtah bvow tses ccuw nxin nzii gjal yxoj gbii ccig ccig llad gvor chil zhaf nxin mxow ah ssum ccuw ssow njis bvaa yxin
???????????????????????????????????????????????????????????????????? | ggee dqun qqos ppel nzii ccii lxow ???? lxoi ccii ddaa ? bvai tses ?? nxin cxuj | ?????????????????????????????????????????????????????????????????????????????????????????????????????? | ggee dqun qqos ppel nzii ccii lxow qqig dvoh gguw gjaa ccuw mxed ssum lxoi ccii ddaa xqii bvai tses nxii xxuw nxin cxuj
???????(1990~2022)????????????????????? | kkoh nzii (1990~2022)phar phod dtuw xquf bvaa rred | ????????????????????????????????????????????????????????????????????????????????????????? | kkoh nzii qqig dvoh gguw gjaa gguw ccuw nzas nxis dvoh gjaa mxed nxii xxuw zzaa njis phar phod dtuw xquf bvaa rred
“119”?????????????????????????????? | “119” ah chaj gkah jquh dtuw nnon mxii qqog | ???????????????????????????????????????????? | ccig ccig gguw ah chaj gkah jquh dtuw nnon mxii qqog
???????????????-15 ℃?????????? | ppal qqer chod tsad -15 ℃ yxod bvaa ztaa | ???????????????????????????????????????????????????? | ppal qqer chod tsad llad gvor vvog gkii dvuv ccow ngaa yxod bvaa ztaa
?????????1+5=6 ??????????????????????? | zzis xqii 1+5=6 llob qquh gkii nzah dton rred | ?????????????????????????????????????????????????????????? | zzis xqii ccig ddop ngaa tsuj chug llob qquh gkii nzah dton rred
???????????????500.6???? | hid tsad lxaa mmis 500.6 dqug | ???????????????????????????????????? | jhid tsad lxaa mmis ngaa gjaa tseg chug dqug
??????????????:?????????? | dtaa dvaa qquw tsod ??:?? rred dqug | ??????????????????????????????????????? | dtaa dvaa qquw tsod gguw bvaa dtah gvar mxaa ccuw rred dqug
??????????50%????????????????? | chaj bqor 50% phar mxah dtuw dvah yxod | ????????????????????????????????????????? | chaj bqor gjaa qqaa ngaa ccuw phar mxah dtuw dvah yxod
????????1984??????????????????????????????????????????NBA???????????????? | ccii lxow 1984 lxor ngos ssuw aa rrii lxag zzed bvow lxow ttun tsof NBA nzah ngos ssuw xquf | ????????????????????????????????????????????????????????????????????????????????????NBA???????????????? | ccii lxow qqig dvoh gguw gjaa gjad ccuw gtaa xqii lxor ngos ssuw aa rrii lxag zzed bvow lxow ttun tsof NBA nzah ngos ssuw xquf
???????????????(1991~2001)???????????????????????? | kkoh nzii lxow nkow ??(1991~2001)lhaa ssar tsow phaa gbel phaa rred | ?????????????????(??????????????????????????????????????????????????????????????????????)???????????????????????? | kkoh nzii lxow nkow ccuw qqig dvoh gguw gjaa gguw ccuw gkow ccig tten nxis dvoh gjaa mxed ccuw mxed ccig lhaa ssar tsow phaa gbel phaa rred