Journal of Jilin University (Information Science Edition) ›› 2021, Vol. 39 ›› Issue (6): 720-725.


Annotation System of File Secret Information for Power Grid Enterprise Based on Transformer

DONG Tian, LI Guang, YANG Zhenyu, ZHANG Bo, YU Bo, WANG Wei

  1. General Committee Office, State Grid Jilin Electric Power Supply Company, Changchun 130021, China
  • Received: 2021-10-15 Online: 2021-12-01 Published: 2021-12-02
  • About the author: DONG Tian (1986- ), male, native of Changchun, senior engineer at State Grid Jilin Electric Power Supply Company; research interests include pattern recognition and artificial intelligence. (Tel) 86-13154392086, (E-mail) 2414602110@qq.com.
  • Supported by:
    Science and Technology Fund Project of State Grid Jilin Company (522342210001)

Abstract: Given the massive volume of enterprise files, labeling secret information purely by hand is time-consuming and laborious, and the classification criteria are subject to human judgment. Automatic confidentiality classification of enterprise files is therefore an important problem that urgently needs to be solved in enterprise confidentiality management. To this end, a Transformer-based secret information annotation system for power grid enterprise files is proposed. It comprises file preprocessing, Chinese word segmentation, word vector construction, and secret information annotation. The proposed model was trained and tested on a data set built from internal core commercial secret files and ordinary commercial secret files of State Grid Jilin Electric Power Supply Company. The results show an accuracy of 97.79% and a recall of 99.08%. The model achieves strong recognition performance and identifies secret information accurately: only a very small amount of secret information is left unlabeled, which effectively prevents its leakage.

Key words: secret information annotation, deep learning, Chinese word segmentation, word embedding, enterprise secrets

CLC Number:

  • TP305
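
The two figures reported in the abstract can be read as follows: recall measures how much of the true secret information the system labels, so a high recall corresponds to very little secret information being missed. The sketch below is illustrative only and is not taken from the paper. It assumes token-level binary labels (1 = secret, 0 = not secret) and assumes that "accuracy" means the share of tokens labeled correctly; the function name `accuracy_and_recall` and the toy labels are hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_and_recall(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, recall) for binary labels, where 1 marks a secret token.

    Assumption: evaluation is done per token; the paper may use a different unit.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one label")

    # True positives: secret tokens that were labeled as secret.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    # False negatives: secret tokens that were missed.
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # Tokens whose predicted label equals the gold label.
    correct = sum(1 for g, p in zip(gold, pred) if g == p)

    accuracy = correct / len(gold)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Toy labels made up for illustration; these are not the paper's data.
gold = [0, 1, 1, 0, 0, 1, 0, 0]
pred = [0, 1, 1, 0, 1, 0, 0, 0]
acc, rec = accuracy_and_recall(gold, pred)
print(f"accuracy={acc:.2%}, recall={rec:.2%}")
```

In this toy case one of the three secret tokens is missed, so recall is two thirds. A recall of 99.08% would mean that fewer than one in a hundred secret items goes unlabeled.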