Journal of Jilin University (Engineering and Technology Edition), 2024, Vol. 54, Issue 9: 2631-2637. doi: 10.13229/j.cnki.jdxbgxb.20231214

• Computer Science and Technology •

Large-capacity semi-structured data extraction algorithm combining machine learning and deep learning

Lei ZHANG1, Jing JIAO2, Bo-xin LI1, Yan-jie ZHOU1

  1. School of Information, Xi'an University of Finance and Economics, Xi'an 710100, China
  2. School of Economics & Management, Northwest University, Xi'an 710127, China
  • Online: 2024-09-01 Published: 2024-10-28
  • Contact: Jing JIAO E-mail: leiqsx963@163.com; jiao369jing@126.com
  • About the author: Lei ZHANG (1984-), male, senior engineer, Ph.D. Research interests: data science, machine learning. E-mail: leiqsx963@163.com
  • Funding:
    Vertical Project of the China (Xi'an) Silk Road Research Institute (2019HZ02); Vertical Project of the China (Xi'an) Silk Road Research Institute (2017SY05); Horizontal Project of Xi'an University of Finance and Economics (2022250)

Abstract:

Semi-structured data is highly heterogeneous and large in volume, and its structure varies across sources, which leads to low accuracy and completeness in data extraction. To address this, an extraction algorithm for large-capacity semi-structured data is proposed that combines machine learning and deep learning. Principal component analysis, a machine learning method, is used to reduce the dimensionality of the data. Based on the Transformer network structure from deep learning, the embedding layer, the encoder-decoder layers, and the encoder layer are improved to obtain two extraction algorithms, one for named entity recognition and one for entity relation extraction, which together extract large-capacity semi-structured data. Test results show that the proposed algorithm extracts data correctly, with a minimum of only 4 invalid data items extracted, low extraction complexity, and good timeliness. Ablation experiments on F-value and extraction time show that combining the two techniques matters: the F-value stays above 92 and the extraction time falls to within 125 ms. The algorithm is therefore feasible and offers a means of improving operational efficiency and optimizing resource allocation.

Key words: semi-structured data, machine learning, data capacity dimensionality reduction, deep learning, named entity recognition, entity relationship extraction

CLC number:

  • G255
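
The abstract names principal component analysis as the dimensionality reduction step. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. It assumes records have already been converted to numeric feature vectors, and for brevity it finds only the first principal component by power iteration. The function names and the toy records are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so every feature has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=200, seed=0):
    """Power iteration: approximates the eigenvector with the largest eigenvalue."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in cov]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Reduce each record to its coordinate along the given component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]


# Hypothetical toy records: the second feature is roughly twice the first,
# the third is nearly constant, so one component captures most of the variance.
records = [[2.0, 4.1, 1.0], [3.0, 6.0, 1.1], [4.0, 7.9, 0.9], [5.0, 10.1, 1.0]]
centered = center(records)
pc1 = top_component(covariance(centered))
scores = project(centered, pc1)  # one number per record instead of three
```

Keeping further components would require deflating the covariance matrix or using a library eigendecomposition. How many components the paper retains is not stated on this page.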

Fig. 1 Improved Transformer architecture for named entity recognition

Fig. 2 Improved Transformer architecture for entity relation extraction

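Table 1 Main parameters of the extraction algorithm

Name                                  Value
Number of encoder layers              6
Embedding layer dimension             512
Feedforward network layer dimension   2,048
Number of attention heads             8
Number of decoder layers              6
Dropout rate                          0.1
BiLSTM network layer dimension        128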
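
The parameters in Table 1 can be collected into a single configuration object. The sketch below is illustrative only: the class name `ExtractionConfig` and its field names are hypothetical. The divisibility check assumes the standard multi-head attention design, in which the embedding is split evenly across heads. Whether the improved architecture keeps that design is not stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionConfig:
    """Hyperparameters from Table 1; field names are hypothetical."""
    encoder_layers: int = 6
    decoder_layers: int = 6
    embedding_dim: int = 512
    feedforward_dim: int = 2048
    attention_heads: int = 8
    dropout: float = 0.1
    bilstm_dim: int = 128

    def __post_init__(self):
        # Assumption: heads share the embedding evenly, as in the standard Transformer.
        if self.embedding_dim % self.attention_heads != 0:
            raise ValueError("embedding_dim must be divisible by attention_heads")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

    @property
    def head_dim(self) -> int:
        """Width of each attention head under the even-split assumption."""
        return self.embedding_dim // self.attention_heads


config = ExtractionConfig()
per_head = config.head_dim  # 512 // 8
```

Under that assumption each of the 8 heads works on a 64-dimensional slice of the 512-dimensional embedding, and the feedforward layer is four times as wide as the embedding.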

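Fig. 3 Number of correctly extracted semi-structured data items

Fig. 4 Extraction complexity of semi-structured data

Fig. 5 Ablation experiment on semi-structured data extraction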
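
The ablation experiment reports an F-value. This page does not define it, so the sketch below assumes the usual F-measure: the weighted harmonic mean of precision and recall over extracted items, which would be multiplied by 100 to reach the scale used in the abstract. The function name `f_value` and the sample entities are hypothetical.

```python
def f_value(extracted, gold, beta=1.0):
    """Return (precision, recall, F) for extracted items against a gold set.

    beta=1.0 gives the balanced F1 score; this choice is an assumption.
    """
    extracted, gold = set(extracted), set(gold)
    true_pos = len(extracted & gold)
    if true_pos == 0:
        return 0.0, 0.0, 0.0
    precision = true_pos / len(extracted)
    recall = true_pos / len(gold)
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical (entity text, entity type) pairs.
gold = {("Xi'an", "LOC"), ("Northwest University", "ORG"),
        ("2024-09-01", "DATE"), ("Jilin University", "ORG")}
extracted = {("Xi'an", "LOC"), ("Northwest University", "ORG"),
             ("2024-09-01", "DATE"), ("Engineering", "ORG")}

precision, recall, f1 = f_value(extracted, gold)
```

In this toy case three of the four extracted items are correct and three of the four gold items are found, so precision, recall, and F all equal 0.75.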
