Journal of Jilin University (Engineering and Technology Edition), 2024, Vol. 54, Issue (9): 2631-2637. doi: 10.13229/j.cnki.jdxbgxb.20231214


Large capacity semi structured data extraction algorithm combining machine learning and deep learning

Lei ZHANG1, Jing JIAO2, Bo-xin LI1, Yan-jie ZHOU1

  1. School of Information, Xi'an University of Finance and Economics, Xi'an 710100, China
  2. School of Economics & Management, Northwest University, Xi'an 710127, China
  • Received: 2023-11-07 Online: 2024-09-01 Published: 2024-10-28
  • Contact: Jing JIAO E-mail: leiqsx963@163.com; jiao369jing@126.com

Abstract:

Semi-structured data are highly heterogeneous and very large in volume, and their structure differs from source to source, which lowers the accuracy and completeness of data extraction. To address this, an extraction algorithm for large-capacity semi-structured data is proposed that combines machine learning and deep learning. Principal component analysis, a machine learning method, is first used to reduce the dimensionality of the large-volume semi-structured data. Then, starting from the deep learning Transformer network structure, the embedding layer, the encoding and decoding layers, and the encoding layer are improved to obtain two extraction algorithms, one for recognizing named entities in the data and one for extracting relationships between data entities, which together realize the extraction of large-capacity semi-structured data. Test results show that the proposed algorithm performs well on correct extraction: the minimum number of invalid data items extracted is only 4, the extraction complexity is low, and the timeliness is high. Ablation experiments on F-value and extraction time show that combining the two techniques benefits data extraction: the F-value always stays above 92, and the extraction time is shortened to 125 ms. The algorithm is highly feasible and provides an important means of improving operational efficiency and optimizing resource allocation.
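The dimensionality reduction step mentioned in the abstract can be illustrated with a minimal sketch of principal component analysis. This is not the authors' implementation: it assumes the semi-structured records have already been converted to a numeric feature matrix, and the function name `pca_reduce`, the synthetic data, and the choice of three components are all hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    # SVD of the centered data: rows of Vt are the principal directions
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    variances = (S ** 2) / (X.shape[0] - 1)      # variance along each direction
    ratio = variances[:n_components] / variances.sum()
    return Xc @ components.T, components, ratio

# Hypothetical data: 200 records with 10 numeric features that really
# depend on only 3 latent factors, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

Z, components, ratio = pca_reduce(X, n_components=3)
print(Z.shape)       # reduced representation: one row per record
print(ratio.sum())   # share of total variance kept by the 3 components
```

In practice the number of components would be chosen from the explained-variance ratio rather than fixed in advance; the abstract does not state which criterion the paper uses.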

Key words: semi-structured data, machine learning, data capacity dimensionality reduction, deep learning, named entity recognition, entity relationship extraction

CLC Number: G255

Fig.1 Improved Transformer algorithm architecture for identifying named entities

Fig.2 Improved Transformer algorithm architecture for extracting entity relationships

Tab.1 Main parameters of extraction algorithm

Name                              Value
Encoder layers                    6
Embedding layer dimension         512
Feed-forward network dimension    2048
Attention heads                   8
Decoder layers                    6
Dropout rate                      0.1
BiLSTM layer dimension            128
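The parameters in Tab.1 can be collected into a single configuration object, as in the sketch below. The class name `ExtractionConfig` and its field names are hypothetical, and the per-head dimension assumes the standard multi-head attention convention of splitting the embedding dimension evenly across heads, which the page does not state explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionConfig:
    """Hyperparameters from Tab.1 (field names are illustrative)."""
    encoder_layers: int = 6
    decoder_layers: int = 6
    embedding_dim: int = 512
    feedforward_dim: int = 2048
    attention_heads: int = 8
    dropout: float = 0.1
    bilstm_dim: int = 128

    def head_dim(self) -> int:
        # Assumption: the embedding is split evenly across attention heads.
        if self.embedding_dim % self.attention_heads != 0:
            raise ValueError("embedding_dim must be divisible by attention_heads")
        return self.embedding_dim // self.attention_heads

config = ExtractionConfig()
print(config.head_dim())  # 64
```

Keeping the values in one immutable object makes it harder for the named entity model and the relationship model to drift apart when both are built from the same settings.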

Fig.3 Correct extraction quantity of semi-structured data

Fig.4 Complexity of semi-structured data extraction

Fig.5 Semi-structured data extraction and ablation experiment
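The F-value reported for the ablation experiment is not defined on this page. The sketch below assumes it is the usual F1 score, the harmonic mean of precision and recall, expressed on a 0 to 100 scale. The function name `f_value` and the counts in the usage line are hypothetical and are not results from the paper.

```python
def f_value(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 score on a 0-100 scale, assuming F = 2PR / (P + R)."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 100.0 * 2.0 * precision * recall / (precision + recall)

# Illustrative counts only: 90 correct extractions, 10 spurious, 5 missed.
score = f_value(true_pos=90, false_pos=10, false_neg=5)
print(round(score, 2))
```

Under this definition, a score above 92 requires both precision and recall to be high, since the harmonic mean is pulled toward the smaller of the two.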
