Journal of Jilin University (Engineering and Technology Edition) ›› 2024, Vol. 54 ›› Issue (9): 2631-2637. doi: 10.13229/j.cnki.jdxbgxb.20231214
• Computer Science and Technology •
Lei ZHANG1, Jing JIAO2, Bo-xin LI1, Yan-jie ZHOU1
Abstract:
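Semi-structured data are highly heterogeneous and very large in volume, and data from different sources have inconsistent structures, so extraction accuracy and completeness are low. To address this, this paper combines machine learning and deep learning and proposes an extraction algorithm for large-volume semi-structured data. Principal component analysis, a machine learning method, is used to reduce the dimensionality of the data. Building on the Transformer network structure from deep learning, the embedding layer, the encoder-decoder layers, and the encoder layer are each improved, yielding two extraction algorithms: one that recognizes named entities in the data and one that extracts relations between data entities. Together they carry out the extraction of large-volume semi-structured data. Test results show that the proposed algorithm extracts data correctly, with a minimum of only 4 invalid data items extracted, low extraction complexity, and good timeliness. Ablation experiments on the F-value and the extraction time show that combining the two techniques matters: the F-value stays above 92 and the extraction time falls to within 125 ms. The approach is therefore feasible and offers a means of improving operational efficiency and optimizing resource allocation.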
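The dimensionality-reduction step mentioned in the abstract can be illustrated with a minimal sketch of principal component analysis. The abstract does not say how semi-structured records are turned into numeric features or how many components are kept, so the sketch assumes each record is already a numeric feature vector and that the number of components `k` is chosen by the caller. The function names (`center`, `covariance`, `top_component`, `pca`) are hypothetical, and power iteration with deflation is used here only to keep the example free of third-party libraries; it is not claimed to be the authors' implementation.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=500, seed=0):
    """Approximate the leading eigenvector/eigenvalue by power iteration."""
    rng = random.Random(seed)
    d = len(cov)
    v = [rng.random() + 0.1 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left; keep the current direction
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue estimate for unit v.
    eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigval


def pca(rows, k):
    """Project rows onto their first k principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        v, lam = top_component(cov)
        components.append(v)
        # Deflation: remove the variance explained by v before the next pass.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return projected, components


# Hypothetical numeric features for five records (illustrative values only).
records = [
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.1],
]
reduced, axes = pca(records, k=2)
print(reduced)
```

Each record is reduced from three features to two coordinates along the directions of greatest variance. For real workloads a library routine based on singular value decomposition would normally replace the hand-written power iteration.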