2012, Issue (06): 1510-1514.

Classifying XML documents based on term semantics

ZHANG Li-jun, LI Zhan-huai, CHEN Qun, LOU Ying, LI Ning

  1. School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an 710072, China
  • Received: 2011-09-11 Online: 2012-11-01
  • Corresponding author: LI Zhan-huai (born 1961), male, professor, doctoral supervisor. Research interests: data management, data storage. E-mail: lizhh@nwpu.edu.cn
  • Funding:
    National Natural Science Foundation of China (60803043, 60970070, 61033007); National High Technology Research and Development Program of China ("863" Program) (2009AA1Z134); National Basic Research Program of China ("973" Program) (2012CB316203).

Abstract: Because XML documents are semi-structured, their structure carries rich term semantics. The traditional Term Frequency-Inverse Document Frequency (TF-IDF) approach is therefore ill-suited to measuring XML term weights: it considers only how often a term occurs in a document and how many documents contain the term, and it ignores the term semantics implied by the structure. To overcome this shortcoming, a new term weight measuring approach is proposed. It takes into account the factors that affect term semantics, such as the paths on which a term occurs, the term frequency within a given path, the number of documents that contain the path, and the depth of the path, and thereby measures term weight more accurately. Experimental results on several datasets show that, when applied to XML document classification, the proposed approach improves recall, precision, and F1-measure compared with the traditional TF-IDF approach and a rule-based approach.

Key words: computer software, semi-structured data, XML mining, XML classification, term semantics, weight measurement
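
The factors named in the abstract can be illustrated with a small Python sketch. The paper's actual weighting formula does not appear on this page, so the combination used below (term frequency within a path, multiplied by a smoothed inverse document frequency of the path, divided by the path depth) is an assumption made purely for illustration. The function names `path_terms` and `path_aware_weights` are likewise hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_terms(xml_text):
    """Yield (path, term) pairs, where path is the root-to-element tag path.

    Only element text is tokenized (lowercased, whitespace-split);
    tail text and attributes are ignored to keep the sketch short.
    """
    root = ET.fromstring(xml_text)

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        for term in (elem.text or "").lower().split():
            yield path, term
        for child in elem:
            yield from walk(child, path)

    yield from walk(root, "")


def path_aware_weights(docs):
    """Return one {term: weight} dict per document.

    ASSUMED formula, not the paper's:
        weight(term) = sum over paths of
            tf(term, path) * smoothed_idf(path) / depth(path)
    """
    per_doc = [Counter(path_terms(d)) for d in docs]
    n_docs = len(docs)

    # Number of documents containing each path (paths with at least one term).
    path_df = Counter()
    for counts in per_doc:
        path_df.update({path for path, _ in counts})

    weights = []
    for counts in per_doc:
        w = Counter()
        for (path, term), tf in counts.items():
            depth = path.count("/")  # "/article/title" has depth 2
            idf_path = math.log((1 + n_docs) / (1 + path_df[path])) + 1.0
            w[term] += tf * idf_path / depth
        weights.append(dict(w))
    return weights
```

Calling `path_aware_weights` on a list of XML strings yields per-document term weights that could be fed to any vector-based classifier. Unlike plain TF-IDF, a term that occurs on a rare path contributes more than the same term on a path shared by every document, and under this assumed formula terms nearer the root count more than deeply nested ones.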

CLC number: TP311