2012, Issue (06): 1510-1514.


Classifying XML documents based on term semantics

ZHANG Li-jun, LI Zhan-huai, CHEN Qun, LOU Ying, LI Ning   

  1. School of Computer Science and Technology, Northwestern Polytechnical University, Xi'an 710072, China
  • Received: 2011-09-11 Online: 2012-11-01

Abstract: Because XML documents are semi-structured, their structure information carries rich term semantics. The general Term Frequency-Inverse Document Frequency (TF-IDF) approach is therefore ill-suited to measuring XML term weights: it considers only term frequency and document frequency and ignores the term semantics implied by the structure information. A novel term-weighting approach is proposed to overcome this shortcoming and improve classification performance. The new approach accounts for the factors that affect term semantics, such as the paths that contain a term, the frequency of the term within a given path, the document frequency of a given path, and the depth of the path. Experimental results on several datasets show that, compared with TF-IDF and rule-based approaches, the proposed approach improves recall, precision, and F1-measure in XML document classification.
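
The abstract names the factors but does not give the weighting formula, so the following Python sketch only illustrates the general idea of path-aware term weighting. The function names (`path_term_counts`, `path_aware_weights`) and the formula `tf * (log(N / df) + 1) / depth` are assumptions made for illustration, not the authors' method. Terms are counted per (path, term) pair, so the same word receives a different weight depending on where it appears in the document tree.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_term_counts(xml_text):
    """Count terms per root-to-element path.

    Returns (counts, depths): counts maps (path, term) -> frequency,
    depths maps path -> depth (root element has depth 1).
    Only element text is tokenized; tail text is ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    depths = {}

    def walk(elem, prefix, depth):
        path = prefix + "/" + elem.tag
        depths[path] = depth
        for term in (elem.text or "").lower().split():
            counts[(path, term)] += 1
        for child in elem:
            walk(child, path, depth + 1)

    walk(root, "", 1)
    return counts, depths


def path_aware_weights(docs):
    """Hypothetical weighting, NOT the paper's formula:
    weight = tf(path, term) * (log(N / df(path, term)) + 1) / depth(path)
    """
    parsed = [path_term_counts(d) for d in docs]
    n_docs = len(docs)

    # Number of documents in which each (path, term) pair occurs.
    doc_freq = Counter()
    for counts, _ in parsed:
        doc_freq.update(counts.keys())

    weights = []
    for counts, depths in parsed:
        w = {}
        for (path, term), tf in counts.items():
            idf = math.log(n_docs / doc_freq[(path, term)]) + 1.0
            w[(path, term)] = tf * idf / depths[path]
        weights.append(w)
    return weights


docs = [
    "<article><title>xml mining</title><body>xml data xml</body></article>",
    "<article><title>data mining</title><body>graph data</body></article>",
]
weights = path_aware_weights(docs)
for key, value in sorted(weights[0].items()):
    print(key, round(value, 3))
```

In this toy corpus the term "xml" is weighted separately under `/article/title` and `/article/body`, which plain TF-IDF over a bag of words cannot do. The resulting per-document weight dictionaries could be fed to any vector-based classifier.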

Key words: computer software, semi-structured data, XML mining, XML classification, term semantics, weight measurement

CLC Number: 

  • TP311