J4 ›› 2009, Vol. 27 ›› Issue (06): 611-.

• Articles


李 巍1a,孙 涛1a,叶苑苑1b,李雄飞1a,李 楠2   

  1. 1吉林大学 a.计算机科学与技术学院;b.软件学院,长春 130012;2长春轨道客车股份有限公司|长春 130062
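  • Corresponding author: LI Wei, E-mail: autumnal_mood@163.com
  • About the authors: LI Wei (born 1983), male, native of Siping, Jilin Province; master's student at Jilin University; research interests: database technology and XML data mining; Tel: 86-13504475893; E-mail: autumnal_mood@163.com. LI Xiong-fei (born 1963), male, native of Changchun; professor and doctoral supervisor at Jilin University; research interests: data mining and knowledge discovery, grid computing, and information fusion; Tel: 86-13943095868; E-mail: lxf@jlu.edu.cn.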


XML Document Clustering Research Based on Weighted Cosine Similarity

LI Wei1a,SUN Tao1a,YE Yuan-yuan1b,LI Xiong-fei1a|LI Nan2

  1. 1a. College of Computer Science and Technology; 1b. College of Software, Jilin University, Changchun 130012, China; 2. Changchun Railway Vehicles Company Limited, Changchun 130062, China
  • Online:2009-11-20 Published:2009-12-18




To mine the knowledge hidden in structures that seldom change over the revision history of an XML (Extensible Markup Language) document, this paper proposes a method for finding frozen structures. Each XML document is represented by a document vector model composed of a set of frozen structures, and the weighted Jaccard coefficient is used as the similarity measure. XML documents are then clustered on the basis of the frozen structures that remain relatively stable throughout their change history. Experiments show that frozen structures allow XML documents to be clustered effectively, and that the documents in each resulting cluster share similar infrequently changing structures.

Key words: extensible markup language (XML) document, document clustering, weighted Jaccard coefficient, frozen structures
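
The similarity measure named in the abstract can be illustrated with a minimal sketch. It assumes the common definition of the weighted Jaccard coefficient (sum of element-wise minima divided by sum of element-wise maxima over non-negative weights). The abstract does not state the paper's exact weighting scheme, so the function name `weighted_jaccard`, the path-like structure labels, and the weights below are all hypothetical.

```python
from typing import Dict


def weighted_jaccard(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Weighted Jaccard coefficient of two sparse, non-negative weight vectors.

    Each dict maps a frozen-structure label to its weight in one document.
    A structure missing from a document is treated as having weight 0.
    """
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    # Two empty vectors share nothing measurable; report 0 by convention.
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical document vectors: labels and weights are illustrative only.
doc1 = {"/catalog/book/title": 1.0, "/catalog/book/author": 0.5}
doc2 = {"/catalog/book/title": 1.0, "/catalog/book/price": 0.5}

similarity = weighted_jaccard(doc1, doc2)
print(similarity)
```

Under this definition, two documents with identical frozen-structure weights score 1, and documents with no frozen structure in common score 0. A clustering step could then group documents whose pairwise similarity exceeds a chosen threshold.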


  • TP391.1