Journal of Jilin University (Engineering and Technology Edition) ›› 2015, Vol. 45 ›› Issue (4): 1242-1252. doi: 10.13229/j.cnki.jdxbgxb201504032


K-topic incremental training algorithm based on LDA

XIN Yu1, YANG Jing1, XIE Zhi-qiang2

  1. College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China;
  2. College of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
  • Received: 2013-10-11  Online: 2015-07-01  Published: 2015-07-01
  • Corresponding author: YANG Jing (1962-), female, professor, doctoral supervisor. Research interests: data and knowledge engineering. E-mail: yangjing@hrbeu.edu.cn
  • First author: XIN Yu (1987-), male, Ph.D. candidate. Research interests: data and knowledge engineering. E-mail: xinyu@hrbeu.edu.cn
  • Supported by: National Natural Science Foundation of China (61370083, 61073043, 61073041, 61370086); Specialized Research Fund for the Doctoral Program of Higher Education (20112304110011, 20122304110012)

Abstract: Since the LDA model requires the number of topics k to be given in advance, selecting the optimal k requires re-training the model on the corpus over a range of candidate values, which sharply increases the computational cost. To address the problem of selecting the optimal number of topics k, an LDA topic incremental training algorithm is proposed. First, the entropy of the word-topic probability distribution is used as the criterion for extracting fuzzy words during the LDA iterations, and the extracted fuzzy words are assigned to a new topic. Second, the dimensions of the global parameters β (the word-topic probability matrix) and α (the Dirichlet distribution parameter), as well as the number of topics k, are increased during variational inference. Third, variational training is run again with the transformed global parameters β, α and k as input. Finally, the LDA topic incremental training algorithm is invoked cyclically and the loop stops when the likelihood converges, completing the incremental training of k. Experimental analysis on real data sets verifies the effectiveness and feasibility of the proposed algorithm for selecting the optimal number of topics k.
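
As a reading aid only, the following minimal sketch (Python with NumPy) illustrates one way the increment step described above could be realised: fuzzy words are selected by the entropy of p(topic | word), a new topic seeded by those words is appended, and β, α and k are expanded before variational training would be re-run. The entropy threshold (here a quantile), the initialisation of the new topic row, and all function and variable names are assumptions introduced for illustration, not the authors' implementation.

# Minimal sketch (not the authors' code) of the topic-increment step described in
# the abstract: high-entropy ("fuzzy") words seed one new topic, and the global
# parameters beta, alpha and the topic count k are expanded before the next round
# of variational training. Threshold and new-topic initialisation are assumptions.
import numpy as np

def word_topic_entropy(beta):
    """Entropy of p(topic | word), with beta a (k, V) word-topic probability matrix."""
    p = beta / beta.sum(axis=0, keepdims=True)       # normalise over topics for each word
    return -(p * np.log(p + 1e-12)).sum(axis=0)      # shape (V,)

def increment_topics(beta, alpha, quantile=0.9):
    """Add one topic seeded by the fuzziest words and grow beta, alpha and k."""
    k, V = beta.shape
    h = word_topic_entropy(beta)
    fuzzy = h >= np.quantile(h, quantile)            # fuzzy-word selection rule (assumed)
    new_row = np.full(V, 1e-6)
    new_row[fuzzy] = 1.0
    new_row /= new_row.sum()                         # new topic concentrated on fuzzy words
    beta_new = np.vstack([beta, new_row])
    beta_new /= beta_new.sum(axis=1, keepdims=True)  # keep each row a distribution over words
    alpha_new = np.append(alpha, alpha.mean())       # extend the Dirichlet parameter
    return beta_new, alpha_new, k + 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    beta = rng.dirichlet(np.ones(200), size=5)       # 5 topics over a 200-word vocabulary
    alpha = np.full(5, 0.1)
    beta, alpha, k = increment_topics(beta, alpha)
    print(k, beta.shape, alpha.shape)                # 6 (6, 200) (6,)
    # In the full algorithm, standard variational EM for LDA would now be re-run
    # with (beta, alpha, k), and the loop stops once the corpus likelihood converges.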

Key words: artificial intelligence, LDA, variational inference, incremental training, topic classification, natural language processing

CLC number: TP391.4