Journal of Jilin University (Engineering and Technology Edition) ›› 2015, Vol. 45 ›› Issue (4): 1242-1252. DOI: 10.13229/j.cnki.jdxbgxb201504032


K-topic increment training algorithm based on LDA

XIN Yu1, YANG Jing1, XIE Zhi-qiang2   

1. College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China;
  2. College of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
Received: 2013-10-11; Online: 2015-07-01; Published: 2015-07-01

Abstract: In the LDA model, the number of topics k must be given in advance, so choosing the optimal k requires repeated training over candidate values, which increases the complexity of the algorithm. To solve the problem of choosing the optimal number of topics k, an LDA topic-increment training algorithm is proposed. First, the entropy of the word-topic probability is taken as the criterion for extracting fuzzy words during the iterative process, and a new topic is generated from the extracted fuzzy words. Second, the dimensions of the global parameter β (the word-topic probability matrix), the Dirichlet parameter α, and the number of topics k are increased in the process of variational inference. Third, the variational training algorithm is executed with the transformed global parameters β, α, and k. Finally, the LDA topic-increment training algorithm is executed cyclically and stopped when the likelihood converges, completing the incremental training of k. Experiments on real datasets verify the effectiveness and feasibility of the proposed algorithm for determining the optimal number of topics k.
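To make the training cycle concrete, the following is a minimal NumPy sketch of the increment loop described in the abstract, assuming β is a k×V word-topic probability matrix and that an ordinary variational LDA trainer is available. The names word_topic_entropy, expand_parameters and fit_lda, the entropy threshold tau, and the smoothing choices are illustrative assumptions, not the authors' implementation.

import numpy as np

def word_topic_entropy(beta):
    # beta: (k, V) word-topic probability matrix, beta[i, w] = p(w | topic i).
    # Returns H(w) = -sum_i p(i | w) log p(i | w), the entropy of each word's
    # topic distribution; a high-entropy word is treated as a "fuzzy" word.
    p = beta / beta.sum(axis=0, keepdims=True)      # p(topic | word), (k, V)
    return -(p * np.log(p + 1e-12)).sum(axis=0)     # entropy per word, (V,)

def expand_parameters(beta, alpha, fuzzy, smooth=1e-3):
    # Grow k -> k+1: build one new topic from the extracted fuzzy words and
    # append one new component to the Dirichlet parameter alpha.
    k, V = beta.shape
    new_topic = np.full(V, smooth)
    new_topic[fuzzy] += beta[:, fuzzy].mean(axis=0)  # put mass on fuzzy words
    new_topic /= new_topic.sum()
    return np.vstack([beta, new_topic]), np.append(alpha, alpha.mean())

def increment_training(docs, beta, alpha, fit_lda, tau=2.0, eps=1e-4):
    # fit_lda(docs, beta, alpha) is assumed to run standard variational LDA
    # training and return the updated (beta, alpha, log_likelihood).
    beta, alpha, ll = fit_lda(docs, beta, alpha)
    while True:
        fuzzy = np.where(word_topic_entropy(beta) > tau)[0]
        if fuzzy.size == 0:                          # no fuzzy words left
            break
        beta, alpha = expand_parameters(beta, alpha, fuzzy)
        beta, alpha, new_ll = fit_lda(docs, beta, alpha)
        if abs(new_ll - ll) < eps * abs(ll):         # likelihood converged
            break
        ll = new_ll
    return beta, alpha                               # final k = beta.shape[0]

The essential point of the sketch is that each pass adds exactly one row to β and one component to α before variational training is rerun, so k grows one topic at a time until the likelihood gain vanishes.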

Key words: artificial intelligence, LDA, variational inference, incremental training, topic classification, natural language processing

CLC Number: TP391.4