Journal of Jilin University (Engineering and Technology Edition) ›› 2015, Vol. 45 ›› Issue (4): 1242-1252. DOI: 10.13229/j.cnki.jdxbgxb201504032


K-topic increment training algorithm based on LDA

XIN Yu1, YANG Jing1, XIE Zhi-qiang2   

1. College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China;
  2. College of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
Received: 2013-10-11; Online: 2015-07-01; Published: 2015-07-01

Abstract: In the LDA model, the number of topics k must be given in advance, so choosing the optimal k requires repeated training over candidate values, which increases the complexity of the algorithm. To solve the problem of choosing the optimal number of topics k, an LDA topic-increment training algorithm is proposed. First, the entropy of the word-topic probability is taken as the criterion for extracting fuzzy words during the iterative process, and a new topic is generated from the extracted fuzzy words. Second, the dimensions of the global parameter β (the word-topic probability matrix), the Dirichlet parameter α, and the number of topics k are increased in the process of variational inference. Third, the variational training algorithm is executed with the transformed global parameters β, α, and k. Finally, the LDA topic-increment training algorithm is executed cyclically and stopped when the likelihood converges, completing the incremental training of k. Experiments on real datasets verify the effectiveness and feasibility of the proposed algorithm for determining the optimal number of topics k.
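To make the training cycle concrete, the following is a minimal NumPy sketch of the increment loop described in the abstract, assuming β is a k×V word-topic probability matrix and that an ordinary variational LDA trainer is available. The names word_topic_entropy, expand_parameters and fit_lda, the entropy threshold tau, and the smoothing choices are illustrative assumptions, not the authors' implementation.

import numpy as np

def word_topic_entropy(beta):
    # beta: (k, V) word-topic probability matrix, beta[i, w] = p(w | topic i).
    # Returns H(w) = -sum_i p(i | w) log p(i | w), the entropy of each word's
    # topic distribution; a high-entropy word is treated as a "fuzzy" word.
    p = beta / beta.sum(axis=0, keepdims=True)      # p(topic | word), (k, V)
    return -(p * np.log(p + 1e-12)).sum(axis=0)     # entropy per word, (V,)

def expand_parameters(beta, alpha, fuzzy, smooth=1e-3):
    # Grow k -> k+1: build one new topic from the extracted fuzzy words and
    # append one new component to the Dirichlet parameter alpha.
    k, V = beta.shape
    new_topic = np.full(V, smooth)
    new_topic[fuzzy] += beta[:, fuzzy].mean(axis=0)  # put mass on fuzzy words
    new_topic /= new_topic.sum()
    return np.vstack([beta, new_topic]), np.append(alpha, alpha.mean())

def increment_training(docs, beta, alpha, fit_lda, tau=2.0, eps=1e-4):
    # fit_lda(docs, beta, alpha) is assumed to run standard variational LDA
    # training and return the updated (beta, alpha, log_likelihood).
    beta, alpha, ll = fit_lda(docs, beta, alpha)
    while True:
        fuzzy = np.where(word_topic_entropy(beta) > tau)[0]
        if fuzzy.size == 0:                          # no fuzzy words left
            break
        beta, alpha = expand_parameters(beta, alpha, fuzzy)
        beta, alpha, new_ll = fit_lda(docs, beta, alpha)
        if abs(new_ll - ll) < eps * abs(ll):         # likelihood converged
            break
        ll = new_ll
    return beta, alpha                               # final k = beta.shape[0]

The essential point of the sketch is that each pass adds exactly one row to β and one component to α before variational training is rerun, so k grows one topic at a time until the likelihood gain vanishes.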

Key words: artificial intelligence, LDA, variational inference, incremental training, topic classification, natural language processing

CLC Number: TP391.4