吉林大学学报(工学版) ›› 2013, Vol. 43 ›› Issue (01): 135-140.


Improved E3 algorithm based on multi-agent parallel sampling and learning experience reuse

LIU Quan1,2, YANG Xu-dong1, JING Ling3, XIAO Fei1   

  1. School of Computer Science and Technology, Soochow University, Suzhou 215006, China;
    2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China;
    3. Department of Computer Science and Technology, Nanjing University, Nanjing 210093, China
  • Received: 2012-03-11   Online: 2013-01-01   Published: 2013-01-01

Abstract: The existing E3 algorithm suffers from a long convergence time, which prevents its efficient application in practice. To overcome this problem, an improved E3 algorithm based on multi-agent parallel sampling and learning experience reuse is proposed. To build an approximate model quickly, multiple agents explore the environment in parallel during the exploration phase and collect information about the environment model. In the exploitation phase, the optimal value function is retained and reused to speed up convergence. Experimental results show that the improved algorithm takes much less time to converge to a near-optimal policy and has a clear advantage over two other parallel reinforcement learning algorithms.
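The two ideas in the abstract (parallel sampling to build the empirical model faster, and reusing the value function across phases) can be sketched as follows. This is a minimal illustrative sketch on a toy chain MDP, not the authors' implementation; all names and parameters (ChainMDP, m_known, the round-robin stand-in for parallel agents) are assumptions.

```python
import random
from collections import defaultdict

class ChainMDP:
    """Toy 5-state chain: action 1 moves right, action 0 resets to state 0.
    A reward of 1 is given whenever the last state is reached."""
    n_states, n_actions = 5, 2
    def step(self, s, a):
        s2 = min(s + 1, self.n_states - 1) if a == 1 else 0
        r = 1.0 if s2 == self.n_states - 1 else 0.0
        return s2, r

def improved_e3(env, n_agents=4, m_known=5, phases=30, gamma=0.95, seed=0):
    rng = random.Random(seed)
    counts = defaultdict(int)                      # visit counts for (s, a)
    trans = defaultdict(lambda: defaultdict(int))  # empirical transition counts
    rew = defaultdict(float)                       # accumulated rewards
    V = [0.0] * env.n_states                       # value function reused across phases

    for _ in range(phases):
        # Exploration: several agents sample the environment, each taking the
        # least-tried action in its state (true parallelism is modeled here as
        # a simple round-robin over agent-local states).
        states = [rng.randrange(env.n_states) for _ in range(n_agents)]
        for _ in range(25):
            for i in range(n_agents):
                s = states[i]
                a = min(range(env.n_actions), key=lambda a2: counts[(s, a2)])
                s2, r = env.step(s, a)
                counts[(s, a)] += 1
                trans[(s, a)][s2] += 1
                rew[(s, a)] += r
                states[i] = s2
        # Exploitation: value iteration on the empirical model of "known"
        # state-action pairs, warm-started from the previous phase's V
        # (the learning experience reuse).
        for _ in range(50):
            for s in range(env.n_states):
                q = []
                for a in range(env.n_actions):
                    c = counts[(s, a)]
                    if c < m_known:
                        continue  # pair not yet known; skip it
                    r_hat = rew[(s, a)] / c
                    ev = sum(n / c * V[s2] for s2, n in trans[(s, a)].items())
                    q.append(r_hat + gamma * ev)
                if q:
                    V[s] = max(q)
    return V

V = improved_e3(ChainMDP())
```

Because the value function is carried over between phases rather than recomputed from zero, each exploitation phase starts close to the fixed point, which is the intuition behind the reported speed-up.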

Key words: artificial intelligence, reinforcement learning, E3 algorithm, multi-agent, parallel sampling, learning experience reuse

CLC Number: TP181