Journal of Jilin University (Engineering and Technology Edition) ›› 2013, Vol. 43 ›› Issue (01): 135-140.
LIU Quan 1,2; YANG Xu-dong 1; JING Ling 3; XIAO Fei 1