Journal of Jilin University (Engineering and Technology Edition) ›› 2024, Vol. 54 ›› Issue (3): 797-806. doi: 10.13229/j.cnki.jdxbgxb.20220523

• Communication and Control Engineering •


LSTM-MADDPG multi-agent cooperative decision algorithm based on asynchronous collaborative update

Jing-peng GAO 1, Guo-xuan WANG 1, Lu GAO 2

  1. College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
  2. National Key Laboratory of Science and Technology on Test Physics and Numerical Mathematics, Beijing Institute of Space Long March Vehicle, Beijing 100076, China
  • Received: 2022-05-06  Online: 2024-03-01  Published: 2024-04-18
  • About the author: GAO Jing-peng (1980-), male, associate professor, Ph.D. Research interests: deep reinforcement learning. E-mail: gjpmcu@126.com
  • Supported by: Project of the State Key Laboratory of Complex Electromagnetic Environment Effects on Electronics and Information System (CEMEE2021G0001)


Abstract:

In fully cooperative tasks, the multi-agent deep deterministic policy gradient (MADDPG) algorithm suffers from poor credit assignment and unstable training. To address these problems, an LSTM-MADDPG multi-agent cooperative decision algorithm based on asynchronous collaborative update is proposed. Following the ideas of difference rewards and value decomposition, a long short-term memory (LSTM) network is used to extract features across trajectory sequences and to optimize the division of the global reward, so that each agent is assigned a reward for its own actions. To meet the requirements of joint training, a high-quality training sample set is constructed, and an asynchronous cooperative update method is designed to train the LSTM-MADDPG network jointly and stably, realizing multi-agent cooperation. Simulation results in a cooperative capture scenario show that the proposed algorithm converges 20.51% faster in training than QMIX; after training converges, the asynchronous cooperative update reduces the mean square error of the normalized reward by 57.59% compared with synchronous update, improving the stability of convergence.

Key words: artificial intelligence, multi-agent cooperative decision-making, deep reinforcement learning, credit assignment, asynchronous cooperative update

CLC number: TP18
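
To make the credit-assignment idea concrete, the sketch below is a minimal PyTorch illustration of an LSTM-based global reward division in the spirit of the abstract and Fig. 3: an LSTM summarizes each agent's trajectory, and softmax weights over those summaries split the global reward into per-agent rewards that sum back to the team reward. The module, its sizes, and the softmax weighting are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of LSTM-based global reward division (assumed form, not the paper's code).
import torch
import torch.nn as nn


class RewardDivider(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, trajs: torch.Tensor, global_reward: torch.Tensor) -> torch.Tensor:
        # trajs: (n_agents, seq_len, feat_dim) -- one observation-action sequence per agent
        # global_reward: scalar team reward to be divided
        _, (h_n, _) = self.lstm(trajs)          # h_n: (1, n_agents, hidden_dim)
        scores = self.score(h_n.squeeze(0))     # (n_agents, 1)
        weights = torch.softmax(scores, dim=0)  # weights sum to 1 across agents
        return weights.squeeze(-1) * global_reward  # per-agent rewards summing to the global reward


# Example: split a global reward of 10 among 5 agents with 30-step trajectories.
divider = RewardDivider(feat_dim=8)
per_agent_reward = divider(torch.randn(5, 30, 8), torch.tensor(10.0))
```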

Fig. 1  Schematic of multi-agent state transitions

Fig. 2  Centralized training structure of MADDPG

Fig. 3  LSTM-based global reward division structure

Fig. 4  Asynchronous cooperative update method
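
The loop below is a hedged sketch of an asynchronous cooperative update schedule of the kind named in Fig. 4, assuming the MADDPG actor-critic networks are updated every episode from the main replay buffer while the reward-division LSTM is updated only every N episodes from a smaller buffer of selected high-return samples. The buffer roles, the selection rule, and the interfaces (env, maddpg, reward_lstm) are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch of an asynchronous cooperative update schedule (assumed, for illustration only).
import random
from collections import deque

N = 5                              # assumed LSTM update period, in episodes (Table 3: N)
replay_d1 = deque(maxlen=25_600)   # main MADDPG replay buffer (Table 3: D1)
replay_d2 = deque(maxlen=128)      # assumed high-quality sample set (Table 3: D2)


def train(env, maddpg, reward_lstm, episodes=60_000, steps=30):
    best_return = float("-inf")
    for ep in range(episodes):
        obs, transitions, ep_return = env.reset(), [], 0.0
        for _ in range(steps):
            actions = maddpg.act(obs)
            next_obs, global_r, done, _ = env.step(actions)
            transitions.append((obs, actions, global_r, next_obs, done))
            ep_return += global_r
            obs = next_obs
            if done:
                break

        replay_d1.extend(transitions)
        if ep_return >= best_return:        # crude "high-quality" filter (assumption)
            best_return = ep_return
            replay_d2.extend(transitions)

        # Asynchronous schedule: actors/critics every episode, reward LSTM every N episodes.
        if len(replay_d1) >= 1024:
            maddpg.update(random.sample(replay_d1, 1024))
        if ep % N == 0 and len(replay_d2) >= 32:
            reward_lstm.update(random.sample(replay_d2, 32))
```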

Fig. 5  Multi-agent cooperative capture scenario

Table 1  Parameter settings of the cooperative capture scenario

Parameter           Value     Parameter                       Value
Scenario range      2×2       Number of targets               5
Number of agents    5         Target radius                   0.02
Agent radius        0.06      Simulation time-step interval   0.1
Velocity range      [0, 1]    Number of simulation steps      30

Table 2  Reward settings for cooperative capture

Agent action              Judgment          Reward value
Approach/leave target     Reward/penalty    r_g
Capture a target          Reward            10
Collision                 Penalty           -5
Crossing the boundary     Penalty           -5
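
As a direct restatement of Table 2, the function below computes the per-step reward for one agent. Since the table only labels the shaping term r_g as a reward/penalty for approaching or leaving the target, the difference-of-distances form used here is an assumption.

```python
# Per-step reward for one agent, following Table 2; the form of r_g is assumed.
def step_reward(dist_prev: float, dist_now: float,
                captured: bool, collided: bool, out_of_bounds: bool,
                target_radius: float = 0.02) -> float:
    r = dist_prev - dist_now          # r_g: positive when closing on the target (assumed form)
    if captured or dist_now <= target_radius:
        r += 10.0                     # capture a target: +10
    if collided:
        r -= 5.0                      # collision: -5
    if out_of_bounds:
        r -= 5.0                      # crossing the boundary: -5
    return r
```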

Table 3  Training parameter settings

Parameter    Value     Parameter       Value
Episode      60 000    Batch_size 1    1024
Step         30        Batch_size 2    32
D1           25 600    γ               0.95
D2           128       τ               0.000 1
Lr           0.01      N               5
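
For convenience, the dictionary below restates the Table 3 settings in code form. The comments interpreting D1/D2 as the main replay buffer and the high-quality sample set, and the two batch sizes as the MADDPG and reward-LSTM batches, are assumptions rather than definitions given on this page.

```python
# Table 3 restated as a config dict; the role comments are assumptions.
TRAIN_CONFIG = {
    "episodes": 60_000,       # Episode
    "steps_per_episode": 30,  # Step
    "buffer_d1": 25_600,      # D1: assumed main replay buffer size
    "buffer_d2": 128,         # D2: assumed high-quality sample set size
    "batch_size_1": 1024,     # Batch_size 1: assumed MADDPG update batch
    "batch_size_2": 32,       # Batch_size 2: assumed reward-LSTM update batch
    "lr": 0.01,               # Lr: learning rate
    "gamma": 0.95,            # γ: discount factor
    "tau": 1e-4,              # τ: soft target-update rate
    "n": 5,                   # N: assumed asynchronous update period (episodes)
}
```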

Fig. 6  Normalized training reward curves of the three algorithms

Table 4  Training time comparison of the three algorithms

Algorithm                  Average time per 100 episodes/s    Total training time/h
MADDPG                     13.79                              2.30
QMIX                       15.63                              2.61
Synchronous LSTM-MADDPG    18.59                              3.10

Fig. 7  Normalized training reward curves of the two update methods

Table 5  Training time results of the two algorithms

Algorithm                   Average time per 100 episodes/s    Total training time/h
Asynchronous LSTM-MADDPG    16.48                              2.75
Synchronous LSTM-MADDPG     18.59                              3.10

Table 6  Test results of the three algorithms in the cooperative capture scenario

Algorithm    Average number of collisions    Average distance to target/unit
Proposed     1.015                           0.048
QMIX         1.142                           0.094
MADDPG       1.223                           0.125
1 Feng J, Li H, Huang M, et al. Learning to collaborate: multi-scenario ranking via multi-agent reinforcement learning[C]∥Proceedings of the 2018 World Wide Web Conference, Lyon, France, 2018: 1939-1948.
2 Yang Shun, Jiang Yuan-de, Wu Jian, et al. Autonomous driving policy learning based on deep reinforcement learning and multi-type sensor data[J]. Journal of Jilin University (Engineering and Technology Edition), 2019, 49(4): 1026-1033.
3 Hernandez-Leal P, Kartal B, Taylor M E. A survey and critique of multi-agent deep reinforcement learning[J]. Autonomous Agents and Multi-Agent Systems, 2019, 33(6): 750-797.
4 Lowe R, Wu Y, Tamar A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments[C]∥Proceedings of the 31st International Conference on Neural Information Processing Systems, New York, USA, 2017: 6382-6393.
5 Nguyen T T, Nguyen N D, Nahavandi S. Deep reinforcement learning for multi-agent systems: a review of challenges, solutions, and applications[J]. IEEE Transactions on Cybernetics, 2020, 50(9): 3826-3839.
6 Holmesparker C, Taylor M E, Agogino A, et al. Cleaning the reward: counterfactual actions to remove exploratory action noise in multi-agent learning[C]∥Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, Paris, France, 2014: 1353-1354.
7 Chang Y H, Ho T, Kaelbling L P. All learning is local: multi-agent learning in global reward games[C]∥Proceedings of the 17th Advances in Neural Information Processing Systems, Cambridge, USA,2004: 807-814.
8 Devlin S, Yliniemi L, Kudenko D, et al. Potential based difference rewards for multi-agent reinforcement learning[C]∥Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, Paris, France, 2014: 165-172.
9 Foerster J N, Farquhar G, Afouras T, et al. Counterfactual multi-agent policy gradients[C]∥Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA, 2018: 2974-2982.
10 Chen J, Guo L, Jia J, et al. Resource allocation for IRS assisted SGF NOMA transmission: a MADRL approach[J]. IEEE Journal on Selected Areas in Communications, 2022, 40(4): 1302-1316.
11 Sunehag P, Lever G, Gruslys A, et al. Value decomposition networks for cooperative multi-agent learning based on team reward[C]∥Proceedings of the 17th International Conference on Autonomous Agents and Multi-agent Systems, Stockholm, Sweden, 2018: 2085-2087.
12 Rashid T, Samvelyan M, Schroeder C, et al. QMIX: monotonic value function factorization for deep multi-agent reinforcement learning[C]∥Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 2018: 4295-4304.
13 Son K, Kim D, Kang W J, et al. QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning[C]∥Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, 2019: 5887-5896.
14 Shi Wei, Feng Yang-he, Cheng Guang-quan, et al. Research on multi-aircraft cooperative air combat method based on deep reinforcement learning[J]. Acta Automatica Sinica, 2021, 47(7): 1610-1623.
15 Wan K, Wu D, Li B, et al. ME-MADDPG: an efficient learning based motion planning method for multiple agents in complex environments[J]. International Journal of Intelligent Systems, 2022, 37(3): 2393-2427.
16 Wang Nai-yu, Ye Yu-xin, Liu Lu, et al. Language models based on deep learning: a review[J]. Journal of Software, 2021, 32(4): 1082-1115.
17 Pan Z Y, Zhang Z Z, Chen Z X. Asynchronous value iteration network[C]∥Proceedings of the 25th International Conference on Neural Information Processing, Red Hook, USA, 2018: 169-180.