Journal of Jilin University (Information Science Edition) ›› 2023, Vol. 41 ›› Issue (3): 437-443.


Proximal Policy Optimization Algorithm Based on Correntropy Induced Metric

ZHANG Huizhen, WANG Qiang   

  1. School of Electrical and Information Engineering, Northeast Petroleum University, Daqing 163318, China
  • Received: 2022-05-14  Online: 2023-06-08  Published: 2023-06-14
  • About the author: ZHANG Huizhen (1979— ), female, born in Tianjin, is an associate professor and master's supervisor at Northeast Petroleum University; her main research interest is robust control of complex systems. (Tel) 86-454-6504062, (E-mail) zhuizhen2002@126.com.
  • Supported by:
    Natural Science Foundation of Heilongjiang Province (F2018004)




Abstract: In deep reinforcement learning, the PPO (Proximal Policy Optimization) algorithm performs very well on many experimental tasks. However, KL-PPO, the variant with an adaptive KL (Kullback-Leibler) divergence penalty, suffers reduced policy-update efficiency because the KL divergence is asymmetric. To remove this negative effect, a proximal policy optimization algorithm based on the CIM (Correntropy Induced Metric), CIM-PPO, is proposed. Because the CIM is symmetric, it better characterizes the difference between the old and new policies, allows more accurate policy updates, and thus mitigates the impact of the asymmetry. Experiments on OpenAI Gym show that, compared with the mainstream proximal policy optimization algorithms Clip-PPO and KL-PPO, the proposed algorithm obtains rewards more than 50% higher, converges about 500 to 1 100 episodes faster across different environments, and also exhibits good robustness.
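For readers unfamiliar with the correntropy induced metric, the sketch below illustrates the kind of quantity the abstract refers to: a Gaussian-kernel CIM between samples from the old and new policies, used as a symmetric penalty in a PPO-style surrogate objective. The function names, the choice of comparing per-action probabilities, and the fixed kernel width sigma are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def gaussian_kernel(u, sigma=1.0):
        # kappa_sigma(u) = exp(-u^2 / (2 * sigma^2)); maximal at u = 0
        return np.exp(-u ** 2 / (2.0 * sigma ** 2))

    def cim(x, y, sigma=1.0):
        # Correntropy induced metric between sample vectors x and y:
        # CIM(x, y) = sqrt(kappa(0) - mean_i kappa_sigma(x_i - y_i)).
        # Symmetric in its arguments (CIM(x, y) == CIM(y, x)), unlike the KL divergence.
        v = np.mean(gaussian_kernel(np.asarray(x) - np.asarray(y), sigma))
        return np.sqrt(gaussian_kernel(0.0, sigma) - v)

    def cim_penalized_surrogate(ratio, advantage, pi_new, pi_old, beta=1.0, sigma=1.0):
        # PPO-style penalized surrogate with the adaptive KL penalty replaced by a
        # CIM penalty between new and old action probabilities (a sketch only).
        return np.mean(ratio * advantage) - beta * cim(pi_new, pi_old, sigma)

For example, cim(p, q) returns the same value as cim(q, p) for any two probability vectors p and q, whereas the KL divergence generally changes when the order of its arguments is swapped; this symmetry is the property the abstract relies on.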

Key words: Kullback-Leibler (KL) divergence; proximal policy optimization (PPO); correntropy induced metric (CIM); surrogate objective; deep reinforcement learning

CLC number: 

  • TP273