Journal of Jilin University (Information Science Edition) ›› 2020, Vol. 38 ›› Issue (4): 474-481.


TD3 Algorithm with Dynamic Delayed Policy Update

KANG Chaohai, SUN Chao, RONG Chuiting, LIU Pengyun

  1. School of Electrical Engineering and Information, Northeast Petroleum University, Daqing 163318, China
  • Received: 2020-01-17  Online: 2020-07-24  Published: 2020-08-13
  • Biography: KANG Chaohai (1976—), male, born in Wangkui, Heilongjiang; associate professor and master's supervisor at Northeast Petroleum University, mainly engaged in research on intelligent algorithms and intelligent control. (Tel) 86-459-6503373, (E-mail) kangchaohai@126.com
  • Funding: Supported by the Natural Science Foundation of Heilongjiang Province (E2018004)


Abstract: In the field of deep reinforcement learning, to further reduce the impact of value overestimation on policy estimation in TD3 (Twin Delayed Deep Deterministic Policy Gradients) and to accelerate model learning, DD-TD3 (Twin Delayed Deep Deterministic Policy Gradients with Dynamic Delayed Policy Update) is proposed. In DD-TD3, the delayed update step size of the actor network is guided by the dynamic difference between the latest loss value of the critic network and its exponentially weighted moving average. Experimental results show that, compared with the original TD3 algorithm, which reaches a high reward value at about 2 000 steps, the DD-TD3 method can learn the optimal control policy within about 1 000 steps and obtains a higher reward value, thereby improving the efficiency of finding the optimal policy.
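
A minimal Python sketch may make the mechanism described above concrete: the latest critic loss is compared with its exponentially weighted moving average, and the deviation steers how long the actor update is delayed. The abstract does not give the authors' exact update rule, so the class name DynamicDelay, the decay rate beta, and the bounds d_min/d_max below are illustrative assumptions, not the paper's implementation.

    class DynamicDelay:
        """Maps the critic loss's deviation from its EWMA to an actor update interval."""

        def __init__(self, beta=0.99, d_min=1, d_max=4, init_delay=2):
            self.beta = beta          # EWMA decay rate (assumed value)
            self.d_min = d_min        # fastest actor update: every step
            self.d_max = d_max        # slowest actor update (assumed bound)
            self.delay = init_delay   # TD3's fixed default delay is 2
            self.ewma = None          # EWMA of the critic loss

        def update(self, critic_loss):
            """Return the current actor update interval given the latest critic loss."""
            if self.ewma is None:                # initialize on the first call
                self.ewma = critic_loss
            diff = critic_loss - self.ewma       # dynamic difference vs. the trend
            self.ewma = self.beta * self.ewma + (1.0 - self.beta) * critic_loss
            if diff > 0:                         # loss above trend: critic unstable,
                self.delay = min(self.delay + 1, self.d_max)   # delay the actor more
            else:                                # loss at/below trend: critic stable,
                self.delay = max(self.delay - 1, self.d_min)   # update the actor sooner
            return self.delay

In a TD3-style training loop this would replace the fixed two-step policy delay: compute delay = scheduler.update(float(critic_loss)) after each critic step, and update the actor and soft-update the target networks only when the step counter is a multiple of delay.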

Key words: deep reinforcement learning, twin delayed deep deterministic policy gradients (TD3), dynamic delayed policy update

CLC number: TP273