Journal of Jilin University (Information Science Edition) ›› 2021, Vol. 39 ›› Issue (2): 192-199.


SAC Reinforcement Learning Algorithm Based on Prioritized Experience Replay

  

  1. School of Electrical Engineering and Information, Northeast Petroleum University, Daqing 163318, Heilongjiang, China
  • Received: 2020-06-27 Online: 2021-04-19 Published: 2021-04-27
  • About the author: LIU Qingqiang (1977—), male, born in Daqing, Heilongjiang; associate professor and master's supervisor at Northeast Petroleum University; his research interests include information security, intelligent control, signal processing, and fault diagnosis. (Tel) 86-18245922022 (E-mail) petroboy@163.com
  • Supported by:
    National Science and Technology Major Project of China (2017ZX05019-005); Natural Science Foundation of Heilongjiang Province (LH2019F004)

Soft Actor Critic Reinforcement Learning with Prioritized Experience Replay

  1. School of Electrical Engineering and Information, Northeast Petroleum University, Daqing 163318, China
  • Received: 2020-06-27 Online: 2021-04-19 Published: 2021-04-27

Abstract: In the SAC (Soft Actor Critic) algorithm, all samples in the replay buffer are drawn with equal probability, which slows training and makes the training process unstable. To address this drawback, the PER (Prioritized Experience Replay)-SAC algorithm is proposed. By introducing prioritized experience sampling into SAC, the networks preferentially train on samples with a large value-estimation error or poor policy performance, which improves the stability and convergence speed of agent training. Experimental results show that, across multiple environments and optimization algorithms, PER-SAC clearly outperforms SAC in both training speed and stability.

Keywords: deep reinforcement learning, Actor-Critic methods, maximum entropy, prioritized experience sampling

Abstract: SAC (Soft Actor Critic) offers good robustness, strong exploration ability, and good agent generalization, but it suffers from slow training and an unstable training process. To solve this problem, we propose the PER (Prioritized Experience Replay)-SAC algorithm, which introduces prioritized experience sampling into SAC. The networks thus train preferentially on samples with a large value-estimation error or poor policy performance, which improves the stability and convergence speed of agent training. The experimental results show that PER-SAC outperforms SAC in both training speed and stability.
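
The abstract does not include code; the following is a minimal, purely illustrative sketch of the sampling scheme it describes, showing how a proportional prioritized replay buffer could be attached to a SAC-style agent. All names and hyperparameters (PrioritizedReplayBuffer, alpha, beta, eps) are assumptions for illustration and are not taken from the paper.

# Illustrative sketch (not the authors' code) of proportional prioritized
# experience replay for a SAC-style agent.
import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity          # maximum number of stored transitions
        self.alpha = alpha                # how strongly priorities skew sampling
        self.eps = eps                    # keeps every priority strictly positive
        self.data = []                    # stored (s, a, r, s', done) tuples
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0                      # next write position (ring buffer)

    def add(self, transition):
        # New transitions get the current maximum priority so they are
        # guaranteed to be sampled at least once.
        max_prio = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        # Sampling probability is proportional to priority^alpha.
        n = len(self.data)
        probs = self.priorities[:n] ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(n, batch_size, p=probs)
        batch = [self.data[i] for i in idx]
        # Importance-sampling weights correct the bias introduced by
        # non-uniform sampling; normalized so the largest weight is 1.
        weights = (n * probs[idx]) ** (-beta)
        weights /= weights.max()
        return batch, idx, weights

    def update_priorities(self, idx, td_errors):
        # Priorities are set from the critics' TD errors, so transitions the
        # value estimate handles poorly are replayed more often.
        self.priorities[idx] = np.abs(td_errors) + self.eps

In a training loop built on this sketch, the TD errors computed for the sampled batch during the critic update would be passed back through update_priorities, and the returned weights would scale each sample's critic loss.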

Key words: deep reinforcement learning, Actor-Critic, maximum entropy, prioritized experience replay

CLC number: 

  • TP273