Journal of Jilin University (Engineering and Technology Edition) ›› 2022, Vol. 52 ›› Issue (3): 666-674. doi: 10.13229/j.cnki.jdxbgxb20200842

• Computer Science and Technology •

Deep reinforcement learning model for text games

Yong LIU, Lei XU, Chu-han ZHANG

  1. School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
  • Received: 2020-11-03  Online: 2022-03-01  Published: 2022-03-08
  • About the author: Yong LIU (1975-), male, associate professor, Ph.D. Research interests: data mining and machine learning. E-mail: liuyong123456@hlju.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61972135); Natural Science Foundation of Heilongjiang Province (LH2020F043)

Abstract:

To improve the performance of agents in text games, a deep reinforcement learning model named SADSR, based on a siamese network and the deep successor representation, is proposed. First, natural language processing techniques are used to obtain word embedding vectors, effectively turning the textual observations into numeric vectors. Then, the siamese network extracts feature vectors from the state and action texts; the state feature vector is used to predict the immediate reward, and the joint state-action vector is used to predict the successor representation. Finally, the action value is obtained by applying an interaction function to the successor representation and the weight vector of a specific layer. Experimental results show that the model fits the value function effectively; compared with current mainstream models, SADSR improves the performance of agents in text games by 10%-60%.

Key words: artificial intelligence, text games, deep reinforcement learning, siamese network, deep successor representation

CLC number: TP181
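For context, the abstract's description (reward predicted from the state feature, successor representation predicted from the joint state-action vector, action value obtained by an interaction with a layer's weight vector) parallels the standard deep successor representation formulation of Kulkarni et al. (ref. 9), in which the "specific layer" is the reward-regression layer and the interaction function is an inner product. SADSR's exact choices are not spelled out here, so the relations below are only a reference sketch:

\begin{aligned}
r(s_t) &\approx \phi(s_t)^{\top} w \\
\psi(s_t, a_t) &= \mathbb{E}\!\left[ \textstyle\sum_{k=0}^{\infty} \gamma^{k}\, \phi(s_{t+k}) \;\middle|\; s_t, a_t \right] \\
Q(s_t, a_t) &\approx \psi(s_t, a_t)^{\top} w
\end{aligned}

where \phi(s_t) is the state feature from the siamese encoder, \psi(s_t, a_t) is the successor representation predicted from the joint state-action vector, \gamma is the discount factor, and w is the weight vector of the reward-prediction layer.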

Fig.1  SADSR model structure
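As a companion to the structure in Fig.1, the following is a minimal PyTorch sketch of an SADSR-style network. The layer sizes, the mean-pooled sentence encoding, and the dot-product interaction function are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn

class SADSRNet(nn.Module):
    """Sketch of an SADSR-style network: a shared (siamese) text encoder produces
    state and action features, a reward head predicts the immediate reward from
    the state feature, an SR head predicts the successor representation from the
    joint state-action vector, and Q is the dot product of the SR with the
    reward head's weight vector."""

    def __init__(self, vocab_size, embed_dim=50, feat_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)              # word embeddings (e.g. GloVe-initialized)
        self.encoder = nn.Sequential(nn.Linear(embed_dim, feat_dim),  # siamese branch: the same weights
                                     nn.ReLU())                       # encode both state and action text
        self.reward_head = nn.Linear(feat_dim, 1, bias=False)         # r(s) ~ phi(s)^T w
        self.sr_head = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim),
                                     nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))   # psi(s, a) from the joint vector

    def forward(self, state_tokens, action_tokens):
        # Mean-pooled word embeddings stand in for the paper's sentence encoder.
        phi_s = self.encoder(self.embed(state_tokens).mean(dim=1))
        phi_a = self.encoder(self.embed(action_tokens).mean(dim=1))
        reward_pred = self.reward_head(phi_s).squeeze(-1)             # immediate-reward prediction
        psi = self.sr_head(torch.cat([phi_s, phi_a], dim=-1))         # successor representation
        w = self.reward_head.weight.squeeze(0)                        # weight vector of the reward layer
        q_value = psi @ w                                             # interaction function: inner product
        return q_value, reward_pred, psi

At decision time the agent would score every admissible action string against the current state with this Q estimate and act greedily (or epsilon-greedily) over those scores.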

Fig.2  Text representation
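Fig.2 concerns turning game text into numeric vectors. A minimal illustration of the idea, averaging pretrained GloVe-style word embeddings (ref. 7), is given below; the paper's actual tokenization and encoder may differ.

import numpy as np

def embed_text(text, word_vectors, dim=50):
    """Map game text to a fixed-size vector by averaging the pretrained
    embeddings of the words that appear in the vocabulary."""
    tokens = text.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:                          # no known words: return a zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)

# `word_vectors` would normally be loaded from a pretrained GloVe file;
# the tiny vocabulary below is only a placeholder to make the call run.
word_vectors = {"dark": np.ones(50), "room": np.zeros(50)}
state_vec = embed_text("You are standing in a dark room.", word_vectors)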

Fig.3  Sum tree
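Fig.3 points to prioritized experience replay (Schaul et al., ref. 12), in which transition priorities are stored in a sum tree so that sampling proportional to priority costs O(log n). The array-based sketch below is a generic implementation of that data structure, not code from the paper.

import random

class SumTree:
    """Array-backed sum tree: leaves hold transition priorities, internal nodes
    hold the sum of their children, so proportional sampling runs in O(log n)."""

    def __init__(self, capacity):
        self.capacity = capacity                 # number of leaves (stored transitions)
        self.tree = [0.0] * (2 * capacity)       # tree[1] is the root; leaves start at index `capacity`

    def update(self, index, priority):
        """Set the priority of leaf `index` and refresh all ancestor sums."""
        i = index + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self):
        """Draw a leaf index with probability proportional to its priority."""
        target = random.uniform(0.0, self.tree[1])
        i = 1
        while i < self.capacity:                 # descend until a leaf is reached
            left = 2 * i
            if target <= self.tree[left]:
                i = left
            else:
                target -= self.tree[left]
                i = left + 1
        return i - self.capacity

In a prioritized replay buffer, update() would be called with the (clipped) TD error of a stored transition, and sample() returns the index of the transition to replay next.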

Table 1  Statistics and experimental results of the Pyfiction games

Category       Item                   Saving John       Machine of Death
Statistics     Vocab size             ≥1119             ≥2055
               Action vocab size      ≥168              ≥399
               State transitions      Deterministic     Stochastic
               Endings                ≥5                ≥15
               Optimal final score    ≥19.4             ≥28.6
Final reward   RANDOM                 -11.12 (±0.52)    -7.26 (±12.61)
               DRRN                   1.84 (±16.56)     11.50 (±19.05)
               SSAQN                  19.40 (±0.00)     12.20 (±13.04)
               SADSR                  19.40 (±0.00)     20.06 (±4.70)

Fig.4  Learning process of the agent on the Pyfiction platform

Table 2  Statistics and experimental results of the Jericho games

Category       Item          inhumane      ludicorp      pentari
Statistics     Templates     141           187           155
               Words         409           503           472
               Max score     300           150           70
Final reward   RANDOM        0             13.2          0
               NAIL          0.6           8.4           0
               TDQN          0.7           6             17.4
               DRRN          0             13.8          27.2
               SADSR         18 (±4.47)    16.8 (±0.45)  30 (±0.0)

Fig.5  Learning process of the agent on the Jericho platform

Fig.6  Learning process of the agent in the game inhumane

1 Zhao Ya-hui, Yang Fei-yang, Zhang Zhen-guo, et al. Korean text structure discovery based on reinforcement learning and attention mechanism[J]. Journal of Jilin University (Engineering and Technology Edition), 2021, 51(4): 1387-1395.
2 Coskun-Setirek A, Mardikyan S. Understanding the adoption of voice activated personal assistants[J]. International Journal of E-Services and Mobile Applications, 2017, 9(3): 1-21.
3 Green S, Wang S, Chuang J, et al. Human effort and machine learnability in computer aided translation[C]∥Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 1225-1236.
4 Pietquin O, Renals S. ASR system modeling for automatic evaluation and optimization of dialogue systems[C]∥Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, Florida, USA, 2002: 45-48.
5 Dolgov D, Thrun S. Detection of principal directions in unknown environments for autonomous navigation[C]∥Robotics: Science and Systems, Zurich, Switzerland, 2009: 73-80.
6 Sutton R S, Barto A G. Reinforcement Learning: An Introduction[M]. Cambridge, MA, USA: MIT Press, 1998.
7 Pennington J, Socher R, Manning C D. GloVe: global vectors for word representation[C]∥Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 1532-1543.
8 Gershman S, Moore C D, Todd M T, et al. The successor representation and temporal context[J]. Neural Computation, 2012, 24(6): 1553-1568.
9 Kulkarni T D, Saeedi A, Gautam S, et al. Deep successor reinforcement learning[EB/OL].[2020-11-02].
10 Cho K, Merrienboer B V, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]∥Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014: 1724-1734.
11 Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with deep reinforcement learning[J/OL]. [2020-11-02].
12 Schaul T, Quan J, Antonoglou I, et al. Prioritized experience replay[J/OL]. [2020-11-03].
13 He J, Chen J S, He X D, et al. Deep reinforcement learning with a natural language action space[C]∥Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 2016: 1621-1630.
14 Zelinka M. Baselines for reinforcement learning in text games[C]∥IEEE 30th International Conference on Tools with Artificial Intelligence, Volos, Greece, 2018: 320-327.
15 Hausknecht M J, Ammanabrolu P, Côté M-A, et al. Interactive fiction games: a colossal adventure[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5): 7903-7910.
16 Brockman G, Cheung V, Pettersson L, et al. OpenAI Gym[J/OL]. [2020-11-08].
17 Hausknecht M J, Loynd R, Yang G, et al. NAIL: a general interactive fiction agent[J/OL]. [2020-11-08].
18 Narasimhan K, Kulkarni T D, Barzilay R, et al. Language understanding for text-based games using deep reinforcement learning[C]∥Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015: 1-11.