Pure Mathematics 2025

Conservative Policy Gradient and Policy Improvement

DOI: 10.12677/pm.2025.152062, PP. 218-226

黄儒泽

Keywords: Two-Player Non-Cooperative Markov Game, Conservative Policy, Policy Gradient, Policy Improvement

Abstract:

This paper introduces a policy metric for the two-player non-cooperative Markov game model and uses it to generalize conservative policies to the two-agent setting. It then gives a conservative policy gradient and conditions for policy improvement. These results provide a foundation and a direction for improvement in the search for Nash equilibria under conservative policies in two-player non-cooperative games.
