Pure Mathematics 2025
Conservative Policy Gradient and Policy Improvement
Abstract:
This paper introduces a policy metric for the two-player non-cooperative Markov game model, extends the conservative policy to the two-agent setting, and gives a conservative policy gradient together with conditions for policy improvement. These results provide a foundation and a direction for improvement in the search for Nash equilibria under conservative policies in two-player non-cooperative games.