
Multi-agent Following Learning in Continuous Spaces Based on Gaussian Regression

DOI: 10.3724/SP.J.1004.2013.02021, PP. 2021-2031

Keywords: continuous state space, multi-agent systems, model-based reinforcement learning, Gaussian regression


Abstract:

Improving adaptability, generalizing over continuous spaces, and reducing dimensionality are key requirements for applying multi-agent reinforcement learning (MARL) to continuous systems. To address these requirements, this paper proposes a model-based following-learning mechanism and algorithm for agents in continuous multi-agent system (MAS) environments (MAS MBRL-CPT). Taking the learning agent's adaptation to its teammates' policies as the starting point, an individual expected immediate reward is defined so that the agent's observations of its teammates' policies are folded into the effect of its interaction with the environment, and stochastic approximation is used to learn this individual expected immediate reward online. A dimension-reduced Q-function is defined, which lowers the dimensionality of the learning space while establishing the Markov decision process (MDP) for following-style learning in the MAS environment. With a state-transition probability model built by Gaussian regression, the Q-value function over a generalized sample set is solved by online dynamic programming. Based on the Q-function over the discrete sample set, Gaussian regression is then used to build generalization models of the value function and the policy. Simulation experiments of MAS MBRL-CPT on a continuous-space multi-cart-pole control system show that the algorithm enables the learning agent to learn adaptive cooperative policies when both the system dynamics and the teammates' policies are unknown, and that it achieves high learning efficiency and strong generalization ability.
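To make the model-based workflow in the abstract concrete, the following Python sketch illustrates the general idea of fitting a Gaussian-regression transition model from sampled transitions and then running dynamic programming over a discrete sample set of states. It is an illustrative sketch only, not the authors' MAS MBRL-CPT implementation; the function names, the RBF kernel, the discount factor, the nearest-sample lookup, and the reward function are all assumptions introduced here for illustration.

```python
# Illustrative sketch only -- not the paper's MAS MBRL-CPT algorithm.
# Assumes a 1-D state, a small discrete action set, and a batch of observed
# (state, action, next_state) transitions; all names are hypothetical.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_transition_model(states, actions, next_states):
    """Gaussian regression of the state-transition function s' = f(s, a)."""
    X = np.column_stack([states, actions])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-4)
    gp.fit(X, next_states)
    return gp

def value_iteration_on_samples(gp, sample_states, action_set, reward_fn,
                               gamma=0.95, sweeps=50):
    """Dynamic programming over a discrete sample set of states, using the
    learned GP model to predict successor states for each (s, a) pair."""
    Q = np.zeros((len(sample_states), len(action_set)))
    for _ in range(sweeps):
        for i, s in enumerate(sample_states):
            for j, a in enumerate(action_set):
                s_next = gp.predict(np.array([[s, a]]))[0]
                # Nearest sample state stands in for the continuous successor.
                k = int(np.argmin(np.abs(sample_states - s_next)))
                Q[i, j] = reward_fn(s, a) + gamma * Q[k].max()
    return Q
```

In the paper, a second Gaussian regression then generalizes the sample-set Q values to value-function and policy models over the full continuous space; the scalar state and nearest-neighbour backup above are simplifications made only to keep the sketch short.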

