A Sarsa Reinforcement Learning Algorithm with PID-Control-Based Updates and Its Application
Abstract:
To address the slow convergence and unstable performance of the Sarsa algorithm in reinforcement learning, and motivated by the operational simplicity and high robustness of PID control, we propose a Sarsa algorithm optimized by PID control, called Pid_Sarsa. Its main idea is to rewrite the Q-value update in Sarsa as the sum of three terms corresponding to the proportional, integral, and derivative components of PID control, reflecting the idea of acting on current, past, and future errors and, in principle, improving sample efficiency. To compare Pid_Sarsa with the two traditional algorithms Sarsa and n-step Sarsa (with n = 5), the classic cliff-walking path-planning task is used as a test case. Experiments show that Pid_Sarsa converges faster, performs more stably, and yields paths whose safety score is 2.38% higher than that of Sarsa and 4.76% higher than that of 5-step Sarsa.
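The abstract does not give the exact update rule, only that the Q-value iteration becomes a sum of three PID-style terms. The following is a minimal sketch of one plausible interpretation, assuming the TD error plays the role of the control error: the proportional term is the current TD error, the integral term accumulates past TD errors per state-action pair, and the derivative term is the change in TD error since the last visit. The gains `kp`, `ki`, `kd` and the tables `I` and `prev_delta` are hypothetical names introduced here for illustration.

```python
import numpy as np

def pid_sarsa_update(Q, I, prev_delta, s, a, r, s2, a2,
                     gamma=0.99, kp=0.8, ki=0.05, kd=0.1):
    """One PID-style Sarsa update (hypothetical reconstruction).

    Q          : (n_states, n_actions) action-value table
    I          : running sum of TD errors per (s, a)  -> integral term
    prev_delta : last TD error seen at (s, a)         -> derivative term
    """
    # Proportional: the current TD error, as in plain Sarsa.
    delta = r + gamma * Q[s2, a2] - Q[s, a]
    # Integral: accumulate past TD errors for this state-action pair.
    I[s, a] += delta
    # Derivative: change of the TD error since the previous visit.
    d = delta - prev_delta[s, a]
    # Update is the weighted sum of the three control terms.
    Q[s, a] += kp * delta + ki * I[s, a] + kd * d
    prev_delta[s, a] = delta
    return Q

# Tiny demo on a 2-state, 2-action table.
Q = np.zeros((2, 2))
I = np.zeros((2, 2))
prev = np.zeros((2, 2))
pid_sarsa_update(Q, I, prev, s=0, a=0, r=1.0, s2=1, a2=1)
```

With all tables starting at zero, the first update gives delta = 1, so Q[0, 0] becomes kp + ki + kd = 0.95; setting ki = kd = 0 recovers plain Sarsa with learning rate kp. The paper's actual coefficients and error definitions may differ from this sketch.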