Learning Control of Dynamic Systems Based on Markov Decision Process Models: Research Frontier and Prospects

DOI: 10.3724/SP.J.1004.2012.00673, PP. 673-687

Keywords: learning control, Markov decision processes, reinforcement learning, approximate dynamic programming, machine learning, adaptive control


Abstract:

Learning control of dynamic systems based on Markov decision processes (MDPs) has in recent years become an interdisciplinary research direction involving machine learning, control theory, and operations research. Its main goal is to achieve data-driven, multi-stage optimal control of systems whose models are complex or uncertain. This paper surveys the research frontier of MDP-based learning control theory, algorithms, and applications, with emphasis on advances in reinforcement learning (RL) and approximate dynamic programming (ADP). Topics covered include temporal difference learning theory, value function approximation methods for solving MDPs with continuous state and action spaces, direct policy search and approximate policy iteration, and adaptive critic design algorithms. Finally, applications and development trends in the related research areas are analyzed and discussed.
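For context, a minimal illustration under conventional RL notation (not a formulation taken from the paper itself): in an MDP with state $s_t$, action $a_t$, immediate reward $r_t$, and discount factor $\gamma \in [0,1)$, the value of a policy $\pi$ and the basic temporal difference (TD(0)) update underlying the methods surveyed here are

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s\right],$$

$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right],$$

where $\alpha > 0$ is a step size. Value function approximation, as discussed in the survey, replaces the tabular $V(\cdot)$ with a parameterized function, for example a linear combination of basis features of the state.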

