In this paper, we propose a new framework, Compressive Kernelized Reinforcement Learning (CKRL), for computing near-optimal policies in sequential decision making under uncertainty by combining non-adaptive, data-independent Random Projections with nonparametric Kernelized Least-Squares Policy Iteration (KLSPI). Random Projections are a fast, non-adaptive dimensionality reduction technique in which high-dimensional data are projected onto a random lower-dimensional subspace via spherically random rotation and coordinate sampling. KLSPI introduces the kernel trick into the LSPI framework for Reinforcement Learning, often achieving faster convergence and providing automatic feature selection through various kernel sparsification approaches. In our approach, policies are computed in a low-dimensional subspace generated by projecting the high-dimensional features onto a set of random bases. We first show how Random Projections constitute an efficient sparsification technique and how our method often converges faster than regular LSPI while incurring lower computational cost. The theoretical foundation underlying this approach is a fast approximation of the Singular Value Decomposition (SVD). Finally, simulation results on benchmark MDP domains confirm gains in both computation time and performance in large feature spaces.
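As a minimal illustration of the projection step described above (not the paper's implementation), the following sketch maps a high-dimensional feature matrix onto a random lower-dimensional subspace with a data-independent Gaussian projection; the dimensions and feature matrix are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

D, d, n = 1000, 50, 200            # original dimension, projected dimension, number of samples
Phi = rng.standard_normal((n, D))  # stand-in for high-dimensional state features

# Non-adaptive, data-independent projection matrix; the 1/sqrt(d) scaling
# approximately preserves pairwise distances (Johnson-Lindenstrauss style).
A = rng.standard_normal((D, d)) / np.sqrt(d)
Phi_low = Phi @ A                  # low-dimensional features used in place of Phi

# Distances are roughly preserved, so least-squares policy evaluation can be
# carried out in the d-dimensional subspace at much lower computational cost.
i, j = 0, 1
print(np.linalg.norm(Phi[i] - Phi[j]), np.linalg.norm(Phi_low[i] - Phi_low[j]))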