Q-learning is a popular temporal-difference reinforcement learning algorithm that, in its basic form, explicitly stores action values (Q-values) in a lookup table. This tabular implementation has been proven to converge to the optimal solution, but it is often beneficial to use a function-approximation system, such as a deep neural network, to estimate these values instead. It has previously been observed that Q-learning can be unstable when using value function approximation or when operating in a stochastic environment, and this instability can adversely affect the algorithm’s ability to maximize its returns. In this paper, we present a new algorithm, Multi Q-learning, that attempts to overcome the instability seen in Q-learning. We test our algorithm on a 4 × 4 grid-world with different stochastic reward functions, using various deep neural networks and convolutional networks. Our results show that, in most cases, Multi Q-learning outperforms Q-learning, achieving average returns up to 2.5 times higher than Q-learning and producing value estimates with a standard deviation as low as 0.58.
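The baseline algorithm referred to above maintains a lookup table of action values and updates them with the temporal-difference rule Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) − Q(s,a)]. The following is a minimal sketch of that tabular baseline on a 4 × 4 grid-world; the grid layout, goal placement, reward function, and hyperparameters are illustrative assumptions rather than the paper’s experimental setup, and the sketch shows plain Q-learning, not the Multi Q-learning variant introduced here.

# Minimal sketch of tabular Q-learning on a 4x4 grid-world.
# Illustration only: the goal cell, +1 terminal reward, and all
# hyperparameters below are assumptions, not the paper's setup.
import random

N = 4                                          # grid is N x N
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
GOAL = (N - 1, N - 1)                          # assumed terminal state
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1          # assumed hyperparameters

# Lookup table of action values, keyed by (state, action_index)
Q = {((r, c), a): 0.0
     for r in range(N) for c in range(N)
     for a in range(len(ACTIONS))}

def step(state, a):
    """Apply action a; bumping into a wall leaves the state unchanged."""
    dr, dc = ACTIONS[a]
    nr, nc = state[0] + dr, state[1] + dc
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = state
    reward = 1.0 if (nr, nc) == GOAL else 0.0
    return (nr, nc), reward

def greedy(state):
    return max(range(len(ACTIONS)), key=lambda a: Q[(state, a)])

for episode in range(500):
    s, steps = (0, 0), 0
    while s != GOAL and steps < 10_000:        # cap bounds early random walks
        # epsilon-greedy behavior policy
        if random.random() < EPSILON:
            a = random.randrange(len(ACTIONS))
        else:
            a = greedy(s)
        s2, r = step(s, a)
        # Q-learning update: bootstrap from the max action value at s2
        target = r + GAMMA * max(Q[(s2, b)] for b in range(len(ACTIONS)))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s, steps = s2, steps + 1

Replacing the table Q with a parameterized network trained toward the same temporal-difference target gives the function-approximation setting whose instability motivates this work.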