We introduce a reinforcement learning architecture designed for problems with an infinite number of states, where each state is represented as a vector of real numbers, and a finite number of actions, where each action requires a vector of real numbers as parameters. The main objective of this architecture is to distribute the work required to learn the final policy between two actors: one actor decides which action must be performed, while a second actor determines the right parameters for the selected action. We tested the architecture, and one algorithm based on it, on the robot dribbling problem, a challenging robot control task taken from the RoboCup competitions. Our experimental work with three different function approximators provides strong evidence that the proposed architecture can be used to implement fast, robust, and reliable reinforcement learning algorithms.

1. Introduction

Applying reinforcement learning (RL) to real-world robotic problems is still uncommon, mainly because most RL methods require many training episodes to learn an optimal policy. This means having a robot perform a task several thousand times while it learns through reinforcement learning. In addition to the time required for the training process, we must also consider the time spent calibrating sensors and actuators and the possible damage the robots may suffer. Therefore, a common approach is to first tackle difficult problems with continuous states and actions in simulated environments, where even the noise of real sensors and actuators can be modeled.

In this paper we propose a novel RL architecture for continuous state and action spaces. The architecture was tested on a difficult control problem in the official simulator of the RoboCup [1]. The Robot World Cup, or RoboCup for short, is an international tournament that has taken place every year since 1997, each year in a different country. To date, RoboCup is regarded as a standard and challenging benchmark for artificial intelligence and robotics. The most important goal of RoboCup is to advance the overall technological level of society and, as a more pragmatic goal, to achieve the following: by the middle of the twenty-first century, a team of fully autonomous humanoid robot soccer players shall win a soccer game, complying with the official rules of FIFA, against the winner of the most recent World Cup. One of the competitions in this tournament is the simulation league. In this category, two teams of eleven virtual soccer players each play for ten minutes.
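To make the two-actor decomposition described above concrete, the following Python sketch shows one way it could be organized: a discrete actor selects which action to execute, and a separate continuous actor supplies the real-valued parameters for that action. This is an illustrative sketch only, not the algorithm evaluated in this paper; the class names, the linear softmax action selector, the Gaussian parameter exploration, and the omission of any learning updates are all simplifying assumptions.

import numpy as np

class DiscreteActor:
    """Chooses one of a finite set of actions via a softmax over linear scores.
    (Illustrative sketch; not the authors' implementation.)"""
    def __init__(self, state_dim, n_actions, temperature=1.0):
        self.weights = np.zeros((n_actions, state_dim))
        self.temperature = temperature

    def select_action(self, state):
        scores = self.weights @ state / self.temperature
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)

class ParameterActor:
    """Produces a real-valued parameter vector for the chosen action."""
    def __init__(self, state_dim, param_dims, exploration_std=0.1):
        # One linear map per discrete action, from states to that action's parameters.
        self.maps = {a: np.zeros((d, state_dim)) for a, d in param_dims.items()}
        self.exploration_std = exploration_std

    def select_parameters(self, action, state):
        mean = self.maps[action] @ state
        # Gaussian exploration around the current deterministic parameters.
        return mean + self.exploration_std * np.random.randn(mean.shape[0])

# Hypothetical soccer-like setting: a 4-dimensional state, where
# action 0 ("dash") takes 1 parameter and action 1 ("kick") takes 2.
state = np.array([0.3, -1.2, 0.8, 0.0])
discrete_actor = DiscreteActor(state_dim=4, n_actions=2)
parameter_actor = ParameterActor(state_dim=4, param_dims={0: 1, 1: 2})

action = discrete_actor.select_action(state)
params = parameter_actor.select_parameters(action, state)
print(action, params)

In the full architecture, both actors would of course be trained rather than fixed, for instance with the function approximators compared in the experiments reported later in the paper.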
References
[1] RoboCup, “The goals of RoboCup. RoboCup Federation,” 2001, http://www.robocup.org/about-robocup/objective/.
[2] M. Gollin, Implementation einer Bibliothek für Reinforcement Learning und Anwendung in der RoboCup Simulationsliga [Implementation of a library for reinforcement learning and application in the RoboCup simulation league] [M.S. thesis], Humboldt University of Berlin, Berlin, Germany, 2005.
[3] M. Riedmiller and T. Gabel, “On experiences in a complex and competitive gaming domain: reinforcement learning meets RoboCup,” in Proceedings of the 3rd IEEE Symposium on Computational Intelligence and Games (CIG '07), pp. 17–23, April 2007.
[4] A. Cherubini, F. Giannone, L. Iocchi, M. Lombardo, and G. Oriolo, “Policy gradient learning for a humanoid soccer robot,” Robotics and Autonomous Systems, vol. 57, no. 8, pp. 808–818, 2009.
[5] J. Leng and C. P. Lim, “Reinforcement learning of competitive and cooperative skills in soccer agents,” Applied Soft Computing Journal, vol. 11, no. 1, pp. 1353–1362, 2011.
[6] P. Stone, G. Kuhlmann, M. E. Taylor, and Y. Liu, “Keepaway soccer: from machine learning testbed to benchmark,” in RoboCup-2005: Robot Soccer World Cup IX, I. Noda, A. Jacoff, A. Bredenfeld, and Y. Takahashi, Eds., pp. 93–105, Springer, Berlin, Germany, 2006.
[7] H. Montazeri, S. Moradi, and R. Safabakhsh, “Continuous state/action reinforcement learning: a growing self-organizing map approach,” Neurocomputing, vol. 74, no. 7, pp. 1069–1082, 2011.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, Mass, USA, 1998.
[9] R. H. Crites and A. G. Barto, “An actor/critic algorithm that is equivalent to Q-learning,” in Advances in Neural Information Processing Systems, 1995.
[10] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: a survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[11] V. Heidrich-Meisner, M. Lauer, C. Igel, and M. Riedmiller, “Reinforcement learning in a nutshell,” in Proceedings of the 15th European Symposium on Artificial Neural Networks, 2007.
[12] R. A. Howard, Dynamic Programming and Markov Processes, The MIT Press, Cambridge, Mass, USA, 1960.
[13] S. Ross, Introduction to Stochastic Dynamic Programming, Academic Press, New York, NY, USA, 1983.
[14] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, Mass, USA, 1996.
[15] M. Puterman, Markov Decision Processes, John Wiley & Sons, New York, NY, USA, 1994.
[16] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 1-2, pp. 41–77, 2003.
[17] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[18] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári, “Convergence results for single-step on-policy reinforcement-learning algorithms,” Machine Learning, vol. 38, no. 3, pp. 287–308, 2000.
[19] V. Uc-Cetina, “Multilayer perceptrons with radial basis functions as value functions in reinforcement learning,” in Proceedings of the European Symposium on Artificial Neural Networks, 2008.