PLOS ONE  2009 

Reinforcement Learning or Active Inference?

DOI: 10.1371/journal.pone.0006421


Abstract:

This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.
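For readers unfamiliar with the benchmark, the sketch below simulates the textbook discrete-time mountain-car problem. It is an illustration under stated assumptions, not the paper's method: the dynamics and constants follow the common Sutton and Barto formulation rather than the paper's own version, and the names `step`, `run`, `always_forward` and `pump_energy` are hypothetical helpers introduced here. The hand-written `pump_energy` heuristic is neither active inference nor reinforcement learning. It only shows why the task is hard: the engine is too weak to climb directly, so the car must first move away from the goal to gain momentum.

```python
import math

# Textbook mountain-car constants (assumed; not taken from the paper).
MIN_POS, MAX_POS = -1.2, 0.5   # left wall and goal position
MAX_SPEED = 0.07               # velocity is clipped to [-MAX_SPEED, MAX_SPEED]


def step(position, velocity, action):
    """Advance one time step; action is -1 (reverse), 0 (coast) or +1 (forward)."""
    # Engine force plus the gravity term that depends on the slope of the hill.
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    if position <= MIN_POS:
        # Inelastic collision with the left wall.
        position, velocity = MIN_POS, 0.0
    position = min(position, MAX_POS)
    return position, velocity


def run(policy, max_steps=1000):
    """Return the number of steps taken to reach the goal, or None on failure."""
    position, velocity = -0.5, 0.0   # start at rest near the valley floor
    for t in range(max_steps):
        if position >= MAX_POS:
            return t
        position, velocity = step(position, velocity, policy(position, velocity))
    return None


def always_forward(position, velocity):
    """Naive policy: full throttle towards the goal at every step."""
    return 1


def pump_energy(position, velocity):
    """Heuristic policy: push in the direction of motion to build up momentum."""
    return 1 if velocity >= 0 else -1


if __name__ == "__main__":
    print("always_forward reaches goal:", run(always_forward) is not None)
    print("pump_energy reaches goal:", run(pump_energy) is not None)
```

Under these assumed dynamics, constant forward thrust leaves the car oscillating in the valley, whereas the momentum-building heuristic reaches the goal. A successful policy, however it is obtained, has to discover this indirect route.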

