{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,6]],"date-time":"2026-05-06T14:36:41Z","timestamp":1778078201127,"version":"3.51.4"},"reference-count":34,"publisher":"MIT Press","issue":"2","license":[{"start":{"date-parts":[[2021,11,11]],"date-time":"2021-11-11T00:00:00Z","timestamp":1636588800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,1,14]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Reinforcement learning involves updating estimates of the value of states and actions on the basis of experience. Previous work has shown that in humans, reinforcement learning exhibits a confirmatory bias: when the value of a chosen option is being updated, estimates are revised more radically following positive than negative reward prediction errors, but the converse is observed when updating the unchosen option value estimate. Here, we simulate performance on a multi-arm bandit task to examine the consequences of a confirmatory bias for reward harvesting. We report a paradoxical finding: that confirmatory biases allow the agent to maximize reward relative to an unbiased updating rule. This principle holds over a wide range of experimental settings and is most influential when decisions are corrupted by noise. We show that this occurs because on average, confirmatory biases lead to overestimating the value of more valuable bandits and underestimating the value of less valuable bandits, rendering decisions overall more robust in the face of noise. 
Our results show how apparently suboptimal learning rules can in fact be reward maximizing if decisions are made with finite computational precision.<\/jats:p>","DOI":"10.1162\/neco_a_01455","type":"journal-article","created":{"date-parts":[[2021,11,10]],"date-time":"2021-11-10T20:09:42Z","timestamp":1636574982000},"page":"307-337","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":43,"title":["A Normative Account of Confirmation Bias During Reinforcement Learning"],"prefix":"10.1162","volume":"34","author":[{"given":"Germain","family":"Lefebvre","sequence":"first","affiliation":[{"name":"MRC Brain Network Dynamics Unit, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford OX3 9DU, U.K. germain.lefebvre@outlook.com"}]},{"given":"Christopher","family":"Summerfield","sequence":"additional","affiliation":[{"name":"Department of Experimental Psychology, University of Oxford, Oxford OX3 9DU, U.K. christopher.summerfield@psy.ox.ac.uk"}]},{"given":"Rafal","family":"Bogacz","sequence":"additional","affiliation":[{"name":"MRC Brain Network Dynamics Unit, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford OX3 9DU, U.K. rafal.bogacz@ndcn.ox.ac.uk"}]}],"member":"281","published-online":{"date-parts":[[2022,1,14]]},"reference":[{"issue":"4","key":"2022040618224457300_B1","doi-asserted-by":"publisher","first-page":"700","DOI":"10.1037\/0033-295X.113.4.700","article-title":"The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced-choice tasks","volume":"113","author":"Bogacz","year":"2006","journal-title":"Psychol. 
Rev."},{"issue":"6","key":"2022040618224457300_B2","doi-asserted-by":"publisher","first-page":"711","DOI":"10.1007\/s00422-013-0571-5","article-title":"Adaptive properties of differential learning rates for positive and negative outcomes.","volume":"107","author":"Caze","year":"2013","journal-title":"Biol. Cybern."},{"issue":"10","key":"2022040618224457300_B3","doi-asserted-by":"publisher","first-page":"1067","DOI":"10.1038\/s41562-020-0919-5","article-title":"Information about action outcomes differentially affects learning from self-determined versus imposed choices","volume":"4","author":"Chambon","year":"2020","journal-title":"Nature Human Behaviour"},{"issue":"4","key":"2022040618224457300_B4","article-title":"Selective effects of the loss of NMDA or mGluR5 receptors in the reward system on adaptive decision-making.","volume":"5","author":"Cie\u015blak","year":"2018","journal-title":"eNeuro"},{"issue":"3","key":"2022040618224457300_B5","doi-asserted-by":"crossref","DOI":"10.1037\/a0037015","article-title":"Opponent actor learning (OpAL): Modeling interactive effects of striatal dopamine on reinforcement learning and choice incentive","volume":"121","author":"Collins","year":"2014","journal-title":"Psychological Review"},{"issue":"7792","key":"2022040618224457300_B6","doi-asserted-by":"publisher","first-page":"671","DOI":"10.1038\/s41586-019-1924-6","article-title":"A distributional code for value in dopamine-based reinforcement learning","volume":"577","author":"Dabney","year":"2020","journal-title":"Nature"},{"issue":"7095","key":"2022040618224457300_B7","doi-asserted-by":"publisher","first-page":"876","DOI":"10.1038\/nature04766","article-title":"Cortical substrates for exploratory decisions in humans","volume":"441","author":"Daw","year":"2006","journal-title":"Nature"},{"issue":"11","key":"2022040618224457300_B8","doi-asserted-by":"publisher","first-page":"1215","DOI":"10.1038\/s41562-019-0714-3","article-title":"Flexible combination of reward information across 
primates","volume":"3","author":"Farashahi","year":"2019","journal-title":"Nature Human Behaviour"},{"issue":"12","key":"2022040618224457300_B9","doi-asserted-by":"publisher","first-page":"2066","DOI":"10.1038\/s41593-019-0518-9","article-title":"Computational noise in reward-guided learning drives behavioral variability in volatile environments.","volume":"22","author":"Findling","year":"2019","journal-title":"Nat. Neurosci."},{"issue":"5","key":"2022040618224457300_B10","doi-asserted-by":"publisher","first-page":"1320","DOI":"10.3758\/s13423-014-0790-3","volume":"22","author":"Gershman","year":"2015","journal-title":"Psychon. Bull. Rev."},{"key":"2022040618224457300_B11","author":"Groopman","year":"2007","journal-title":"How doctors think"},{"key":"2022040618224457300_B12","author":"Juechems","year":"2020","journal-title":"Optimal utility and probability functions for agents with finite computational precision."},{"key":"2022040618224457300_B13","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1016\/j.jmp.2018.09.002","article-title":"The statistical structures of reinforcement learning with asymmetric value updates.","volume":"87","author":"Katahira","year":"2018","journal-title":"J. Math. 
Psychol."},{"issue":"11","key":"2022040618224457300_B14","doi-asserted-by":"publisher","first-page":"2435","DOI":"10.1287\/mnsc.2013.1720","article-title":"Learning from my success and from others' failure: Evidence from minimally invasive cardiac surgery","volume":"59","author":"Kc","year":"2013","journal-title":"Management Science"},{"issue":"5928","key":"2022040618224457300_B15","doi-asserted-by":"publisher","first-page":"759","DOI":"10.1126\/science.1169405","article-title":"Representation of confidence associated with a decision by neurons in the parietal cortex","volume":"324","author":"Kiani","year":"2009","journal-title":"Science"},{"key":"2022040618224457300_B16","doi-asserted-by":"publisher","DOI":"10.1038\/s41562-017-0067","article-title":"Behavioural and neural characterization of optimistic reinforcement learning.","volume":"1","author":"Lefebvre","year":"2017","journal-title":"Nat. Hum. Behav."},{"issue":"8","key":"2022040618224457300_B17","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1005723","article-title":"Robust averaging protects decisions from noise in neural computations.","volume":"13","author":"Li","year":"2017","journal-title":"PLOS Comput. Biol."},{"issue":"9","key":"2022040618224457300_B18","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1005062","article-title":"Learning reward uncertainty in the basal ganglia","volume":"12","author":"Mikhael","year":"2016","journal-title":"PLOS Computational Biology"},{"issue":"2","key":"2022040618224457300_B19","doi-asserted-by":"publisher","first-page":"292","DOI":"10.1037\/rev0000120","article-title":"Habits without values.","volume":"126","author":"Miller","year":"2019","journal-title":"Psychol. 
Rev."},{"issue":"2","key":"2022040618224457300_B20","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pcbi.1006285","article-title":"Learning the payoffs and costs of actions","volume":"15","author":"M\u00f6ller","year":"2019","journal-title":"PLOS Computational Biology"},{"key":"2022040618224457300_B21","doi-asserted-by":"publisher","first-page":"175","DOI":"10.1037\/1089-2680.2.2.175","article-title":"Confirmation bias: A ubiquitous phenomenon in many guises","volume":"2","author":"Nickerson","year":"1998","journal-title":"Review of General Psychology"},{"issue":"2","key":"2022040618224457300_B22","doi-asserted-by":"publisher","first-page":"551","DOI":"10.1523\/JNEUROSCI.5498-10.2012","article-title":"Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain.","volume":"32","author":"Niv","year":"2012","journal-title":"J. Neurosci."},{"issue":"2","key":"2022040618224457300_B23","doi-asserted-by":"publisher","first-page":"289","DOI":"10.3758\/BF03196492","article-title":"Optimal data selection: revision, review, and reevaluation.","volume":"10","author":"Oaksford","year":"2003","journal-title":"Psychon. Bull. Rev."},{"issue":"8","key":"2022040618224457300_B24","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1005684","article-title":"Confirmation bias in human reinforcement learning: Evidence from counterfactual feedback processing.","volume":"13","author":"Palminteri","year":"2017","journal-title":"PLOS Comput. Biol."},{"issue":"4","key":"2022040618224457300_B25","doi-asserted-by":"publisher","first-page":"1234","DOI":"10.3758\/s13423-016-1199-y","article-title":"The drift diffusion model as the choice rule in reinforcement learning.","volume":"24","author":"Pedersen","year":"2017","journal-title":"Psychon. Bull. 
Rev."},{"key":"2022040618224457300_B26","doi-asserted-by":"publisher","first-page":"211","DOI":"10.1016\/j.conb.2014.02.013","article-title":"Variability in neural activity and behavior.","volume":"25","author":"Renart","year":"2014","journal-title":"Curr. Opin. Neurobiol."},{"key":"2022040618224457300_B27","first-page":"64","volume-title":"Classical conditioning II: Current research and theory","author":"Rescorla","year":"1972"},{"key":"2022040618224457300_B28","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1016\/j.cortex.2019.12.027","article-title":"Decreased transfer of value to action in Tourette syndrome.","volume":"126","author":"Schuller","year":"2020","journal-title":"Cortex"},{"issue":"1","key":"2022040618224457300_B29","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1016\/j.tics.2014.11.005","article-title":"Do humans make good decisions?","volume":"19","author":"Summerfield","year":"2015","journal-title":"Trends Cogn. Sci."},{"issue":"19","key":"2022040618224457300_B30","doi-asserted-by":"publisher","first-page":"3128","DOI":"10.1016\/j.cub.2018.07.052","article-title":"Confirmation bias through selective overweighting of choice-consistent evidence","volume":"28","author":"Talluri","year":"2018","journal-title":"Current Biology"},{"key":"2022040618224457300_B31","author":"Tarantola","year":"2021","journal-title":"Confirmation bias optimizes reward learning"},{"issue":"11","key":"2022040618224457300_B32","doi-asserted-by":"publisher","first-page":"3102","DOI":"10.1073\/pnas.1519157113","article-title":"Economic irrationality is optimal during noisy decision making.","volume":"113","author":"Tsetsos","year":"2016","journal-title":"Proc. Natl. Acad. Sci. 
USA"},{"key":"2022040618224457300_B33","doi-asserted-by":"publisher","DOI":"10.3389\/fpsyg.2013.00640","article-title":"Decomposing the roles of perseveration and expected value representation in models of the Iowa gambling task.","volume":"4","author":"Worthy","year":"2013","journal-title":"Front Psychol."},{"issue":"3","key":"2022040618224457300_B34","doi-asserted-by":"publisher","first-page":"322","DOI":"10.1016\/j.jmp.2010.03.001","article-title":"Bounded Ornstein\u2013Uhlenbeck models for two-choice time controlled tasks.","volume":"54","author":"Zhang","year":"2010","journal-title":"Journal of Mathematical Psychology"}],"container-title":["Neural Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/neco\/article-pdf\/34\/2\/307\/2006833\/neco_a_01455.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/neco\/article-pdf\/34\/2\/307\/2006833\/neco_a_01455.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,4,6]],"date-time":"2022-04-06T14:22:57Z","timestamp":1649254977000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/neco\/article\/34\/2\/307\/107913\/A-Normative-Account-of-Confirmation-Bias-During"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,14]]},"references-count":34,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2022,1,14]]},"published-print":{"date-parts":[[2022,1,14]]}},"URL":"https:\/\/doi.org\/10.1162\/neco_a_01455","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.05.12.090134","asserted-by":"object"}]},"ISSN":["0899-7667","1530-888X"],"issn-type":[{"value":"0899-7667","type":"print"},{"value":"1530-888X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,2]]},"published":{"date-parts":[[2022,1,14]]}}}