{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,15]],"date-time":"2026-04-15T21:44:02Z","timestamp":1776289442332,"version":"3.50.1"},"reference-count":358,"publisher":"Emerald","issue":"3-4","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,12,20]]},"abstract":"<jats:p>Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research has been able to solve a wide range of complex decision making tasks that were previously out of reach for a machine. Thus, deep RL opens up many new applications in domains such as healthcare, robotics, smart grids, finance, and many more. This manuscript provides an introduction to deep reinforcement learning models, algorithms and techniques. Particular focus is on the aspects related to generalization and how deep RL can be used for practical applications. We assume the reader is familiar with basic machine learning concepts.<\/jats:p>","DOI":"10.1561\/2200000071","type":"journal-article","created":{"date-parts":[[2018,12,20]],"date-time":"2018-12-20T11:02:46Z","timestamp":1545303766000},"page":"219-354","source":"Crossref","is-referenced-by-count":991,"title":["An Introduction to Deep Reinforcement Learning"],"prefix":"10.1561","volume":"11","author":[{"given":"Vincent","family":"Fran\u00e7ois-Lavet","sequence":"first","affiliation":[{"name":"McGill University"}]},{"given":"Peter","family":"Henderson","sequence":"additional","affiliation":[{"name":"McGill University"}]},{"given":"Riashat","family":"Islam","sequence":"additional","affiliation":[{"name":"McGill University"}]},{"given":"Marc G.","family":"Bellemare","sequence":"additional","affiliation":[{"name":"Google Brain"}]},{"given":"Joelle","family":"Pineau","sequence":"additional","affiliation":[{"name":"Facebook, McGill 
University"}]}],"member":"140","published-online":{"date-parts":[[2018,12,20]]},"reference":[{"key":"2026033012244512800_ref001","volume-title":"\u201cTensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems\u201d","author":"Abadi,","year":"2016"},{"key":"2026033012244512800_ref002","volume-title":"In: Proceedings of the twenty-first international conference on Machine learning","author":"Abbeel,","year":"2004"},{"issue":"2","key":"2026033012244512800_ref003","doi-asserted-by":"crossref","first-page":"251","DOI":"10.1162\/089976698300017746","article-title":"\u201cNatural Gradient Works Efficiently in Learning\u201d","volume":"10","author":"Amari","year":"1998","journal-title":"Neural Computation"},{"key":"2026033012244512800_ref004","volume-title":"\u201cConcrete problems in AI safety\u201d","author":"Amodei","year":"2016"},{"key":"2026033012244512800_ref005","volume-title":"An introduction to multivariate statistical analysis","author":"Anderson","year":"1958"},{"key":"2026033012244512800_ref006","volume-title":"\u201cPlaying hard exploration games by watching YouTube\u201d","author":"Aytar","year":"2018"},{"key":"2026033012244512800_ref007","volume-title":"\u201cThe option-critic architecture\u201d","author":"Bacon","year":"2016"},{"key":"2026033012244512800_ref008","volume-title":"\u201cAn actor-critic algorithm for sequence prediction\u201d","author":"Bahdanau","year":"2016"},{"key":"2026033012244512800_ref009","first-page":"30","volume-title":"\u201cResidual algorithms: Reinforcement learning with function approximation\u201d","author":"Baird","year":"1995"},{"issue":"7604","key":"2026033012244512800_ref010","doi-asserted-by":"crossref","DOI":"10.1038\/533452a","article-title":"\u201c1,500 scientists lift the lid on reproducibility\u201d","volume":"533","author":"Baker","year":"2016","journal-title":"Nature News"},{"key":"2026033012244512800_ref011","first-page":"463","article-title":"\u201cRademacher and Gaussian complexities: Risk 
bounds and structural results\u201d","volume":"3","author":"Bartlett","year":"2002","journal-title":"Journal of Machine Learning Research"},{"issue":"5","key":"2026033012244512800_ref012","doi-asserted-by":"crossref","first-page":"834","DOI":"10.1109\/TSMC.1983.6313077","article-title":"\u201cNeuronlike adaptive elements that can solve difficult learning control problems\u201d","author":"Barto","year":"1983","journal-title":"IEEE transactions on systems, man, and cybernetics"},{"key":"2026033012244512800_ref013","volume-title":"arXiv:1612.03801","author":"Beattie","year":"2016"},{"key":"2026033012244512800_ref014","volume-title":"\u201cDopamine\u201d","author":"Bellemare","year":"2018"},{"key":"2026033012244512800_ref015","volume-title":"\u201cA distributional perspective on reinforcement learning\u201d","author":"Bellemare","year":"2017"},{"key":"2026033012244512800_ref016","doi-asserted-by":"crossref","first-page":"253","DOI":"10.1613\/jair.3912","article-title":"\u201cThe Arcade Learning Environment: An evaluation platform for general agents\u201d","volume":"47","author":"Bellemare","year":"2013","journal-title":"Journal of Artificial Intelligence Research"},{"key":"2026033012244512800_ref017","volume-title":"\u201cUnifying Count-Based Exploration and Intrinsic Motivation\u201d","author":"Bellemare","year":"2016"},{"key":"2026033012244512800_ref018","first-page":"679","volume-title":"Journal of Mathematics and 
Mechanics","author":"Bellman","year":"1957"},{"key":"2026033012244512800_ref019","author":"Bellman","year":"1957"},{"key":"2026033012244512800_ref020","author":"Bellman","year":"1962"},{"key":"2026033012244512800_ref021","volume-title":"arXiv:1611.09940","author":"Bello","year":"2016"},{"key":"2026033012244512800_ref022","volume-title":"arXiv:1709.08568","author":"Bengio","year":"2017"},{"key":"2026033012244512800_ref023","volume-title":"arXiv:1502.04156","author":"Bengio","year":"2015"},{"key":"2026033012244512800_ref024","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1145\/1553374.1553380","volume-title":"Proceedings of the 26th annual international conference on machine learning","author":"Bengio","year":"2009"},{"key":"2026033012244512800_ref025","first-page":"9","volume-title":"Artificial intelligence in medicine","author":"Bennett","year":"2013"},{"key":"2026033012244512800_ref026","volume-title":"Dynamic programming and optimal control","author":"Bertsekas","year":"1995"},{"key":"2026033012244512800_ref027","volume-title":"arXiv:1604.07316","author":"Bojarski","year":"2016"},{"key":"2026033012244512800_ref028","volume-title":"Superintelligence","author":"Bostrom","year":"2017"},{"key":"2026033012244512800_ref029","first-page":"51","volume-title":"Proceedings of the 20th International Conference on Machine Learning (ICML-03)","author":"Bouckaert","year":"2003"},{"key":"2026033012244512800_ref030","first-page":"3","volume-title":"PAKDD","author":"Bouckaert","year":"2004"},{"key":"2026033012244512800_ref031","first-page":"182","volume-title":"AISTATS","author":"Boularias","year":"2011"},{"key":"2026033012244512800_ref032","first-page":"369","volume-title":"Advances in neural information processing systems","author":"Boyan","year":"1995"},{"key":"2026033012244512800_ref033","first-page":"213","article-title":"\u201cR-max-a general polynomial time algorithm for near-optimal reinforcement 
learning\u201d","volume":"3","author":"Brafman","year":"2003","journal-title":"The Journal of Machine Learning Research"},{"key":"2026033012244512800_ref034","first-page":"126","volume-title":"Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1","author":"Branavan","year":"2012"},{"key":"2026033012244512800_ref035","volume-title":"University of Toronto, Tech. Rep","author":"Braziunas","year":"2003"},{"key":"2026033012244512800_ref036","author":"Brockman","year":"2016","journal-title":"\u201cOpenAI Gym\u201d"},{"key":"2026033012244512800_ref037","volume-title":"International Joint Conference on Artificial Intelligence (IJCAI-17)","author":"Brown","year":"2017"},{"issue":"1","key":"2026033012244512800_ref038","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TCIAIG.2012.2186810","article-title":"\u201cA survey of monte carlo tree search methods\u201d","volume":"4","author":"Browne","year":"2012","journal-title":"IEEE Transactions on Computational Intelligence and AI in games"},{"key":"2026033012244512800_ref039","volume-title":"Tech. 
rep.","author":"Br\u00fcgmann","year":"1993"},{"key":"2026033012244512800_ref040","volume-title":"et al.","author":"Brundage"},{"key":"2026033012244512800_ref041","doi-asserted-by":"crossref","first-page":"2315","DOI":"10.1109\/IJCNN.2014.6889732","volume-title":"Neural Networks (IJCNN), 2014 International Joint Conference on","author":"Brys","year":"2014"},{"issue":"19","key":"2026033012244512800_ref042","doi-asserted-by":"crossref","first-page":"1832","DOI":"10.1016\/j.tcs.2010.12.059","article-title":"\u201cPure exploration in finitely-armed and continuous-armed bandits\u201d","volume":"412","author":"Bubeck","year":"2011","journal-title":"Theoretical Computer Science"},{"key":"2026033012244512800_ref043","volume-title":"arXiv:1810.12894","author":"Burda","year":"2018"},{"issue":"1","key":"2026033012244512800_ref044","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1257\/0022051053737843","article-title":"\u201cNeuroeconomics: How neuroscience can inform economics\u201d","volume":"43","author":"Camerer","year":"2005","journal-title":"Journal of economic Literature"},{"issue":"1-2","key":"2026033012244512800_ref045","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1016\/S0004-3702(01)00129-1","article-title":"\u201cDeep blue\u201d","volume":"134","author":"Campbell","year":"2002","journal-title":"Artificial intelligence"},{"key":"2026033012244512800_ref046","doi-asserted-by":"crossref","DOI":"10.1128\/IAI.00908-10","volume-title":"\u201cReproducible science\u201d","author":"Casadevall","year":"2010"},{"key":"2026033012244512800_ref047","volume-title":"9th International Conference on Agents and Artificial Intelligence (ICAART 
2017)","author":"Castronovo","year":"2017"},{"key":"2026033012244512800_ref048","volume-title":"arXiv:1810.05687","author":"Chebotar","year":"2018"},{"key":"2026033012244512800_ref049","volume-title":"arXiv:1511.05641","author":"Chen","year":"2015"},{"key":"2026033012244512800_ref050","volume-title":"arXiv:1706.01284","author":"Chen","year":"2017"},{"key":"2026033012244512800_ref051","volume-title":"arXiv:1704.02254","author":"Chiappa","year":"2017"},{"key":"2026033012244512800_ref052","volume-title":"arXiv:1706.03741","author":"Christiano","year":"2017"},{"key":"2026033012244512800_ref053","volume-title":"Pattern recognition and machine learning.","author":"Christopher","year":"2006"},{"issue":"1481","key":"2026033012244512800_ref054","doi-asserted-by":"crossref","first-page":"933","DOI":"10.1098\/rstb.2007.2098","article-title":"\u201cShould I stay or should I go? How the human brain manages the trade-off between exploitation and exploration\u201d","volume":"362","author":"Cohen","year":"2007","journal-title":"Philosophical Transactions of the Royal Society of London B: Biological Sciences"},{"issue":"3","key":"2026033012244512800_ref055","doi-asserted-by":"crossref","first-page":"273","DOI":"10.1023\/A:1022627411411","article-title":"\u201cSupport-vector networks\u201d","volume":"20","author":"Cortes","year":"1995","journal-title":"Machine learning"},{"key":"2026033012244512800_ref056","unstructured":"Coumans, E.,Y.Bai, et al.2016. \u201cBullet\u201d.http:\/\/pybullet.org\/."},{"key":"2026033012244512800_ref057","unstructured":"Da Silva, B., G.Konidaris, and A.Barto. 2012. \u201cLearning parameterized skills\u201d. 
arXiv preprint arXiv:1206.6398."},{"key":"2026033012244512800_ref058","volume-title":"arXiv:1710.10044","author":"Dabney","year":"2017"},{"issue":"4","key":"2026033012244512800_ref059","doi-asserted-by":"crossref","first-page":"429","DOI":"10.3758\/CABN.8.4.429","article-title":"\u201cDecision theory, reinforcement learning, and the brain\u201d","volume":"8","author":"Dayan","year":"2008","journal-title":"Cognitive, Affective, Behavioral Neuroscience"},{"issue":"2","key":"2026033012244512800_ref060","doi-asserted-by":"crossref","first-page":"185","DOI":"10.1016\/j.conb.2008.08.003","article-title":"\u201cReinforcement learning: the good, the bad and the ugly \u201d","volume":"18","author":"Dayan","year":"2008","journal-title":"Current opinion in neurobiology"},{"key":"2026033012244512800_ref061","first-page":"150","volume-title":"Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence","author":"Dearden","year":"1999"},{"key":"2026033012244512800_ref062","volume-title":"\u201cBayesian Q-learning \u201d","author":"Dearden","year":"1998"},{"key":"2026033012244512800_ref063","first-page":"465","volume-title":"Proceedings of the 28th International Conference on machine learning (ICML-11)","author":"Deisenroth","year":"2011"},{"key":"2026033012244512800_ref064","first-page":"1","article-title":"\u201cStatistical comparisons of classifiers over multiple data sets \u201d","volume":"7","author":"Dem\u0161ar","year":"2006","journal-title":"Journal of Machine learning research"},{"issue":"3","key":"2026033012244512800_ref065","doi-asserted-by":"crossref","first-page":"653","DOI":"10.1109\/TNNLS.2016.2522401","article-title":"\u201cDeep direct reinforcement learning for financial signal representation and trading \u201d","volume":"28","author":"Deng","year":"2017","journal-title":"IEEE transactions on neural networks and learning systems"},{"key":"2026033012244512800_ref066","volume-title":"\u201cOpenAI Baselines 
\u201d","author":"Dhariwal","year":"2017"},{"issue":"7","key":"2026033012244512800_ref067","doi-asserted-by":"crossref","first-page":"1895","DOI":"10.1162\/089976698300017197","article-title":"\u201cApproximate statistical tests for comparing supervised classification learning algorithms \u201d","volume":"10","author":"Dietterich","year":"1998","journal-title":"Neural computation"},{"key":"2026033012244512800_ref068","first-page":"1","volume-title":"Asian Conference on Machine Learning","author":"Dietterich","year":"2009"},{"key":"2026033012244512800_ref069","first-page":"895","volume-title":"Proceedings of the 27th International Conference on Machine Learning (ICML-10)","author":"Dinculescu","year":"2010"},{"key":"2026033012244512800_ref070","volume-title":"arXiv:1611.01779","author":"Dosovitskiy","year":"2016"},{"key":"2026033012244512800_ref071","volume-title":"arXiv:1703.07326","author":"Duan","year":"2017"},{"key":"2026033012244512800_ref072","first-page":"1329","volume-title":"International Conference on Machine Learning","author":"Duan","year":"2016"},{"key":"2026033012244512800_ref073","volume-title":"arXiv:1611.02779","author":"Duan","year":"2016"},{"key":"2026033012244512800_ref074","volume-title":"PowerTech Manchester 2017 Proceedings","author":"Duchesne","year":"2017"},{"issue":"1-2","key":"2026033012244512800_ref075","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1023\/A:1007694015589","article-title":"\u201cRelational reinforcement learning \u201d","volume":"43","author":"D\u017eeroski","year":"2001","journal-title":"Machine learning"},{"issue":"3","key":"2026033012244512800_ref076","first-page":"1","article-title":"\u201cVisualizing higher-layer features of a deep network \u201d","volume":"1341","author":"Erhan","year":"2009","journal-title":"University of Montreal"},{"key":"2026033012244512800_ref077","first-page":"503","volume-title":"Journal of Machine Learning 
Research","author":"Ernst","year":"2005"},{"key":"2026033012244512800_ref078","volume-title":"arXiv:1710.11417","author":"Farquhar","year":"2017"},{"key":"2026033012244512800_ref079","volume-title":"arXiv:1712.04034","author":"Fazel-Zarandi","year":"2017"},{"key":"2026033012244512800_ref080","volume-title":"arXiv:1703.03400","author":"Finn","year":"2017"},{"key":"2026033012244512800_ref081","first-page":"64","volume-title":"Advances In Neural Information Processing Systems","author":"Finn","year":"2016"},{"key":"2026033012244512800_ref082","article-title":"\u201cGuided cost learning: Deep inverse optimal control via policy optimization \u201d","volume":"48","author":"Finn","year":"2016","journal-title":"Proceedings of the 33rd International Conference on Machine Learning"},{"key":"2026033012244512800_ref083","volume-title":"arXiv:1704.03012","author":"Florensa","year":"2017"},{"key":"2026033012244512800_ref084","first-page":"1514","volume-title":"International Conference on Machine Learning","author":"Florensa","year":"2018"},{"key":"2026033012244512800_ref085","first-page":"122","volume-title":"Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems","author":"Foerster","year":"2018"},{"key":"2026033012244512800_ref086","volume-title":"arXiv:1705.08926","author":"Foerster","year":"2017"},{"key":"2026033012244512800_ref087","volume-title":"arXiv:1702.08887","author":"Foerster","year":"2017"},{"issue":"1","key":"2026033012244512800_ref088","doi-asserted-by":"crossref","first-page":"383","DOI":"10.1007\/s10479-012-1248-5","article-title":"\u201cBatch mode reinforcement learning based on the synthesis of artificial trajectories \u201d","volume":"208","author":"Fonteneau","year":"2013","journal-title":"Annals of operations research"},{"key":"2026033012244512800_ref089","volume-title":"\u201cVariable selection for dynamic treatment regimes: a reinforcement learning approach 
\u201d","author":"Fonteneau","year":"2008"},{"key":"2026033012244512800_ref090","volume-title":"arXiv:1706.10295","author":"Fortunato","year":"2017"},{"key":"2026033012244512800_ref091","author":"Fox","year":"2015","journal-title":"arXiv:1512.08562"},{"key":"2026033012244512800_ref092","article-title":"\u201cContributions to deep reinforcement learning and its applications in smartgrids \u201d","author":"Fran\u00e7ois-Lavet","year":"2017","journal-title":"PhD thesis"},{"key":"2026033012244512800_ref093","unstructured":"Fran\u00e7ois-Lavet, V. et al. 2016. \u201cDeeR\u201d. https:\/\/deer.readthedocs.io\/."},{"key":"2026033012244512800_ref094","volume-title":"arXiv:1809.04506","author":"Fran\u00e7ois-Lavet","year":"2018"},{"key":"2026033012244512800_ref095","volume-title":"arXiv:1709.07796","author":"Fran\u00e7ois-Lavet","year":"2017"},{"key":"2026033012244512800_ref096","volume-title":"arXiv:1512.02011","author":"Fran\u00e7ois-Lavet","year":"2015"},{"key":"2026033012244512800_ref097","volume-title":"European Workshop on Reinforcement Learning","author":"Fran\u00e7ois-Lavet","year":"2016"},{"key":"2026033012244512800_ref098","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1007\/978-3-642-46466-9_18","author":"Fukushima","year":"1982","journal-title":"Competition and cooperation in neural nets"},{"key":"2026033012244512800_ref099","first-page":"1050","author":"Gal","year":"2016","journal-title":"Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016"},{"key":"2026033012244512800_ref100","author":"Gandhi","year":"2017","journal-title":"arXiv:1704.05588"},{"key":"2026033012244512800_ref101","author":"Garnelo","year":"2016","journal-title":"arXiv:1609.05518"},{"key":"2026033012244512800_ref102","author":"Gauci","year":"2018","journal-title":"arXiv:1811.00260"},{"key":"2026033012244512800_ref103","volume-title":"\u201cModification of UCT with patterns in 
Monte-Carlo Go \u201d","author":"Gelly","year":"2006"},{"issue":"1","key":"2026033012244512800_ref104","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1162\/neco.1992.4.1.1","article-title":"\u201cNeural networks and the bias\/variance dilemma \u201d","volume":"4","author":"Geman","year":"1992","journal-title":"Neural computation"},{"key":"2026033012244512800_ref105","first-page":"1573","article-title":"\u201cRLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research \u201d","volume":"16","author":"Geramifard","year":"2015","journal-title":"Journal of Machine Learning Research"},{"issue":"1","key":"2026033012244512800_ref106","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1007\/s10994-006-6226-1","article-title":"\u201cExtremely randomized trees \u201d","volume":"63","author":"Geurts","year":"2006","journal-title":"Machine learning"},{"issue":"5-6","key":"2026033012244512800_ref107","doi-asserted-by":"crossref","first-page":"359","DOI":"10.1561\/2200000049","article-title":"\u201cBayesian reinforcement learning: A survey \u201d","volume":"8","author":"Ghavamzadeh","year":"2015","journal-title":"Foundations and Trends\u00ae in Machine Learning"},{"issue":"2","key":"2026033012244512800_ref108","doi-asserted-by":"crossref","first-page":"661","DOI":"10.1109\/LRA.2015.2509024","article-title":"\u201cA machine learning approach to visual perception of forest trails for mobile robots \u201d","volume":"1","author":"Giusti","year":"2016","journal-title":"IEEE Robotics and Automation Letters"},{"key":"2026033012244512800_ref109","author":"Goodfellow","year":"2016","journal-title":"Deep learning"},{"key":"2026033012244512800_ref110","first-page":"2672","author":"Goodfellow","year":"2014","journal-title":"Advances in neural information processing systems"},{"key":"2026033012244512800_ref111","first-page":"1052","author":"Gordon","year":"1996","journal-title":"Advances in neural information processing 
systems"},{"key":"2026033012244512800_ref112","article-title":"\u201cApproximate solutions to Markov decision processes \u201d","volume":"228","author":"Gordon","year":"1999","journal-title":"Robotics Institute"},{"key":"2026033012244512800_ref113","author":"Graves","year":"2014","journal-title":"arXiv:1410.5401"},{"key":"2026033012244512800_ref114","author":"Gregor","year":"2016","journal-title":"arXiv:1611.07507"},{"key":"2026033012244512800_ref115","author":"Gruslys","year":"2017","journal-title":"arXiv:1704.04651"},{"key":"2026033012244512800_ref116","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1109\/ICRA.2017.7989385","author":"Gu","year":"2017","journal-title":"Robotics and Automation (ICRA), 2017 IEEE International Conference on"},{"key":"2026033012244512800_ref117","volume-title":"5th International Conference on Learning Representations (ICLR 2017)","author":"Gu","year":"2017"},{"key":"2026033012244512800_ref118","author":"Gu","year":"2016","journal-title":"arXiv:1611.02247"},{"key":"2026033012244512800_ref119","volume-title":"arXiv:1706.00387","author":"Gu","year":"2017"},{"key":"2026033012244512800_ref120","volume-title":"arXiv:1603.00748","author":"Gu","year":"2016"},{"key":"2026033012244512800_ref121","volume-title":"arXiv:1703.03454","author":"Guo","year":"2017"},{"key":"2026033012244512800_ref122","volume-title":"arXiv:1702.08165","author":"Haarnoja","year":"2017"},{"key":"2026033012244512800_ref123","volume-title":"arXiv:1802.07442","author":"Haber","year":"2018"},{"key":"2026033012244512800_ref124","first-page":"3909","volume-title":"Advances in neural information processing systems","author":"Hadfield-Menell","year":"2016"},{"issue":"1-2","key":"2026033012244512800_ref125","doi-asserted-by":"crossref","first-page":"137","DOI":"10.1007\/s10994-011-5235-x","article-title":"\u201cReinforcement learning in feedback control \u201d","volume":"84","author":"Hafner","year":"2011","journal-title":"Machine 
learning"},{"issue":"3","key":"2026033012244512800_ref126","doi-asserted-by":"crossref","first-page":"179","DOI":"10.1038\/nmeth.3288","article-title":"\u201cThe fickle P value generates irreproducible results \u201d","volume":"12","author":"Halsey","year":"2015","journal-title":"Nature methods"},{"key":"2026033012244512800_ref127","volume-title":"Sapiens: A brief history of humankind","author":"Harari","year":"2014"},{"key":"2026033012244512800_ref128","doi-asserted-by":"crossref","first-page":"305","DOI":"10.1007\/978-3-319-46379-7_21","volume-title":"International Conference on Algorithmic Learning Theory","author":"Harutyunyan","year":"2016"},{"issue":"2","key":"2026033012244512800_ref129","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1016\/j.neuron.2017.06.011","article-title":"\u201cNeuroscience-inspired artificial intelligence \u201d","volume":"95","author":"Hassabis","year":"2017","journal-title":"Neuron"},{"key":"2026033012244512800_ref130","first-page":"2613","volume-title":"Advances in Neural Information Processing Systems","author":"Hasselt","year":"2010"},{"key":"2026033012244512800_ref131","volume-title":"arXiv:1507.06527","author":"Hausknecht","year":"2015"},{"key":"2026033012244512800_ref132","first-page":"220","volume-title":"Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence","author":"Hauskrecht","year":"1998"},{"key":"2026033012244512800_ref133","first-page":"770","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"He","year":"2016"},{"key":"2026033012244512800_ref134","first-page":"2944","volume-title":"Advances in Neural Information Processing Systems","author":"Heess","year":"2015"},{"key":"2026033012244512800_ref135","volume-title":"ICML Lifelong Learning: A Reinforcement Learning Approach 
Workshop","author":"Henderson","year":"2017"},{"key":"2026033012244512800_ref136","volume-title":"arXiv:1709.06560","author":"Henderson","year":"2017"},{"key":"2026033012244512800_ref137","author":"Hessel","year":"2017","journal-title":"arXiv:1710.02298"},{"key":"2026033012244512800_ref138","volume-title":"arXiv:1809.04474","author":"Hessel","year":"2018"},{"key":"2026033012244512800_ref139","volume-title":"arXiv:1707.08475","author":"Higgins","year":"2017"},{"key":"2026033012244512800_ref140","first-page":"4565","volume-title":"Advances in Neural Information Processing Systems","author":"Ho","year":"2016"},{"issue":"8","key":"2026033012244512800_ref141","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"\u201cLong short-term memory \u201d","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural computation"},{"key":"2026033012244512800_ref142","first-page":"87","volume-title":"International Conference on Artificial Neural Networks","author":"Hochreiter","year":"2001"},{"issue":"4","key":"2026033012244512800_ref143","doi-asserted-by":"crossref","first-page":"679","DOI":"10.1037\/0033-295X.109.4.679","article-title":"\u201cThe neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. 
\u201d","volume":"109","author":"Holroyd","year":"2002","journal-title":"Psychological review"},{"key":"2026033012244512800_ref144","first-page":"1109","volume-title":"Advances in Neural Information Processing Systems","author":"Houthooft","year":"2016"},{"key":"2026033012244512800_ref145","volume-title":"arXiv:1502.03167","author":"Ioffe","year":"2015"},{"key":"2026033012244512800_ref146","volume-title":"ICML Reproducibility in Machine Learning Workshop","author":"Islam","year":"2017"},{"key":"2026033012244512800_ref147","volume-title":"arXiv:1807.01281","author":"Jaderberg","year":"2018"},{"key":"2026033012244512800_ref148","volume-title":"arXiv:1611.05397","author":"Jaderberg","year":"2016"},{"key":"2026033012244512800_ref149","first-page":"704","volume-title":"European Conference on Artificial Life","author":"Jakobi","year":"1995"},{"issue":"2","key":"2026033012244512800_ref150","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1023\/A:1022899518027","article-title":"\u201cVariance and bias for general loss functions \u201d","volume":"51","author":"James","year":"2003","journal-title":"Machine Learning"},{"key":"2026033012244512800_ref151","volume-title":"arXiv:1810.08647","author":"Jaques","year":"2018"},{"issue":"3","key":"2026033012244512800_ref152","doi-asserted-by":"crossref","first-page":"496","DOI":"10.1214\/aos\/1176342415","article-title":"\u201cMarkov decision processes with a new optimality criterion: Discrete time \u201d","volume":"1","author":"Jaquette","year":"1973","journal-title":"The Annals of Statistics"},{"key":"2026033012244512800_ref153","first-page":"179","volume-title":"Proceedings of the 32nd International Conference on Machine Learning (ICML-15)","author":"Jiang","year":"2015"},{"key":"2026033012244512800_ref154","first-page":"1181","volume-title":"Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent 
Systems","author":"Jiang","year":"2015"},{"key":"2026033012244512800_ref155","first-page":"652","volume-title":"Proceedings of The 33rd International Conference on Machine Learning","author":"Jiang","year":"2016"},{"key":"2026033012244512800_ref156","article-title":"\u201cInferring and Executing Programs for Visual Reasoning \u201d","author":"Johnson","year":"2017","journal-title":"arXiv:1705.03633"},{"key":"2026033012244512800_ref157","first-page":"4246","volume-title":"IJCAI","author":"Johnson","year":"2016"},{"key":"2026033012244512800_ref158","volume-title":"arXiv:1809.02627","author":"Juliani","year":"2018"},{"issue":"1","key":"2026033012244512800_ref159","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1016\/S0004-3702(98)00023-X","article-title":"\u201cPlanning and acting in partially observable stochastic domains \u201d","volume":"101","author":"Kaelbling","year":"1998","journal-title":"Artificial intelligence"},{"key":"2026033012244512800_ref160","volume-title":"Thinking, fast and slow","author":"Kahneman","year":"2011"},{"key":"2026033012244512800_ref161","first-page":"1531","volume-title":"Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada]","author":"Kakade","year":"2001"},{"key":"2026033012244512800_ref162","first-page":"306","article-title":"\u201cExploration in metric state spaces \u201d","volume":"3","author":"Kakade","year":"2003","journal-title":"ICML"},{"key":"2026033012244512800_ref163","first-page":"1331","volume-title":"Robotics and Automation (ICRA), 2013 IEEE International Conference 
on","author":"Kalakrishnan","year":"2013"},{"key":"2026033012244512800_ref164","volume-title":"arXiv:1806.10293","author":"Kalashnikov","year":"2018"},{"key":"2026033012244512800_ref165","volume-title":"arXiv:1610.00527","author":"Kalchbrenner","year":"2016"},{"key":"2026033012244512800_ref166","volume-title":"arXiv:1706.04317","author":"Kansky","year":"2017"},{"key":"2026033012244512800_ref167","volume-title":"arXiv:1704.05539","author":"Kaplan","year":"2017"},{"issue":"2-3","key":"2026033012244512800_ref168","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1023\/A:1017984413808","article-title":"\u201cNear-optimal reinforcement learning in polynomial time \u201d","volume":"49","author":"Kearns","year":"2002","journal-title":"Machine Learning"},{"key":"2026033012244512800_ref169","first-page":"1","volume-title":"Computational Intelligence and Games (CIG), 2016 IEEE Conference on","author":"Kempka","year":"2016"},{"key":"2026033012244512800_ref170","volume-title":"arXiv:1612.00796","author":"Kirkpatrick","year":"2016"},{"key":"2026033012244512800_ref171","volume-title":"arXiv:1706.02515","author":"Klambauer","year":"2017"},{"key":"2026033012244512800_ref172","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1145\/1553374.1553441","volume-title":"Proceedings of the 26th Annual International Conference on Machine Learning. ACM","author":"Kolter","year":"2009"},{"key":"2026033012244512800_ref173","first-page":"1008","volume-title":"Advances in neural information processing systems","author":"Konda","year":"2000"},{"key":"2026033012244512800_ref174","first-page":"1097","volume-title":"Advances in neural information processing systems","author":"Krizhevsky","year":"2012"},{"key":"2026033012244512800_ref175","first-page":"324","volume-title":"Machine Learning and Applications, 2009. ICMLA \u201909. 
International Conference on","author":"Kroon","year":"2009"},{"key":"2026033012244512800_ref176","first-page":"3675","volume-title":"Advances in Neural Information Processing Systems","author":"Kulkarni","year":"2016"},{"key":"2026033012244512800_ref177","first-page":"2140","volume-title":"AAAI","author":"Lample","year":"2017"},{"issue":"10","key":"2026033012244512800_ref178","first-page":"1995","article-title":"\u201cConvolutional networks for images, speech, and time series \u201d","volume":"3361","author":"LeCun","year":"1995","journal-title":"The handbook of brain theory and neural networks"},{"issue":"7553","key":"2026033012244512800_ref179","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1038\/nature14539","article-title":"\u201cDeep learning \u201d","volume":"521","author":"LeCun","year":"2015","journal-title":"Nature"},{"issue":"11","key":"2026033012244512800_ref180","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"\u201cGradient-based learning applied to document recognition \u201d","volume":"86","author":"LeCun","year":"1998","journal-title":"Proceedings of the IEEE"},{"key":"2026033012244512800_ref181","doi-asserted-by":"crossref","first-page":"287","DOI":"10.1146\/annurev-neuro-062111-150512","article-title":"\u201cNeural basis of reinforcement learning and decision making \u201d","volume":"35","author":"Lee","year":"2012","journal-title":"Annual review of neuroscience"},{"key":"2026033012244512800_ref182","first-page":"572","article-title":"\u201cEfficient reinforcement learning with relocatable action models \u201d","volume":"7","author":"Leffler","year":"2007","journal-title":"AAAI"},{"issue":"39","key":"2026033012244512800_ref183","first-page":"1","article-title":"\u201cEnd-to-end training of deep visuomotor policies \u201d","volume":"17","author":"Levine","year":"2016","journal-title":"Journal of Machine Learning 
Research"},{"key":"2026033012244512800_ref184","first-page":"1","volume-title":"International Conference on Machine Learning","author":"Levine","year":"2013"},{"issue":"3","key":"2026033012244512800_ref185","doi-asserted-by":"crossref","first-page":"247","DOI":"10.1109\/JAS.2016.7508798","article-title":"\u201cTraffic signal timing via deep reinforcement learning \u201d","volume":"3","author":"Li","year":"2016","journal-title":"IEEE\/CAA Journal of Automatica Sinica"},{"key":"2026033012244512800_ref186","doi-asserted-by":"crossref","first-page":"297","DOI":"10.1145\/1935826.1935878","volume-title":"Proceedings of the fourth ACM international conference on Web search and data mining","author":"Li","year":"2011"},{"key":"2026033012244512800_ref187","volume-title":"arXiv:1509.03044","author":"Li","year":"2015"},{"issue":"3","key":"2026033012244512800_ref188","first-page":"18","article-title":"\u201cClassification and regression by randomForest \u201d","volume":"2","author":"Liaw","year":"2002","journal-title":"R news"},{"key":"2026033012244512800_ref189","volume-title":"arXiv:1509.02971","author":"Lillicrap","year":"2015"},{"issue":"3-4","key":"2026033012244512800_ref190","doi-asserted-by":"crossref","first-page":"293","DOI":"10.1023\/A:1022628806385","article-title":"\u201cSelf-improving reactive agents based on reinforcement learning, planning and teaching \u201d","volume":"8","author":"Lin","year":"1992","journal-title":"Machine learning"},{"key":"2026033012244512800_ref191","volume-title":"arXiv:1608.05081","author":"Lipton","year":"2016"},{"key":"2026033012244512800_ref192","first-page":"157","article-title":"\u201cMarkov games as a framework for multi-agent reinforcement learning \u201d","volume":"157","author":"Littman","year":"1994","journal-title":"Proceedings of the eleventh international conference on machine 
learning"},{"key":"2026033012244512800_ref193","volume-title":"arXiv:1707.03374","author":"Liu","year":"2017"},{"key":"2026033012244512800_ref194","volume-title":"arXiv:1706.02275","author":"Lowe","year":"2017"},{"key":"2026033012244512800_ref195","volume-title":"arXiv:1701.06049","author":"MacGlashan","year":"2017"},{"key":"2026033012244512800_ref196","volume-title":"arXiv:1703.00956","author":"Machado","year":"2017"},{"key":"2026033012244512800_ref197","volume-title":"arXiv:1709.06009","author":"Machado","year":"2017"},{"key":"2026033012244512800_ref198","first-page":"1077","volume-title":"Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems","author":"Mandel","year":"2014"},{"key":"2026033012244512800_ref199","first-page":"1588","volume-title":"Advances in Neural Information Processing Systems","author":"Mankowitz","year":"2016"},{"key":"2026033012244512800_ref200","volume-title":"arXiv:1511.05440","author":"Mathieu","year":"2015"},{"key":"2026033012244512800_ref201","volume-title":"arXiv:1707.00183","author":"Matiisen","year":"2017"},{"key":"2026033012244512800_ref202","volume-title":"PhD thesis","author":"McCallum","year":"1996"},{"key":"2026033012244512800_ref203","article-title":"\u201cRoles of macroactions in accelerating reinforcement learning \u201d","volume":"1317","author":"McGovern","year":"1997","journal-title":"Grace Hopper celebration of women in computing"},{"key":"2026033012244512800_ref204","volume-title":"arXiv:1703.00548","author":"Miikkulainen","year":"2017"},{"key":"2026033012244512800_ref205","volume-title":"arXiv:1611.03673","author":"Mirowski","year":"2016"},{"key":"2026033012244512800_ref206","volume-title":"International Conference on Machine Learning","author":"Mnih","year":"2016"},{"issue":"7540","key":"2026033012244512800_ref207","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1038\/nature14236","article-title":"\u201cHuman-level control through deep reinforcement learning 
\u201d","volume":"518","author":"Mnih","year":"2015","journal-title":"Nature"},{"key":"2026033012244512800_ref208","first-page":"2125","volume-title":"Advances in neural information processing systems","author":"Mohamed","year":"2015"},{"key":"2026033012244512800_ref209","doi-asserted-by":"crossref","first-page":"271","DOI":"10.1007\/978-1-4614-1424-7_13","volume-title":"20 Years of Computational Neuroscience","author":"Montague","year":"2013"},{"key":"2026033012244512800_ref210","volume-title":"\u201cEfficient memory-based learning for robot control \u201d","author":"Moore","year":"1990"},{"issue":"4-5","key":"2026033012244512800_ref211","doi-asserted-by":"crossref","first-page":"667","DOI":"10.1016\/S0098-1354(98)00301-9","article-title":"\u201cModel predictive control: past, present and future \u201d","volume":"23","author":"Morari","year":"1999","journal-title":"Computers & Chemical Engineering"},{"issue":"6337","key":"2026033012244512800_ref212","doi-asserted-by":"crossref","first-page":"508","DOI":"10.1126\/science.aam6960","article-title":"\u201cDeepStack: Expert-level artificial intelligence in heads-up no-limit poker \u201d","volume":"356","author":"Morav\u010dik","year":"2017","journal-title":"Science"},{"key":"2026033012244512800_ref213","first-page":"3132","volume-title":"Advances in Neural Information Processing Systems","author":"Mordatch","year":"2015"},{"key":"2026033012244512800_ref214","first-page":"799","volume-title":"Proceedings of the 27th International Conference on Machine Learning (ICML-10)","author":"Morimura","year":"2010"},{"issue":"2","key":"2026033012244512800_ref215","doi-asserted-by":"crossref","first-page":"291","DOI":"10.1023\/A:1017992615625","article-title":"\u201cVariable resolution discretization in optimal control \u201d","volume":"49","author":"Munos","year":"2002","journal-title":"Machine learning"},{"key":"2026033012244512800_ref216","first-page":"1046","volume-title":"Advances in Neural Information Processing 
Systems","author":"Munos","year":"2016"},{"key":"2026033012244512800_ref217","volume-title":"\u201cMachine Learning: A Probabilistic Perspective. \u201d","author":"Murphy","year":"2012"},{"key":"2026033012244512800_ref218","volume-title":"arXiv:1708.02596","author":"Nagabandi","year":"2017"},{"key":"2026033012244512800_ref219","first-page":"7559","volume-title":"2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE","author":"Nagabandi","year":"2018"},{"key":"2026033012244512800_ref220","first-page":"566","volume-title":"Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems","author":"Narvekar","year":"2016"},{"key":"2026033012244512800_ref221","volume-title":"arXiv:1511.04834","author":"Neelakantan","year":"2015"},{"key":"2026033012244512800_ref222","volume-title":"arXiv:1206.5264","author":"Neu","year":"2012"},{"issue":"99","key":"2026033012244512800_ref223","first-page":"278","volume":"l","author":"Ng","year":"1999","journal-title":"ICML"},{"key":"2026033012244512800_ref224","first-page":"663","volume-title":"Icml","author":"Ng","year":"2000"},{"issue":"3","key":"2026033012244512800_ref225","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1109\/37.55119","volume":"10","author":"Nguyen","year":"1990","journal-title":"IEEE Control systems magazine"},{"issue":"3","key":"2026033012244512800_ref226","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1016\/j.jmp.2008.12.005","article-title":"\u201cReinforcement learning in the brain \u201d","volume":"53","author":"Niv","year":"2009","journal-title":"Journal of Mathematical Psychology"},{"key":"2026033012244512800_ref227","doi-asserted-by":"crossref","first-page":"331","DOI":"10.1016\/B978-0-12-374176-9.00022-1","volume-title":"Neuroeconomics. Elsevier","author":"Niv","year":"2009"},{"key":"2026033012244512800_ref228","volume-title":"Markov chains. No. 
2","author":"Norris","year":"1998"},{"key":"2026033012244512800_ref229","volume-title":"arXiv:1611.01626","author":"O\u2019Donoghue","year":"2016"},{"key":"2026033012244512800_ref230","volume-title":"arXiv:1605.09128","author":"Oh","year":"2016"},{"key":"2026033012244512800_ref231","first-page":"2863","volume-title":"Advances in Neural Information Processing Systems","author":"Oh","year":"2015"},{"key":"2026033012244512800_ref232","volume-title":"arXiv:1707.03497","author":"Oh","year":"2017"},{"key":"2026033012244512800_ref233","volume-title":"Distill","author":"Olah","year":"2017"},{"key":"2026033012244512800_ref234","doi-asserted-by":"crossref","first-page":"140","DOI":"10.1007\/978-3-319-11662-4_11","volume-title":"International Conference on Algorithmic Learning Theory. Springer","author":"Ortner","year":"2014"},{"key":"2026033012244512800_ref235","volume-title":"arXiv:1602.04621","author":"Osband","year":"2016"},{"key":"2026033012244512800_ref236","volume-title":"arXiv:1703.01310","author":"Ostrovski","year":"2017"},{"key":"2026033012244512800_ref237","volume-title":"arXiv:1810.05017","author":"Paine","year":"2018"},{"key":"2026033012244512800_ref238","volume-title":"arXiv:1511.06342","author":"Parisotto","year":"2015"},{"key":"2026033012244512800_ref239","volume-title":"arXiv:1707.06170","author":"Pascanu","year":"2017"},{"key":"2026033012244512800_ref240","volume":"2017","author":"Pathak","year":"2017","journal-title":"International Conference on Machine Learning (ICML)"},{"key":"2026033012244512800_ref241","volume-title":"Conditioned reflexes","author":"Pavlov","year":"1927"},{"key":"2026033012244512800_ref242","first-page":"2825","volume-title":"Journal of Machine Learning Research","author":"Pedregosa","year":"2011"},{"key":"2026033012244512800_ref243","doi-asserted-by":"crossref","first-page":"226","DOI":"10.1016\/B978-1-55860-335-6.50035-0","volume-title":"Machine Learning Proceedings 1994. 
Elsevier","author":"Peng","year":"1994"},{"key":"2026033012244512800_ref244","volume-title":"arXiv:1703.10069","author":"Peng","year":"2017"},{"issue":"4","key":"2026033012244512800_ref245","article-title":"\u201cDeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning \u201d","volume":"36","author":"Peng","year":"2017","journal-title":"ACM Transactions on Graphics (Proc. SIGGRAPH 2017)"},{"issue":"3","key":"2026033012244512800_ref246","doi-asserted-by":"crossref","first-page":"229","DOI":"10.1109\/TCIAIG.2015.2402393","article-title":"\u201cThe 2014 general video game playing competition \u201d","volume":"8","author":"Perez-Liebana","year":"2016","journal-title":"IEEE Transactions on Computational Intelligence and AI in Games"},{"key":"2026033012244512800_ref247","first-page":"1265","volume-title":"Advances in neural information processing systems","author":"Petrik","year":"2009"},{"key":"2026033012244512800_ref248","volume-title":"\u201cCapital in the Twenty-First Century \u201d","author":"Piketty","year":"2013"},{"key":"2026033012244512800_ref249","first-page":"1025","article-title":"\u201cPoint-based value iteration: An anytime algorithm for POMDPs \u201d","volume":"3","author":"Pineau","year":"2003","journal-title":"IJCAI"},{"key":"2026033012244512800_ref250","volume-title":"arXiv:1710.06542","author":"Pinto","year":"2017"},{"key":"2026033012244512800_ref251","volume-title":"arXiv:1706.01905","author":"Plappert","year":"2017"},{"key":"2026033012244512800_ref252","first-page":"80","volume-title":"Computer Science Department Faculty Publication Series","author":"Precup","year":"2000"},{"key":"2026033012244512800_ref253","volume-title":"arXiv:1511.06732","author":"Ranzato","year":"2015"},{"key":"2026033012244512800_ref254","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1007\/978-3-540-28650-9_4","volume-title":"Advanced lectures on machine 
learning","author":"Rasmussen","year":"2004"},{"key":"2026033012244512800_ref255","volume-title":"PhD thesis","author":"Ravindran","year":"2004"},{"key":"2026033012244512800_ref256","volume-title":"arXiv:1703.01041","author":"Real","year":"2017"},{"key":"2026033012244512800_ref257","volume-title":"arXiv:1511.06279","author":"Reed","year":"2015"},{"key":"2026033012244512800_ref258","first-page":"64","article-title":"\u201cA theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement \u201d","volume":"2","author":"Rescorla","year":"1972","journal-title":"Classical conditioning II: Current research and theory"},{"key":"2026033012244512800_ref259","doi-asserted-by":"crossref","first-page":"317","DOI":"10.1007\/11564096_32","volume-title":"Machine Learning: ECML 2005","author":"Riedmiller","year":"2005"},{"key":"2026033012244512800_ref260","volume-title":"arXiv:1802.10567","author":"Riedmiller","year":"2018"},{"key":"2026033012244512800_ref261","volume-title":"arXiv:1802.08163","author":"Rowland","year":"2018"},{"key":"2026033012244512800_ref262","volume-title":"arXiv:1706.05098","author":"Ruder","year":"2017"},{"issue":"3","key":"2026033012244512800_ref263","first-page":"1","article-title":"\u201cLearning representations by back-propagating errors \u201d","volume":"5","author":"Rumelhart","year":"1988","journal-title":"Cognitive modeling"},{"issue":"3","key":"2026033012244512800_ref264","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s11263-015-0816-y","article-title":"\u201cImagenet large scale visual recognition challenge \u201d","volume":"115","author":"Russakovsky","year":"2015","journal-title":"International Journal of Computer Vision"},{"key":"2026033012244512800_ref265","volume-title":"bioRxiv: 
083857","author":"Russek","year":"2017"},{"key":"2026033012244512800_ref266","volume-title":"arXiv:1511.06295","author":"Rusu","year":"2015"},{"key":"2026033012244512800_ref267","volume-title":"arXiv:1610.04286","author":"Rusu","year":"2016"},{"key":"2026033012244512800_ref268","volume-title":"arXiv:1611.04201","author":"Sadeghi","year":"2016"},{"issue":"5","key":"2026033012244512800_ref269","doi-asserted-by":"crossref","first-page":"2789","DOI":"10.3390\/e16052789","volume":"16","author":"Salge","year":"2014","journal-title":"Entropy"},{"key":"2026033012244512800_ref270","volume-title":"arXiv:1703.03864","author":"Salimans","year":"2017"},{"issue":"3","key":"2026033012244512800_ref271","doi-asserted-by":"crossref","first-page":"210","DOI":"10.1147\/rd.33.0210","article-title":"\u201cSome studies in machine learning using the game of checkers \u201d","volume":"3","author":"Samuel","year":"1959","journal-title":"IBM Journal of research and development"},{"issue":"10","key":"2026033012244512800_ref272","doi-asserted-by":"crossref","first-page":"100","DOI":"10.1371\/journal.pcbi.1003285","article-title":"\u201cTen simple rules for reproducible computational research \u201d","volume":"9","author":"Sandve","year":"2013","journal-title":"PLoS computational biology"},{"key":"2026033012244512800_ref273","volume-title":"arXiv:1706.01427","author":"Santoro","year":"2017"},{"key":"2026033012244512800_ref274","volume-title":"arXiv:1810.02274","author":"Savinov","year":"2018"},{"key":"2026033012244512800_ref275","volume-title":"\u201cTensorForce: A TensorFlow library for applied reinforcement learning \u201d","author":"Schaarschmidt","year":"2017"},{"key":"2026033012244512800_ref276","first-page":"743","article-title":"\u201cPyBrain \u201d","volume":"11","author":"Schaul","year":"2010","journal-title":"The Journal of Machine Learning Research"},{"key":"2026033012244512800_ref277","first-page":"1312","volume-title":"Proceedings of the 32nd International Conference on Machine 
Learning (ICML-15)","author":"Schaul","year":"2015"},{"key":"2026033012244512800_ref278","volume-title":"arXiv:1511.05952","author":"Schaul","year":"2015"},{"issue":"3","key":"2026033012244512800_ref279","doi-asserted-by":"crossref","first-page":"230","DOI":"10.1109\/TAMD.2010.2056368","article-title":"\u201cFormal theory of creativity, fun, and intrinsic motivation (1990-2010) \u201d","volume":"2","author":"Schmidhuber","year":"2010","journal-title":"IEEE Transactions on Autonomous Mental Development"},{"key":"2026033012244512800_ref280","doi-asserted-by":"crossref","first-page":"85","DOI":"10.1016\/j.neunet.2014.09.003","article-title":"\u201cDeep learning in neural networks: An overview \u201d","volume":"61","author":"Schmidhuber","year":"2015","journal-title":"Neural Networks"},{"key":"2026033012244512800_ref281","first-page":"817","volume-title":"Advances in Neural Information Processing Systems","author":"Schraudolph","year":"1994"},{"key":"2026033012244512800_ref282","volume-title":"arXiv:1704.06440","author":"Schulman","year":"2017"},{"key":"2026033012244512800_ref283","doi-asserted-by":"crossref","first-page":"339","DOI":"10.1007\/978-3-319-28872-7_20","volume-title":"Robotics Research","author":"Schulman","year":"2016"},{"key":"2026033012244512800_ref284","first-page":"1889","volume-title":"ICML","author":"Schulman","year":"2015"},{"key":"2026033012244512800_ref285","volume-title":"arXiv:1707.06347","author":"Schulman","year":"2017"},{"issue":"5306","key":"2026033012244512800_ref286","doi-asserted-by":"crossref","first-page":"1593","DOI":"10.1126\/science.275.5306.1593","article-title":"\u201cA neural substrate of prediction and reward \u201d","volume":"275","author":"Schultz","year":"1997","journal-title":"Science"},{"issue":"314","key":"2026033012244512800_ref287","article-title":"\u201cProgramming a Computer for Playing Chess \u201d","volume":"41","author":"Shannon","year":"1950","journal-title":"Philosophical 
Magazine"},{"issue":"05","key":"2026033012244512800_ref288","article-title":"\u201cLifelong Machine Learning Systems: Beyond Learning Algorithms. \u201d","volume":"13","author":"Silver","year":"2013","journal-title":"AAAI Spring Symposium: Lifelong Machine Learning"},{"key":"2026033012244512800_ref289","volume-title":"arXiv:1612.08810","author":"Silver","year":"2016"},{"issue":"7587","key":"2026033012244512800_ref290","doi-asserted-by":"crossref","first-page":"484","DOI":"10.1038\/nature16961","article-title":"\u201cMastering the game of Go with deep neural networks and tree search \u201d","volume":"529","author":"Silver","year":"2016","journal-title":"Nature"},{"key":"2026033012244512800_ref291","volume-title":"ICML","author":"Silver","year":"2014"},{"key":"2026033012244512800_ref292","first-page":"284","volume-title":"ICML","author":"Singh","year":"1994"},{"issue":"1-3","key":"2026033012244512800_ref293","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1023\/A:1018012322525","article-title":"\u201cReinforcement learning with replacing eligibility traces \u201d","volume":"22","author":"Singh","year":"1996","journal-title":"Machine learning"},{"issue":"3","key":"2026033012244512800_ref294","doi-asserted-by":"crossref","first-page":"287","DOI":"10.1023\/A:1007678930559","article-title":"\u201cConvergence results for single-step on-policy reinforcement-learning algorithms \u201d","volume":"38","author":"Singh","year":"2000","journal-title":"Machine learning"},{"issue":"2","key":"2026033012244512800_ref295","doi-asserted-by":"crossref","first-page":"282","DOI":"10.1287\/opre.26.2.282","article-title":"\u201cThe optimal control of partially observable Markov processes over the infinite horizon: Discounted costs \u201d","volume":"26","author":"Sondik","year":"1978","journal-title":"Operations research"},{"issue":"1","key":"2026033012244512800_ref296","first-page":"1929","article-title":"\u201cDropout: a simple way to prevent neural networks from overfitting. 
\u201d","volume":"15","author":"Srivastava","year":"2014","journal-title":"Journal of Machine Learning Research"},{"key":"2026033012244512800_ref297","volume-title":"arXiv:1507.00814","author":"Stadie","year":"2015"},{"key":"2026033012244512800_ref298","doi-asserted-by":"crossref","first-page":"369","DOI":"10.1007\/3-540-45164-1_38","volume-title":"Machine Learning: ECML 2000","author":"Stone","year":"2000"},{"key":"2026033012244512800_ref299","doi-asserted-by":"crossref","first-page":"76","DOI":"10.3389\/fnbeh.2014.00076","article-title":"\u201cDoes temporal discounting explain unhealthy behavior? A systematic review and reinforcement learning perspective \u201d","volume":"8","author":"Story","year":"2014","journal-title":"Frontiers in behavioral neuroscience"},{"key":"2026033012244512800_ref300","first-page":"2244","volume-title":"Advances in Neural Information Processing Systems","author":"Sukhbaatar","year":"2016"},{"key":"2026033012244512800_ref301","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1007\/978-3-642-22887-2_5","volume-title":"Artificial General Intelligence","author":"Sun","year":"2011"},{"key":"2026033012244512800_ref302","volume-title":"arXiv:1706.05296","author":"Sunehag","year":"2017"},{"issue":"1","key":"2026033012244512800_ref303","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1023\/A:1022633531479","article-title":"\u201cLearning to predict by the methods of temporal differences \u201d","volume":"3","author":"Sutton","year":"1988","journal-title":"Machine learning"},{"key":"2026033012244512800_ref304","first-page":"1038","volume-title":"Advances in neural information processing systems","author":"Sutton","year":"1996"},{"issue":"1","key":"2026033012244512800_ref305","volume":"1","author":"Sutton","year":"1998","journal-title":"Reinforcement learning: An introduction"},{"key":"2026033012244512800_ref306","volume-title":"Reinforcement Learning: An Introduction (2nd Edition, in 
progress)","author":"Sutton","year":"2017"},{"key":"2026033012244512800_ref307","first-page":"1057","volume-title":"Advances in neural information processing systems","author":"Sutton","year":"2000"},{"issue":"1-2","key":"2026033012244512800_ref308","doi-asserted-by":"crossref","first-page":"181","DOI":"10.1016\/S0004-3702(99)00052-1","article-title":"\u201cBetween MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning \u201d","volume":"112","author":"Sutton","year":"1999","journal-title":"Artificial intelligence"},{"key":"2026033012244512800_ref309","volume-title":"\u201cTemporal credit assignment in reinforcement learning \u201d","author":"Sutton","year":"1984"},{"key":"2026033012244512800_ref310","volume-title":"arXiv:1611.00625","author":"Synnaeve","year":"2016"},{"key":"2026033012244512800_ref311","volume-title":"arXiv:1602.07261","author":"Szegedy","year":"2016"},{"issue":"12","key":"2026033012244512800_ref312","article-title":"\u201cInception-v4, inception-resnet and the impact of residual connections on learning. 
\u201d","volume":"4","author":"Szegedy","year":"2017","journal-title":"AAAI"},{"key":"2026033012244512800_ref313","first-page":"2146","volume-title":"Advances in Neural Information Processing Systems","author":"Tamar","year":"2016"},{"key":"2026033012244512800_ref314","volume-title":"arXiv:1804.10332","author":"Tan","year":"2018"},{"key":"2026033012244512800_ref315","first-page":"2133","article-title":"\u201cRL-Glue: Language-independent software for reinforcement-learning experiments \u201d","volume":"10","author":"Tanner","year":"2009","journal-title":"The Journal of Machine Learning Research"},{"key":"2026033012244512800_ref316","volume-title":"arXiv:1707.04175","author":"Teh","year":"2017"},{"issue":"3","key":"2026033012244512800_ref317","doi-asserted-by":"crossref","first-page":"58","DOI":"10.1145\/203330.203343","article-title":"\u201cTemporal difference learning and TD-Gammon \u201d","volume":"38","author":"Tesauro","year":"1995","journal-title":"Communications of the ACM"},{"key":"2026033012244512800_ref318","first-page":"1553","volume-title":"AAAI","author":"Tessler","year":"2017"},{"key":"2026033012244512800_ref319","first-page":"441","volume-title":"International Conference on Machine Learning","author":"Thomas","year":"2014"},{"key":"2026033012244512800_ref320","volume-title":"International Conference on Machine Learning","author":"Thomas","year":"2016"},{"key":"2026033012244512800_ref321","volume-title":"\u201cEfficient exploration in reinforcement learning \u201d","author":"Thrun","year":"1992"},{"key":"2026033012244512800_ref322","volume-title":"Advances in Neural Information Processing Systems (NIPS)","author":"Tian","year":"2017"},{"key":"2026033012244512800_ref323","volume-title":"COURSERA: Neural Networks for Machine Learning","author":"Tieleman","year":"2012"},{"key":"2026033012244512800_ref324","volume-title":"arXiv:1703.06907","author":"Tobin","year":"2017"},{"key":"2026033012244512800_ref325","first-page":"5026","volume-title":"Intelligent 
Robots and Systems (IROS), 2012 IEEE\/RSJ International Conference on","author":"Todorov","year":"2012"},{"issue":"5","key":"2026033012244512800_ref326","doi-asserted-by":"crossref","first-page":"674","DOI":"10.1109\/9.580874","article-title":"\u201cAn analysis of temporal difference learning with function approximation \u201d","volume":"42","author":"Tsitsiklis","year":"1997","journal-title":"Automatic Control, IEEE Transactions on"},{"key":"2026033012244512800_ref327","volume-title":"Faster than thought","author":"Turing","year":"1953"},{"key":"2026033012244512800_ref328","volume-title":"arXiv:1511.07111","author":"Tzeng","year":"2015"},{"key":"2026033012244512800_ref329","first-page":"198","volume-title":"First International Early Research Career Enhancement School on Biologically Inspired Cognitive Architectures","author":"Ueno","year":"2017"},{"key":"2026033012244512800_ref330","first-page":"2094","volume-title":"AAAI","author":"Van Hasselt","year":"2016"},{"key":"2026033012244512800_ref331","volume-title":"\u201cStatistical learning theory. 
Adaptive and learning systems for signal processing, communications, and control \u201d","author":"Vapnik","year":"1998"},{"key":"2026033012244512800_ref332","volume-title":"arXiv:1706.03762","author":"Vaswani","year":"2017"},{"key":"2026033012244512800_ref333","first-page":"3486","volume-title":"Advances in Neural Information Processing Systems","author":"Vezhnevets","year":"2016"},{"key":"2026033012244512800_ref334","volume-title":"arXiv:1708.04782","author":"Vinyals","year":"2017"},{"key":"2026033012244512800_ref335","volume-title":"arXiv:1502.02251","author":"Wahlstr\u00f6m","year":"2015"},{"key":"2026033012244512800_ref336","volume-title":"It\u2019s Alive!: Artificial Intelligence from the Logic Piano to Killer Robots","author":"Walsh","year":"2017"},{"key":"2026033012244512800_ref337","volume-title":"arXiv:1611.05763","author":"Wang","year":"2016"},{"key":"2026033012244512800_ref338","volume-title":"arXiv:1611.01224","author":"Wang","year":"2016"},{"key":"2026033012244512800_ref339","volume-title":"arXiv:1511.06581","author":"Wang","year":"2015"},{"key":"2026033012244512800_ref340","volume-title":"arXiv:1709.10163","author":"Warnell","year":"2017"},{"issue":"3-4","key":"2026033012244512800_ref341","first-page":"279","article-title":"\u201cQ-learning \u201d","volume":"8","author":"Watkins","year":"1992","journal-title":"Machine learning"},{"key":"2026033012244512800_ref342","volume-title":"PhD thesis","author":"Watkins","year":"1989"},{"key":"2026033012244512800_ref343","first-page":"2746","volume-title":"Advances in neural information processing systems","author":"Watter","year":"2015"},{"key":"2026033012244512800_ref344","volume-title":"arXiv:1707.06203","author":"Weber","year":"2017"},{"key":"2026033012244512800_ref345","first-page":"402","volume-title":"Computational Intelligence and Games (CIG), 2012 IEEE Conference 
on","author":"Wender","year":"2012"},{"key":"2026033012244512800_ref346","doi-asserted-by":"crossref","first-page":"120","DOI":"10.1109\/ADPRL.2011.5967363","volume-title":"Adaptive Dynamic Programming And Reinforcement Learning (ADPRL), 2011 IEEE Symposium on. IEEE","author":"Whiteson","year":"2011"},{"issue":"3-4","key":"2026033012244512800_ref347","doi-asserted-by":"crossref","first-page":"229","DOI":"10.1023\/A:1022672621406","article-title":"\u201cSimple statistical gradient-following algorithms for connectionist reinforcement learning \u201d","volume":"8","author":"Williams","year":"1992","journal-title":"Machine learning"},{"key":"2026033012244512800_ref348","volume-title":"\u201cTraining agent for first-person shooter game with actor-critic curriculum learning \u201d","author":"Wu","year":"2016"},{"key":"2026033012244512800_ref349","first-page":"2048","volume-title":"International Conference on Machine Learning","author":"Xu","year":"2015"},{"key":"2026033012244512800_ref350","volume-title":"arXiv:1704.03952","author":"You","year":"2017"},{"key":"2026033012244512800_ref351","volume-title":"arXiv:1608.05742","author":"Zamora","year":"2016"},{"key":"2026033012244512800_ref352","volume-title":"arXiv:1806.07937","author":"Zhang","year":"2018"},{"key":"2026033012244512800_ref353","volume-title":"arXiv:1804.10689","author":"Zhang","year":"2018"},{"key":"2026033012244512800_ref354","volume-title":"arXiv:1804.06893","author":"Zhang","year":"2018"},{"key":"2026033012244512800_ref355","volume-title":"arXiv:1611.03530","author":"Zhang","year":"2016"},{"key":"2026033012244512800_ref356","volume-title":"arXiv:1609.05143","author":"Zhu","year":"2016"},{"key":"2026033012244512800_ref357","volume-title":"Modeling purposeful adaptive behavior with the principle of maximum causal entropy","author":"Ziebart","year":"2010"},{"key":"2026033012244512800_ref358","volume-title":"arXiv:1611.01578","author":"Zoph","year":"2016"}],"container-title":["Foundations and Trends\u00ae in 
Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/ftmal\/article-pdf\/11\/3-4\/219\/11155741\/2200000071en.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/www.emerald.com\/ftmal\/article-pdf\/11\/3-4\/219\/11155741\/2200000071en.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,30]],"date-time":"2026-03-30T16:25:44Z","timestamp":1774887944000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.emerald.com\/ftmal\/article\/11\/3-4\/219\/1332417\/An-Introduction-to-Deep-Reinforcement-Learning"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,12,20]]},"references-count":358,"journal-issue":{"issue":"3-4","published-print":{"date-parts":[[2018,12,20]]}},"URL":"https:\/\/doi.org\/10.1561\/2200000071","relation":{},"ISSN":["1935-8237","1935-8245"],"issn-type":[{"value":"1935-8237","type":"print"},{"value":"1935-8245","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,12,20]]}}}