{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,20]],"date-time":"2026-03-20T10:37:33Z","timestamp":1774003053439,"version":"3.50.1"},"reference-count":39,"publisher":"Institute for Operations Research and the Management Sciences (INFORMS)","issue":"2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Operations Research"],"published-print":{"date-parts":[[2026,3]]},"abstract":"<jats:p>In \u201cPost Reinforcement Learning Inference,\u201d Vasilis Syrgkanis and Ruohan Zhan develop a new inferential framework for data collected via reinforcement learning (RL) algorithms, the adaptive systems that update strategies as outcomes unfold. Traditional statistical methods fail in this setting because adaptivity induces time-varying variance and dependence across samples. Syrgkanis and Zhan propose an adaptively weighted generalized method of moments (AW-GMM) estimator that stabilizes this variance through data-dependent weights. They prove that the weighted estimator achieves consistency and asymptotic normality, enabling valid hypothesis testing and confidence intervals for policy values and dynamic treatment effects. Their method provides a unified approach for structural estimation and inference under nonstationary, adaptively generated sequence data, with applications to dynamic off-policy evaluation and personalized decision systems.<\/jats:p>","DOI":"10.1287\/opre.2024.1019","type":"journal-article","created":{"date-parts":[[2025,12,24]],"date-time":"2025-12-24T15:52:41Z","timestamp":1766591561000},"page":"917-957","source":"Crossref","is-referenced-by-count":0,"title":["Post Reinforcement Learning Inference"],"prefix":"10.1287","volume":"74","author":[{"given":"Vasilis","family":"Syrgkanis","sequence":"first","affiliation":[{"name":"Management Science and Engineering, Stanford University, Stanford, California 94305"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3426-2784","authenticated-orcid":false,"given":"Ruohan","family":"Zhan","sequence":"additional","affiliation":[{"name":"UCL School of Management, University College London, London E14 5AA, United Kingdom"}]}],"member":"109","reference":[{"key":"B1","unstructured":"Agrawal S, Goyal N (2013) Thompson sampling for contextual bandits with linear payoffs.\n                      Proc. Internat. Conf. Machine Learn.\n                      (PMLR, New York), 127\u2013135."},{"key":"B2","doi-asserted-by":"crossref","unstructured":"Baird L (1995) Residual algorithms: Reinforcement learning with function approximation.\n                      Machine Learn. Proc.\n                      (Elsevier, Amsterdam), 30\u201337.","DOI":"10.1016\/B978-1-55860-377-6.50013-X"},{"key":"B3","first-page":"462","volume":"34","author":"Barsov S","year":"1987","journal-title":"Doklady Math."},{"key":"B4","doi-asserted-by":"publisher","DOI":"10.1007\/s13226-010-0014-0"},{"key":"B5","first-page":"28548","volume":"34","author":"Bibaut A","year":"2021","journal-title":"Adv. Neural Inform. Processing Systems"},{"key":"B6","doi-asserted-by":"crossref","unstructured":"Cattaneo MD, Masini RP, Underwood WG (2025) Yurinskii\u2019s coupling for martingales.\n                      Annals Statist\n                      . 53(5):2179\u20132203.","DOI":"10.1214\/25-AOS2538"},{"key":"B7","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4614-7428-9_4"},{"key":"B8","doi-asserted-by":"crossref","unstructured":"Chen M, Beutel A, Covington P, Jain S, Belletti F, Chi EH (2019) Top-K off-policy correction for a REINFORCE recommender system.\n                      Proc. 12th ACM Internat. Conf. Web Search Data Mining\n                      (Association for Computing Machinery, New York), 456\u2013464.","DOI":"10.1145\/3289600.3290999"},{"key":"B9","doi-asserted-by":"publisher","DOI":"10.3982\/ECTA16294"},{"key":"B10","unstructured":"Chu W, Li L, Reyzin L, Schapire R (2011) Contextual bandits with linear payoff functions.\n                      Proc. 14th Internat. Conf. Artificial Intelligence Statist.\n                      (JMLR, Norfolk, MA), 208\u2013214."},{"key":"B11","doi-asserted-by":"crossref","unstructured":"Daskalakis C, Golowich N (2022) Fast rates for nonparametric online learning: From realizability to learning in games.\n                      Proc. 54th Annual ACM SIGACT Sympos. Theory Comput.\n                      (Association for Computing Machinery, New York), 846\u2013859.","DOI":"10.1145\/3519935.3519950"},{"key":"B12","unstructured":"Deshpande Y, Mackey L, Syrgkanis V, Taddy M (2018) Accurate inference for adaptive linear models.\n                      Proc. Internat. Conf. Machine Learn.\n                      (PMLR, New York), 1194\u20131203."},{"key":"B13","unstructured":"Devroye L, Mehrabian A, Reddad T (2018) The total variation distance between high-dimensional Gaussians with the same mean. Preprint, submitted October 19, https:\/\/arxiv.org\/abs\/1810.08693."},{"key":"B14","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.2014602118"},{"key":"B15","volume-title":"Martingale Limit Theory and Its Application","author":"Hall P","year":"2014"},{"key":"B16","doi-asserted-by":"publisher","DOI":"10.1162\/003465304323023651"},{"key":"B17","unstructured":"Kakade S, Langford J (2002) Approximately optimal approximate reinforcement learning.\n                      Proc. 19th Internat. Conf. Machine Learn.\n                      (Morgan Kaufmann Publishers, Burlington, MA), 267\u2013274."},{"key":"B18","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-21700-0"},{"key":"B19","doi-asserted-by":"publisher","DOI":"10.1146\/annurev-clinpsy-032511-143152"},{"key":"B20","unstructured":"Lewis G, Syrgkanis V (2020) Double\/debiased machine learning for dynamic treatment effects via g-estimation. Preprint, submitted February 17, https:\/\/arxiv.org\/abs\/2002.07285."},{"key":"B21","doi-asserted-by":"publisher","DOI":"10.1111\/j.1541-0420.2011.01738.x"},{"key":"B22","volume-title":"Causal Inference: What If","author":"Miguel A","year":"2023"},{"key":"B23","doi-asserted-by":"publisher","DOI":"10.1111\/1467-9868.00389"},{"key":"B24","doi-asserted-by":"publisher","DOI":"10.1002\/sim.2022"},{"key":"B25","first-page":"1392","volume":"33","author":"Neu G","year":"2020","journal-title":"Adv. Neural Inform. Processing Systems"},{"key":"B26","first-page":"1","author":"Neyman J","year":"1979","journal-title":"Sankhy\u0101: Indian J. Statist. Ser. A"},{"key":"B27","doi-asserted-by":"publisher","DOI":"10.1111\/ajps.12597"},{"key":"B28","unstructured":"Precup D (2000) Eligibility traces for off-policy policy evaluation.\n                      Proc. 17th Internat.\n                      (Morgan Kaufmann Publishers Inc., San Francisco, CA), 759\u2013766."},{"key":"B29","first-page":"1232","volume-title":"Proc. Conf. Learn. Theory","author":"Rakhlin A","year":"2014"},{"key":"B30","doi-asserted-by":"publisher","DOI":"10.1016\/0270-0255(86)90088-6"},{"key":"B31","doi-asserted-by":"crossref","unstructured":"Robins JM (2004) Optimal structural nested models for optimal sequential decisions.\n                      Proc. 2nd Seattle Sympos. Biostatist. Analysis Correlated Data\n                      (Springer, Berlin), 189\u2013326.","DOI":"10.1007\/978-1-4419-9076-1_11"},{"key":"B32","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/70.1.41"},{"key":"B33","first-page":"1417","volume-title":"Proc. Conf. Learn. Theory","author":"Russo D","year":"2016"},{"key":"B34","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9781107298019"},{"key":"B35","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.2022.2106868"},{"key":"B36","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.1998.712192"},{"key":"B37","doi-asserted-by":"publisher","DOI":"10.1515\/em-2015-0005"},{"key":"B38","doi-asserted-by":"crossref","unstructured":"Zhan R, Hadad V, Hirshberg DA, Athey S (2021) Off-policy evaluation via adaptive weighting with data from contextual bandits.\n                      Proc. 27th ACM SIGKDD Conf. Knowledge Discovery Data Mining\n                      (Association for Computing Machinery, New York), 2125\u20132135.","DOI":"10.1145\/3447548.3467456"},{"key":"B39","first-page":"7460","volume":"34","author":"Zhang K","year":"2021","journal-title":"Adv. Neural Inform. Processing Systems"}],"container-title":["Operations Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/pubsonline.informs.org\/doi\/pdf\/10.1287\/opre.2024.1019","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,20]],"date-time":"2026-03-20T08:12:47Z","timestamp":1773994367000},"score":1,"resource":{"primary":{"URL":"https:\/\/pubsonline.informs.org\/doi\/10.1287\/opre.2024.1019"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3]]},"references-count":39,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,3]]}},"alternative-id":["10.1287\/opre.2024.1019"],"URL":"https:\/\/doi.org\/10.1287\/opre.2024.1019","relation":{},"ISSN":["0030-364X","1526-5463"],"issn-type":[{"value":"0030-364X","type":"print"},{"value":"1526-5463","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3]]}}}