{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,7]],"date-time":"2025-11-07T09:54:30Z","timestamp":1762509270721,"version":"3.41.2"},"reference-count":87,"publisher":"Association for Computing Machinery (ACM)","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. ACM"],"abstract":"<jats:p>A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a \u201clarge-sample\u201d regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory.<\/jats:p>\n          <jats:p>\n            We settle this problem for finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of\n            <jats:monospace>MVP<\/jats:monospace>\n            (Monotonic Value Propagation), an optimistic model-based algorithm proposed by Zhang et\u00a0al. [82], achieves a regret on the order of (modulo log factors)\n            <jats:disp-formula>\n              <jats:tex-math notation=\"TeX\" version=\"MathJax\">\\begin{equation*} \n\\min \\big \\lbrace \\sqrt {SAH^3K}, \\,HK \\big \\rbrace,\n\\end{equation*}<\/jats:tex-math>\n            <\/jats:disp-formula>\n            where\n            <jats:italic>S<\/jats:italic>\n            is the number of states,\n            <jats:italic>A<\/jats:italic>\n            is the number of actions,\n            <jats:italic>H<\/jats:italic>\n            is the horizon length, and\n            <jats:italic>K<\/jats:italic>\n            is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size\n            <jats:italic>K<\/jats:italic>\n            \u2265 1, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield \u03b5-accuracy) of\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"TeX\" version=\"MathJaX\">\\(\\frac{SAH^3}{\\varepsilon ^2} \\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            up to log factor, which is minimax-optimal for the full \u03b5-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value\/cost and certain variances. 
The key technical innovation lies in a novel analysis paradigm (based on a new concept called \u201cprofiles\u201d) to decouple complicated statistical dependency across the sample trajectories \u2014 a long-standing challenge facing the analysis of online RL in the sample-starved regime.\n          <\/jats:p>","DOI":"10.1145\/3733592","type":"journal-article","created":{"date-parts":[[2025,5,2]],"date-time":"2025-05-02T11:51:18Z","timestamp":1746186678000},"update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Settling the Sample Complexity of Online Reinforcement Learning"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-2061-6801","authenticated-orcid":false,"given":"Zihan","family":"Zhang","sequence":"first","affiliation":[{"name":"Department of Electrical and Computer Engineering, Princeton University, Princeton, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9256-5815","authenticated-orcid":false,"given":"Yuxin","family":"Chen","sequence":"additional","affiliation":[{"name":"Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0064-7800","authenticated-orcid":false,"given":"Jason","family":"Lee","sequence":"additional","affiliation":[{"name":"Department of Electrical and Computer Engineering, Princeton University, Princeton, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0056-8299","authenticated-orcid":false,"given":"Simon S.","family":"Du","sequence":"additional","affiliation":[{"name":"Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, United States"}]}],"member":"320","published-online":{"date-parts":[[2025,5,2]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Conference on Learning Theory. 67\u201383","author":"Agarwal Alekh","year":"2020","unstructured":"Alekh Agarwal, Sham Kakade, and Lin\u00a0F Yang. 2020. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory. 67\u201383."},{"key":"e_1_2_1_2_1","volume-title":"Conference on Learning Theory. 4\u20137.","author":"Agarwal Alekh","year":"2017","unstructured":"Alekh Agarwal, Akshay Krishnamurthy, John Langford, Haipeng Luo, et\u00a0al. 2017. Open problem: First-order regret bounds for contextual bandits. In Conference on Learning Theory. 4\u20137."},{"key":"e_1_2_1_3_1","first-page":"1184","article-title":"Optimistic posterior sampling for reinforcement learning: worst-case regret bounds","volume":"30","author":"Agrawal Shipra","year":"2017","unstructured":"Shipra Agrawal and Randy Jia. 2017. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems 30, I.\u00a0Guyon, U.\u00a0V. Luxburg, S.\u00a0Bengio, H.\u00a0Wallach, R.\u00a0Fergus, S.\u00a0Vishwanathan, and R.\u00a0Garnett (Eds.). Curran Associates, Inc., 1184\u20131194.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_4_1","volume-title":"International Conference on Machine Learning. 186\u2013194","author":"Allen-Zhu Zeyuan","year":"2018","unstructured":"Zeyuan Allen-Zhu, S\u00e9bastien Bubeck, and Yuanzhi Li. 2018. Make the minority great again: First-order regret bound for contextual bandits. In International Conference on Machine Learning. 
186\u2013194."},{"key":"e_1_2_1_5_1","volume-title":"Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning 91, 3","author":"Azar Mohammad\u00a0Gheshlaghi","year":"2013","unstructured":"Mohammad\u00a0Gheshlaghi Azar, R\u00e9mi Munos, and Hilbert\u00a0J Kappen. 2013. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning 91, 3 (2013), 325\u2013349."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 34th International Conference on Machine Learning. 263\u2013272","author":"Azar Mohammad\u00a0Gheshlaghi","year":"2017","unstructured":"Mohammad\u00a0Gheshlaghi Azar, Ian Osband, and R\u00e9mi Munos. 2017. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning. 263\u2013272."},{"key":"e_1_2_1_7_1","unstructured":"Yu Bai Tengyang Xie Nan Jiang and Yu-Xiang Wang. 2019. Provably efficient Q-learning with low switching cost. In Advances in Neural Information Processing Systems. 8004\u20138013."},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009)","author":"Bartlett L","year":"2009","unstructured":"Peter\u00a0L Bartlett and Ambuj Tewari. 2009. REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009))."},{"key":"e_1_2_1_9_1","volume-title":"Error bounds for constant step-size Q-learning. Systems & control letters 61, 12","author":"Beck L","year":"2012","unstructured":"Carolyn\u00a0L Beck and Rayadurgam Srikant. 2012. Error bounds for constant step-size Q-learning. Systems & control letters 61, 12 (2012), 1203\u20131208."},{"volume-title":"Reinforcement learning and optimal control","author":"Bertsekas Dimitri","key":"e_1_2_1_10_1","unstructured":"Dimitri Bertsekas. 2019. Reinforcement learning and optimal control. Athena Scientific."},{"key":"e_1_2_1_11_1","article-title":"R-max - a General Polynomial Time Algorithm\u00a0for Near-optimal Reinforcement Learning","author":"Brafman I.","year":"2003","unstructured":"Ronen\u00a0I. Brafman and Moshe Tennenholtz. 2003. R-max - a General Polynomial Time Algorithm\u00a0for Near-optimal Reinforcement Learning. J. Mach. Learn. Res. 3, Oct (March 2003), 213\u2013231.","journal-title":"J. Mach. Learn. Res. 3"},{"key":"e_1_2_1_12_1","unstructured":"Qi Cai Zhuoran Yang Chi Jin and Zhaoran Wang. 2019. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830(2019)."},{"key":"e_1_2_1_13_1","volume-title":"Implicit finite-horizon approximation and efficient optimal algorithms for stochastic shortest path. Advances in Neural Information Processing Systems 34","author":"Chen Liyu","year":"2021","unstructured":"Liyu Chen, Mehdi Jafarnia-Jahromi, Rahul Jain, and Haipeng Luo. 2021. Implicit finite-horizon approximation and efficient optimal algorithms for stochastic shortest path. Advances in Neural Information Processing Systems 34 (2021)."},{"key":"e_1_2_1_14_1","first-page":"8223","article-title":"Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes","volume":"33","author":"Chen Zaiwei","year":"2020","unstructured":"Zaiwei Chen, Siva\u00a0Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. 2020. Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. 
Advances in Neural Information Processing Systems 33 (2020), 8223\u20138234.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_15_1","unstructured":"Qiwen Cui and Lin\u00a0F Yang. 2021. Minimax sample complexity for turn-based stochastic game. In Uncertainty in Artificial Intelligence. 1496\u20131504."},{"key":"e_1_2_1_16_1","unstructured":"Christoph Dann and Emma Brunskill. 2015. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems. 2818\u20132826."},{"key":"e_1_2_1_17_1","volume-title":"Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems 30","author":"Dann Christoph","year":"2017","unstructured":"Christoph Dann, Tor Lattimore, and Emma Brunskill. 2017. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems 30 (2017)."},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the 36th International Conference on Machine Learning. 1507\u20131516","author":"Dann Christoph","year":"2019","unstructured":"Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. 2019. Policy Certificates: Towards Accountable Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning. 1507\u20131516."},{"key":"e_1_2_1_19_1","first-page":"1","article-title":"Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning","volume":"34","author":"Dann Christoph","year":"2021","unstructured":"Christoph Dann, Teodor\u00a0Vanislavov Marinov, Mehryar Mohri, and Julian Zimmert. 2021. Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems 34 (2021), 1\u201312.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_20_1","unstructured":"Omar\u00a0Darwiche Domingues Pierre M\u00e9nard Emilie Kaufmann and Michal Valko. 2021. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory. 578\u2013598."},{"key":"e_1_2_1_21_1","unstructured":"Kefan Dong Yuanhao Wang Xiaoyu Chen and Liwei Wang. 2019. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv preprint arXiv:1901.09311(2019)."},{"key":"e_1_2_1_22_1","volume-title":"Tight regret bounds for model-based reinforcement learning with greedy policies. Advances in Neural Information Processing Systems 32","author":"Efroni Yonathan","year":"2019","unstructured":"Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. 2019. Tight regret bounds for model-based reinforcement learning with greedy policies. Advances in Neural Information Processing Systems 32 (2019)."},{"key":"e_1_2_1_23_1","first-page":"1","article-title":"Learning rates for Q-learning","author":"Even-Dar Eyal","year":"2003","unstructured":"Eyal Even-Dar and Yishay Mansour. 2003. Learning rates for Q-learning. Journal of Machine Learning Research 5, Dec (2003), 1\u201325.","journal-title":"Journal of Machine Learning Research 5"},{"key":"e_1_2_1_24_1","volume-title":"On tail probabilities for martingales. the Annals of Probability 3, 1","author":"Freedman A","year":"1975","unstructured":"David\u00a0A Freedman. 1975. On tail probabilities for martingales. 
the Annals of Probability 3, 1 (1975), 100\u2013118."},{"key":"e_1_2_1_25_1","volume-title":"Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning. In ICML 2018-The 35th International Conference on Machine Learning, Vol.\u00a0 80","author":"Fruit Ronan","year":"2018","unstructured":"Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. 2018. Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning. In ICML 2018-The 35th International Conference on Machine Learning, Vol.\u00a0 80. 1578\u20131586."},{"key":"e_1_2_1_26_1","first-page":"1563","article-title":"Near-optimal regret bounds for reinforcement learning","author":"Jaksch Thomas","year":"2010","unstructured":"Thomas Jaksch, Ronald Ortner, and Peter Auer. 2010. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11, Apr (2010), 1563\u20131600.","journal-title":"Journal of Machine Learning Research 11"},{"key":"e_1_2_1_27_1","volume-title":"Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time. Advances in neural information processing systems","author":"Ji Xiang","year":"2023","unstructured":"Xiang Ji and Gen Li. 2023. Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time. Advances in neural information processing systems (2023)."},{"key":"e_1_2_1_28_1","volume-title":"Conference On Learning Theory. 3395\u20133398","author":"Jiang Nan","year":"2018","unstructured":"Nan Jiang and Alekh Agarwal. 2018. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory. 3395\u20133398."},{"key":"e_1_2_1_29_1","unstructured":"Chi Jin Zeyuan Allen-Zhu Sebastien Bubeck and Michael\u00a0I Jordan. 2018. Is Q-learning provably efficient?. In Advances in Neural Information Processing Systems. 4863\u20134873."},{"key":"e_1_2_1_30_1","volume-title":"Reward-Free Exploration for Reinforcement Learning. International Conference on Machine Learning","author":"Jin Chi","year":"2020","unstructured":"Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. 2020. Reward-Free Exploration for Reinforcement Learning. International Conference on Machine Learning (2020)."},{"key":"e_1_2_1_31_1","volume-title":"International Conference on Machine Learning. 5084\u20135096","author":"Jin Ying","year":"2021","unstructured":"Ying Jin, Zhuoran Yang, and Zhaoran Wang. 2021. Is pessimism provably efficient for offline RL?. In International Conference on Machine Learning. 5084\u20135096."},{"volume-title":"On the sample complexity of reinforcement learning. Ph.\u00a0D. Dissertation","author":"Kakade M","key":"e_1_2_1_32_1","unstructured":"Sham\u00a0M Kakade. 2003. On the sample complexity of reinforcement learning. Ph.\u00a0D. Dissertation. University of London London, England."},{"key":"e_1_2_1_33_1","volume-title":"Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in neural information processing systems 11","author":"Kearns Michael","year":"1998","unstructured":"Michael Kearns and Satinder Singh. 1998. Finite-sample convergence rates for Q-learning and indirect algorithms. Advances in neural information processing systems 11 (1998)."},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the Fifteenth International Conference on Machine Learning. 260\u2013268","author":"Kearns J","year":"1998","unstructured":"Michael\u00a0J Kearns and Satinder\u00a0P Singh. 1998. 
Near-Optimal Reinforcement Learning in Polynomial Time. In Proceedings of the Fifteenth International Conference on Machine Learning. 260\u2013268."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553441"},{"key":"e_1_2_1_36_1","volume-title":"International Conference on Algorithmic Learning Theory. Springer, 320\u2013334","author":"Lattimore Tor","year":"2012","unstructured":"Tor Lattimore and Marcus Hutter. 2012. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory. Springer, 320\u2013334."},{"key":"e_1_2_1_37_1","volume-title":"Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in neural information processing systems 33","author":"Lee Chung-Wei","year":"2020","unstructured":"Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. 2020. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in neural information processing systems 33 (2020), 15522\u201315533."},{"key":"e_1_2_1_38_1","unstructured":"Sergey Levine Aviral Kumar George Tucker and Justin Fu. 2020. Offline reinforcement learning: Tutorial review and perspectives on open problems. arXiv preprint arXiv:2005.01643(2020)."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1287\/opre.2023.2450"},{"key":"e_1_2_1_40_1","first-page":"15353","article-title":"Minimax-optimal multi-agent RL in Markov games with a generative model","volume":"35","author":"Li Gen","year":"2022","unstructured":"Gen Li, Yuejie Chi, Yuting Wei, and Yuxin Chen. 2022. Minimax-optimal multi-agent RL in Markov games with a generative model. Advances in Neural Information Processing Systems 35 (2022), 15353\u201315367.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_41_1","first-page":"233","article-title":"Settling the sample complexity of model-based offline reinforcement learning","volume":"52","author":"Li Gen","year":"2024","unstructured":"Gen Li, Laixi Shi, Yuxin Chen, Yuejie Chi, and Yuting Wei. 2024. Settling the sample complexity of model-based offline reinforcement learning. The Annals of Statistics 52, 1 (2024), 233\u2013260.","journal-title":"The Annals of Statistics"},{"key":"e_1_2_1_42_1","volume-title":"Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Advances in Neural Information Processing Systems 34","author":"Li Gen","year":"2021","unstructured":"Gen Li, Laixi Shi, Yuxin Chen, Yuantao Gu, and Yuejie Chi. 2021. Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Advances in Neural Information Processing Systems 34 (2021)."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1287\/opre.2023.2451"},{"key":"e_1_2_1_44_1","first-page":"448","article-title":"Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction","volume":"68","author":"Li Gen","year":"2021","unstructured":"Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. 2021. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Transactions on Information Theory 68, 1 (2021), 448\u2013473.","journal-title":"IEEE Transactions on Information Theory"},{"key":"e_1_2_1_45_1","volume-title":"Conference on Learning Theory (COLT)(2024)","author":"Li Gen","year":"2024","unstructured":"Gen Li, Yuling Yan, Yuxin Chen, and Jianqing Fan. 2024. Minimax-optimal reward-agnostic exploration in reinforcement learning. 
Conference on Learning Theory (COLT)(2024)."},{"key":"e_1_2_1_46_1","volume-title":"Reward-agnostic fine-tuning: Provable statistical benefits of hybrid reinforcement learning. Advances in Neural Information Processing Systems 36","author":"Li Gen","year":"2024","unstructured":"Gen Li, Wenhao Zhan, Jason\u00a0D Lee, Yuejie Chi, and Yuxin Chen. 2024. Reward-agnostic fine-tuning: Provable statistical benefits of hybrid reinforcement learning. Advances in Neural Information Processing Systems 36 (2024)."},{"key":"e_1_2_1_47_1","volume-title":"Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning. In IEEE Symposium on Foundations of Computer Science.","author":"Li Yuanzhi","year":"2021","unstructured":"Yuanzhi Li, Ruosong Wang, and Lin\u00a0F Yang. 2021. Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning. In IEEE Symposium on Foundations of Computer Science."},{"key":"e_1_2_1_48_1","volume-title":"Conference on Learning Theory.","author":"Maurer Andreas","year":"2009","unstructured":"Andreas Maurer and Massimiliano Pontil. 2009. Empirical Bernstein bounds and sample variance penalization. In Conference on Learning Theory."},{"key":"e_1_2_1_49_1","volume-title":"International Conference on Machine Learning. 7609\u20137618","author":"M\u00e9nard Pierre","year":"2021","unstructured":"Pierre M\u00e9nard, Omar\u00a0Darwiche Domingues, Xuedong Shang, and Michal Valko. 2021. UCB Momentum Q-learning: Correcting the bias without forgetting. In International Conference on Machine Learning. 7609\u20137618."},{"key":"e_1_2_1_50_1","unstructured":"Gergely Neu and Ciara Pike-Burke. 2020. A Unifying View of Optimism in Episodic Reinforcement Learning. arXiv preprint arXiv:2007.01891(2020)."},{"key":"e_1_2_1_51_1","unstructured":"Ian Osband Daniel Russo and Benjamin Van\u00a0Roy. 2013. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems. 3003\u20133011."},{"key":"e_1_2_1_52_1","unstructured":"Aldo Pacchiano Philip Ball Jack Parker-Holder Krzysztof Choromanski and Stephen Roberts. 2020. On Optimism in Model-Based Reinforcement Learning. arXiv preprint arXiv:2006.11911(2020)."},{"key":"e_1_2_1_53_1","doi-asserted-by":"crossref","first-page":"566","DOI":"10.1109\/TIT.2020.3027316","article-title":"Instance-dependent \u2113\u221e-bounds for policy evaluation in tabular reinforcement learning","volume":"67","author":"Pananjady Ashwin","year":"2020","unstructured":"Ashwin Pananjady and Martin\u00a0J Wainwright. 2020. Instance-dependent \u2113\u221e-bounds for policy evaluation in tabular reinforcement learning. IEEE Transactions on Information Theory 67, 1 (2020), 566\u2013585.","journal-title":"IEEE Transactions on Information Theory"},{"key":"e_1_2_1_54_1","volume-title":"Proceedings of Thirty Third Conference on Learning Theory (COLT).","author":"Qu Guannan","year":"2020","unstructured":"Guannan Qu and Adam Wierman. 2020. Finite-Time Analysis of Asynchronous Stochastic Approximation and Q-Learning. In Proceedings of Thirty Third Conference on Learning Theory (COLT)."},{"key":"e_1_2_1_55_1","first-page":"11702","article-title":"Bridging offline reinforcement learning and imitation learning: A tale of pessimism","volume":"34","author":"Rashidinejad Paria","year":"2021","unstructured":"Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. 2021. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. 
Advances in Neural Information Processing Systems 34 (2021), 11702\u201311716.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_56_1","volume-title":"Nearly horizon-free offline reinforcement learning. Advances in neural information processing systems 34","author":"Ren Tongzheng","year":"2021","unstructured":"Tongzheng Ren, Jialian Li, Bo Dai, Simon\u00a0S Du, and Sujay Sanghavi. 2021. Nearly horizon-free offline reinforcement learning. Advances in neural information processing systems 34 (2021), 15621\u201315634."},{"key":"e_1_2_1_57_1","unstructured":"Daniel Russo. 2019. Worst-case regret bounds for exploration via randomized value functions. In Advances in Neural Information Processing Systems. 14433\u201314443."},{"key":"e_1_2_1_58_1","volume-title":"International Conference on Machine Learning. 19967\u201320025","author":"Shi Laixi","year":"2022","unstructured":"Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, and Yuejie Chi. 2022. Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. In International Conference on Machine Learning. 19967\u201320025."},{"key":"e_1_2_1_59_1","volume-title":"The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model. Advances in Neural Information Processing Systems","author":"Shi Laixi","year":"2023","unstructured":"Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Matthieu Geist, and Yuejie Chi. 2023. The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model. Advances in Neural Information Processing Systems (2023)."},{"key":"e_1_2_1_60_1","unstructured":"Aaron Sidford Mengdi Wang Xian Wu Lin Yang and Yinyu Ye. 2018. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems. 5186\u20135196. arXiv:1806.01492"},{"key":"e_1_2_1_61_1","volume-title":"Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 770\u2013787","author":"Sidford Aaron","year":"2018","unstructured":"Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. 2018. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 770\u2013787."},{"key":"e_1_2_1_62_1","unstructured":"Max Simchowitz and Kevin\u00a0G Jamieson. 2019. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems. 1153\u20131162."},{"key":"e_1_2_1_63_1","volume-title":"Proceedings of the 23rd international conference on Machine learning. ACM, 881\u2013888","author":"Strehl L","year":"2006","unstructured":"Alexander\u00a0L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael\u00a0L Littman. 2006. PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning. ACM, 881\u2013888."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcss.2007.08.009"},{"key":"e_1_2_1_65_1","unstructured":"Istv\u00e1n Szita and Csaba Szepesv\u00e1ri. 2010. Model-based reinforcement learning with nearly tight exploration complexity bounds. In ICML."},{"key":"e_1_2_1_66_1","unstructured":"Mohammad\u00a0Sadegh Talebi and Odalric-Ambrym Maillard. 2018. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. 
arXiv preprint arXiv:1803.01626(2018)."},{"key":"e_1_2_1_67_1","volume-title":"Stochastic shortest path: Minimax, parameter-free and towards horizon-free regret. Advances in Neural Information Processing Systems 34","author":"Tarbouriech Jean","year":"2021","unstructured":"Jean Tarbouriech, Runlong Zhou, Simon\u00a0S Du, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. 2021. Stochastic shortest path: Minimax, parameter-free and towards horizon-free regret. Advances in Neural Information Processing Systems 34 (2021)."},{"key":"e_1_2_1_68_1","unstructured":"Andrea Tirinzoni Matteo Pirotta and Alessandro Lazaric. 2021. A fully problem-dependent regret lower bound for finite-horizon MDPs. arXiv preprint arXiv:2106.13013(2021)."},{"key":"e_1_2_1_69_1","volume-title":"International Conference on Machine Learning. 22384\u201322429","author":"Wagenmaker J","year":"2022","unstructured":"Andrew\u00a0J Wagenmaker, Yifang Chen, Max Simchowitz, Simon Du, and Kevin Jamieson. 2022. First-order regret in reinforcement learning with linear function approximation: A robust estimation approach. In International Conference on Machine Learning. 22384\u201322429."},{"key":"e_1_2_1_70_1","unstructured":"Martin\u00a0J Wainwright. 2019. Stochastic approximation with cone-contractive operators: Sharp \u2113\u221e-bounds for Q-learning. arXiv preprint arXiv:1905.06265(2019)."},{"key":"e_1_2_1_71_1","unstructured":"Martin\u00a0J Wainwright. 2019. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697(2019)."},{"key":"e_1_2_1_72_1","unstructured":"Kaiwen Wang Kevin Zhou Runzhe Wu Nathan Kallus and Wen Sun. 2023. The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning. arXiv preprint arXiv:2305.15703(2023)."},{"key":"e_1_2_1_73_1","unstructured":"Ruosong Wang Simon\u00a0S Du Lin\u00a0F Yang and Sham\u00a0M Kakade. 2020. Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?. In Advances in Neural Information Processing Systems."},{"key":"e_1_2_1_74_1","first-page":"14865","article-title":"On gap-dependent bounds for offline reinforcement learning","volume":"35","author":"Wang Xinqi","year":"2022","unstructured":"Xinqi Wang, Qiwen Cui, and Simon\u00a0S Du. 2022. On gap-dependent bounds for offline reinforcement learning. Advances in Neural Information Processing Systems 35 (2022), 14865\u201314877.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_75_1","volume-title":"Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems 34","author":"Xie Tengyang","year":"2021","unstructured":"Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. 2021. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems 34 (2021), 27395\u201327407."},{"key":"e_1_2_1_76_1","first-page":"6358","article-title":"Near-Optimal Randomized Exploration for Tabular Markov Decision Processes","volume":"35","author":"Xiong Zhihan","year":"2022","unstructured":"Zhihan Xiong, Ruoqi Shen, Qiwen Cui, Maryam Fazel, and Simon\u00a0S Du. 2022. Near-Optimal Randomized Exploration for Tabular Markov Decision Processes. Advances in Neural Information Processing Systems 35 (2022), 6358\u20136371.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_77_1","volume-title":"Conference on Learning Theory. 
4438\u20134472","author":"Xu Haike","year":"2021","unstructured":"Haike Xu, Tengyu Ma, and Simon Du. 2021. Fine-grained gap-dependent bounds for tabular MDPs via adaptive multi-step bootstrap. In Conference on Learning Theory. 4438\u20134472."},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.2023.3299840"},{"key":"e_1_2_1_79_1","volume-title":"Q-learning with Logarithmic Regret. In International Conference on Artificial Intelligence and Statistics. 1576\u20131584","author":"Yang Kunhe","year":"2021","unstructured":"Kunhe Yang, Lin Yang, and Simon Du. 2021. Q-learning with Logarithmic Regret. In International Conference on Artificial Intelligence and Statistics. 1576\u20131584."},{"key":"e_1_2_1_80_1","unstructured":"Ming Yin Yaqi Duan Mengdi Wang and Yu-Xiang Wang. 2022. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. arXiv preprint arXiv:2203.05804(2022)."},{"key":"e_1_2_1_81_1","volume-title":"International Conference on Machine Learning. 7304\u20137312","author":"Zanette Andrea","year":"2019","unstructured":"Andrea Zanette and Emma Brunskill. 2019. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds. In International Conference on Machine Learning. 7304\u20137312."},{"key":"e_1_2_1_82_1","volume-title":"Conference on Learning Theory. 4528\u20134531","author":"Zhang Zihan","year":"2021","unstructured":"Zihan Zhang, Xiangyang Ji, and Simon Du. 2021. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. In Conference on Learning Theory. 4528\u20134531."},{"key":"e_1_2_1_83_1","volume-title":"Conference on Learning Theory. 3858\u20133904","author":"Zhang Zihan","year":"2022","unstructured":"Zihan Zhang, Xiangyang Ji, and Simon Du. 2022. Horizon-free reinforcement learning in polynomial time: the power of stationary policies. In Conference on Learning Theory. 3858\u20133904."},{"key":"e_1_2_1_84_1","unstructured":"Zihan Zhang Yuan Zhou and Xiangyang Ji. 2020. Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition. In Advances in Neural Information Processing Systems."},{"key":"e_1_2_1_85_1","volume-title":"International Conference on Machine Learning. 12653\u201312662","author":"Zhang Zihan","year":"2021","unstructured":"Zihan Zhang, Yuan Zhou, and Xiangyang Ji. 2021. Model-free reinforcement learning: from clipped pseudo-regret to sample complexity. In International Conference on Machine Learning. 12653\u201312662."},{"key":"e_1_2_1_86_1","unstructured":"Heyang Zhao Jiafan He Dongruo Zhou Tong Zhang and Quanquan Gu. 2023. Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency. arXiv preprint arXiv:2302.10371(2023)."},{"key":"e_1_2_1_87_1","volume-title":"International Conference on Machine Learning. 42878\u201342914","author":"Zhou Runlong","year":"2023","unstructured":"Runlong Zhou, Zhang Zihan, and Simon\u00a0Shaolei Du. 2023. Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments. In International Conference on Machine Learning. 
42878\u201342914."}],"container-title":["Journal of the ACM"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3733592","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,2]],"date-time":"2025-05-02T11:52:14Z","timestamp":1746186734000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3733592"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,2]]},"references-count":87,"alternative-id":["10.1145\/3733592"],"URL":"https:\/\/doi.org\/10.1145\/3733592","relation":{},"ISSN":["0004-5411","1557-735X"],"issn-type":[{"type":"print","value":"0004-5411"},{"type":"electronic","value":"1557-735X"}],"subject":[],"published":{"date-parts":[[2025,5,2]]},"assertion":[{"value":"2024-06-11","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-04-23","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-02","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"3733592"}}