{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,21]],"date-time":"2026-01-21T12:49:14Z","timestamp":1768999754950,"version":"3.49.0"},"reference-count":59,"publisher":"Springer Science and Business Media LLC","issue":"9","license":[{"start":{"date-parts":[[2024,6,7]],"date-time":"2024-06-07T00:00:00Z","timestamp":1717718400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,6,7]],"date-time":"2024-06-07T00:00:00Z","timestamp":1717718400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["UIDB\/50021\/2020"],"award-info":[{"award-number":["UIDB\/50021\/2020"]}]},{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["PTDC\/CCI-COM\/5060\/2021"],"award-info":[{"award-number":["PTDC\/CCI-COM\/5060\/2021"]}]},{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["PTDC\/CCI-COM\/7203\/2020"],"award-info":[{"award-number":["PTDC\/CCI-COM\/7203\/2020"]}]},{"DOI":"10.13039\/501100007601","name":"Horizon 2020","doi-asserted-by":"publisher","award":["952215"],"award-info":[{"award-number":["952215"]}],"id":[{"id":"10.13039\/501100007601","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000181","name":"Air Force Office of Scientific Research","doi-asserted-by":"publisher","award":["FA9550-22-1-0475"],"award-info":[{"award-number":["FA9550-22-1-0475"]}],"id":[{"id":"10.13039\/100000181","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["2021.04684.BD"],"award-info":[{"award-number":["2021.04684.BD"]}]},{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["2020.05360.BD"],"award-info":[{"award-number":["2020.05360.BD"]}]},{"DOI":"10.13039\/501100005765","name":"Universidade de Lisboa","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005765","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2024,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We study the interplay between the data distribution and <jats:italic>Q<\/jats:italic>-learning-based algorithms with function approximation. We provide a unified theoretical and empirical analysis as to how different properties of the data distribution influence the performance of <jats:italic>Q<\/jats:italic>-learning-based algorithms. We connect different lines of research, as well as validate and extend previous results, being primarily focused on offline settings. First, we analyze the impact of the data distribution by using optimization as a tool to better understand which data distributions yield low concentrability coefficients. We motivate high-entropy distributions from a game-theoretical point of view and propose an algorithm to find the optimal data distribution from the point of view of concentrability. Second, from an empirical perspective, we introduce a novel four-state MDP specifically tailored to highlight the impact of the data distribution in the performance of <jats:italic>Q<\/jats:italic>-learning-based algorithms with function approximation. Finally, we experimentally assess the impact of the data distribution properties on the performance of two offline <jats:italic>Q<\/jats:italic>-learning-based algorithms under different environments. Our results attest to the importance of different properties of the data distribution such as entropy, coverage, and data quality (closeness to optimal policy).<\/jats:p>","DOI":"10.1007\/s10994-024-06564-5","type":"journal-article","created":{"date-parts":[[2024,6,7]],"date-time":"2024-06-07T17:01:54Z","timestamp":1717779714000},"page":"6141-6163","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["The impact of data distribution on Q-learning with function approximation"],"prefix":"10.1007","volume":"113","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4587-9528","authenticated-orcid":false,"given":"Pedro P.","family":"Santos","sequence":"first","affiliation":[]},{"given":"Diogo S.","family":"Carvalho","sequence":"additional","affiliation":[]},{"given":"Alberto","family":"Sardinha","sequence":"additional","affiliation":[]},{"given":"Francisco S.","family":"Melo","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,6,7]]},"reference":[{"key":"6564_CR1","unstructured":"Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., & Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems."},{"key":"6564_CR2","unstructured":"Agarwal, R., Schuurmans, D., & Norouzi, M. (2019). An optimistic perspective on offline reinforcement learning. CoRR arxiv:1907.04543"},{"key":"6564_CR3","unstructured":"Al-Marjani, A., Tirinzoni, A., & Kaufmann, E. (2023). Active coverage for pac reinforcement learning."},{"key":"6564_CR4","unstructured":"Amortila, P., Jiang, N., & Xie, T. (2020). A variant of the Wang-Foster-Kakade lower bound for the discounted setting. CoRR arxiv:2011.01075."},{"key":"6564_CR5","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1007\/s10994-007-5038-2","volume":"71","author":"A Antos","year":"2008","unstructured":"Antos, A., Szepesvari, C., & Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71, 89\u2013129.","journal-title":"Machine Learning"},{"key":"6564_CR6","doi-asserted-by":"crossref","unstructured":"Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866.","DOI":"10.1109\/MSP.2017.2743240"},{"key":"6564_CR7","doi-asserted-by":"crossref","unstructured":"Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the twelfth international conference on machine learning (pp. 30\u201337). Morgan Kaufmann.","DOI":"10.1016\/B978-1-55860-377-6.50013-X"},{"key":"6564_CR8","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511804441","volume-title":"Convex optimization","author":"S Boyd","year":"2004","unstructured":"Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press."},{"key":"6564_CR9","first-page":"19412","volume":"33","author":"D Carvalho","year":"2020","unstructured":"Carvalho, D., Melo, F. S., & Santos, P. (2020). A new convergent variant of q-learning with linear function approximation. Advances in Neural Information Processing Systems, 33, 19412\u201319421.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6564_CR10","unstructured":"Chen, J., & Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. CoRR arxiv:1905.00360."},{"key":"6564_CR11","unstructured":"Farahmand, A. M., Szepesv\u00e1ri, C., & Munos, R. (2010). Error propagation for approximate policy and value iteration. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems (Vol. 23)."},{"key":"6564_CR12","unstructured":"Fu, J., Kumar, A., Nachum, O., Tucker, G., & Levine, S. (2020). D4RL: Datasets for deep data-driven reinforcement learning. CoRR arxiv:2004.07219."},{"key":"6564_CR13","unstructured":"Fu, J., Kumar, A., Soh, M., & Levine, S. (2019). Diagnosing bottlenecks in deep q-learning algorithms. CoRR arxiv:1902.10250."},{"key":"6564_CR14","unstructured":"G\u00fcl\u00e7ehre, \u00c7., Wang, Z., Novikov, A., Paine, T. L., Colmenarejo, S. G., Zolna, K., & de Freitas, N. (2020). RL unplugged: Benchmarks for offline reinforcement learning. CoRR arxiv:2006.13888."},{"key":"6564_CR15","unstructured":"Hazan, E., Kakade, S. M., Singh, K., & Soest, A. V. (2018). Provably efficient maximum entropy exploration. CoRR arxiv:1812.02690."},{"key":"6564_CR16","doi-asserted-by":"publisher","first-page":"62","DOI":"10.1007\/BF01580223","volume":"6","author":"M Held","year":"1974","unstructured":"Held, M., Wolfe, P., & Crowder, H. P. (1974). Validation of subgradient optimization. Mathematical Programming, 6, 62\u201388.","journal-title":"Mathematical Programming"},{"key":"6564_CR17","unstructured":"Hoffman, M., Shahriari, B., Aslanides, J., Barth-Maron, G., Behbahani, F., Norman, T., & de Freitas, N. (2020). Acme: A research framework for distributed reinforcement learning."},{"key":"6564_CR18","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4615-0015-5","volume-title":"Introduction to global optimization","author":"R Horst","year":"2000","unstructured":"Horst, R., Pardalos, P., & Van Thoai, N. (2000). Introduction to global optimization. Springer US."},{"key":"6564_CR19","unstructured":"Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., & Schapire, R. E. (2016). Contextual decision processes with low bellman rank are pac-learnable. CoRR arxiv:1610.09512."},{"key":"6564_CR20","unstructured":"Jin, C., Liu, Q., & Miryoosefi, S. (2021). Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. CoRR arxiv:2102.00815."},{"key":"6564_CR21","unstructured":"Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. In Proceedings 19th international conference on machine learning (pp. 267\u2013274)."},{"key":"6564_CR22","unstructured":"Kolter, J. (2011). The fixed points of off-policy td. In Advances in neural information processing systems (Vol. 24)."},{"key":"6564_CR23","unstructured":"Kumar, A., Fu, J., Tucker, G., & Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. CoRR arxiv:1906.00949."},{"key":"6564_CR24","unstructured":"Kumar, A., Gupta, A., & Levine, S. (2020). Discor: Corrective feedback in reinforcement learning via distribution correction. CoRR arxiv:2003.07305."},{"key":"6564_CR25","unstructured":"Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative q-learning for offline reinforcement learning. CoRR arxiv:2006.04779."},{"key":"6564_CR26","unstructured":"Lambert, N., Wulfmeier, M., Whitney, W. F., Byravan, A., Bloesch, M., Dasagi, V., & Riedmiller, M. A. (2022). The challenges of exploration for offline reinforcement learning. CoRR arxiv:2201.11861."},{"issue":"98","key":"6564_CR27","first-page":"3041","volume":"13","author":"A Lazaric","year":"2012","unstructured":"Lazaric, A., Ghavamzadeh, M., & Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(98), 3041\u20133074.","journal-title":"Journal of Machine Learning Research"},{"issue":"19","key":"6564_CR28","first-page":"1","volume":"17","author":"A Lazaric","year":"2016","unstructured":"Lazaric, A., Ghavamzadeh, M., & Munos, R. (2016). Analysis of classification-based policy iteration algorithms. Journal of Machine Learning Research, 17(19), 1\u201330.","journal-title":"Journal of Machine Learning Research"},{"key":"6564_CR29","unstructured":"Lee, L., Eysenbach, B., Parisotto, E., Xing, E. P., Levine, S., & Salakhutdinov, R. (2019). Efficient exploration via state marginal matching. CoRR arxiv:1906.05274."},{"key":"6564_CR30","unstructured":"Levine, S., Kumar, A., Tucker, G., & Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. CoRR arxiv:2005.01643."},{"issue":"10","key":"6564_CR31","doi-asserted-by":"publisher","first-page":"4394","DOI":"10.1109\/TIT.2006.881731","volume":"52","author":"F Liese","year":"2006","unstructured":"Liese, F., & Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10), 4394\u20134412.","journal-title":"IEEE Transactions on Information Theory"},{"key":"6564_CR32","unstructured":"Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., & Wierstra, D. (2016). Continuous control with deep reinforcement learning. CoRR arxiv:1509.02971."},{"key":"6564_CR33","unstructured":"Liu, V., Kumaraswamy, R., Le, L., & White, M. (2018). The utility of sparse representations for control in reinforcement learning. CoRR arxiv:1811.06626."},{"key":"6564_CR34","unstructured":"Mandlekar, A., Xu, D., Wong, J., Nasiriany, S., Wang, C., Kulkarni, R., & Mart\u00edn-Mart\u00edn, R. (2021). What matters in learning from offline human demonstrations for robot manipulation. CoRR arxiv:2108.03298."},{"issue":"3","key":"6564_CR35","doi-asserted-by":"publisher","first-page":"455","DOI":"10.1137\/0607052","volume":"7","author":"OL Mangasarian","year":"1986","unstructured":"Mangasarian, O. L., & Shiau, T. H. (1986). A variable-complexity norm maximization problem. SIAM Journal on Algebraic Discrete Methods, 7(3), 455\u2013461.","journal-title":"SIAM Journal on Algebraic Discrete Methods"},{"issue":"7540","key":"6564_CR36","doi-asserted-by":"publisher","first-page":"529","DOI":"10.1038\/nature14236","volume":"518","author":"V Mnih","year":"2015","unstructured":"Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2015). Playing atari with deep reinforcement learning. Nature, 518(7540), 529\u2013533.","journal-title":"Nature"},{"key":"6564_CR37","unstructured":"Munos, R. (2003). Error bounds for approximate policy iteration. In International conference on machine learning (Vol. 3, pp. 560\u2013567)."},{"key":"6564_CR38","unstructured":"Munos, R. (2005). Error bounds for approximate value iteration. In Aaai conference on artificial intelligence (pp. 1006\u20131011)."},{"issue":"2","key":"6564_CR39","doi-asserted-by":"publisher","first-page":"541","DOI":"10.1137\/040614384","volume":"46","author":"R Munos","year":"2007","unstructured":"Munos, R. (2007). Performance bounds in $$L_p$$ norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2), 541\u2013561.","journal-title":"SIAM Journal on Control and Optimization"},{"issue":"27","key":"6564_CR40","first-page":"815","volume":"9","author":"R Munos","year":"2008","unstructured":"Munos, R., & Szepesv\u00e1ri, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(27), 815\u2013857.","journal-title":"Journal of Machine Learning Research"},{"key":"6564_CR41","volume-title":"Markov decision processes: Discrete stochastic dynamic programming","author":"ML Puterman","year":"2014","unstructured":"Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons."},{"key":"6564_CR42","unstructured":"Qin, R., Gao, S., Zhang, X., Xu, Z., Huang, S., Li, Z., & Yu, Y. (2021). Neorl: A near real-world benchmark for offline reinforcement learning. CoRR arxiv:2102.00714."},{"key":"6564_CR43","doi-asserted-by":"crossref","unstructured":"Riedmiller, M. (2005). Neural fitted q iteration \u2013 first experiences with a data efficient neural reinforcement learning method. In Machine learning: Ecml 2005 (pp. 317\u2013328).","DOI":"10.1007\/11564096_32"},{"key":"6564_CR44","unstructured":"Schweighofer, K., Hofmarcher, M., Dinu, M., Renz, P., Bitto-Nemling, A., Patil, V. P., & Hochreiter, S. (2021). Understanding the effects of dataset characteristics on offline reinforcement learning. CoRR arxiv:2111.04714."},{"issue":"7676","key":"6564_CR45","doi-asserted-by":"publisher","first-page":"354","DOI":"10.1038\/nature24270","volume":"550","author":"D Silver","year":"2017","unstructured":"Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., & Hassabis, D. (2017). Mastering the game of go without human knowledge. Nature, 550(7676), 354\u2013359.","journal-title":"Nature"},{"key":"6564_CR46","volume-title":"Reinforcement learning: An introduction","author":"R Sutton","year":"2018","unstructured":"Sutton, R., & Barto, A. (2018). Reinforcement learning: An introduction (2nd ed.). The MIT Press.","edition":"2"},{"key":"6564_CR47","unstructured":"Tosatto, S., Pirotta, M., D\u2019Eramo, C., & Restelli, M. (2017). Boosted fitted q-iteration. In Proceedings of the 34th international conference on machine learning (pp. 3434\u20133443)."},{"issue":"5","key":"6564_CR48","doi-asserted-by":"publisher","first-page":"674","DOI":"10.1109\/9.580874","volume":"42","author":"J Tsitsiklis","year":"1997","unstructured":"Tsitsiklis, J., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674\u2013690.","journal-title":"IEEE Transactions on Automatic Control"},{"issue":"1","key":"6564_CR49","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1007\/BF00114724","volume":"22","author":"JN Tsitsiklis","year":"1996","unstructured":"Tsitsiklis, J. N., & van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22(1), 59\u201394.","journal-title":"Machine Learning"},{"key":"6564_CR50","unstructured":"Uehara, M., & Sun, W. (2021). Pessimistic model-based offline RL: PAC bounds and posterior sampling under partial coverage. CoRR arxiv:2107.06226."},{"key":"6564_CR51","unstructured":"van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., & Modayil, J. (2018). Deep reinforcement learning and the deadly triad. CoRR arxiv:1812.02648."},{"key":"6564_CR52","unstructured":"Wang, R., Foster, D. P., & Kakade, S. M. (2020). What are the statistical limits of offline RL with linear function approximation? CoRR arxiv:2010.11895."},{"key":"6564_CR53","unstructured":"Wang, R., Wu, Y., Salakhutdinov, R., & Kakade, S. M. (2021). Instabilities of offline RL with pre-trained neural representation. CoRR arxiv:2103.04947."},{"key":"6564_CR54","unstructured":"Xie, T., Foster, D. J., Bai, Y., Jiang, N., & Kakade, S. M. (2022). The role of coverage in online reinforcement learning."},{"key":"6564_CR55","unstructured":"Xie, T., & Jiang, N. (2020). Batch value-function approximation with only realizability. CoRR arxiv:2008.04990."},{"key":"6564_CR56","unstructured":"Yang, Z., Xie, Y., & Wang, Z. (2019). A theoretical analysis of deep q-learning. CoRR arxiv:1901.00137."},{"key":"6564_CR57","unstructured":"Yarats, D., Brandfonbrener, D., Liu, H., Laskin, M., Abbeel, P. , Lazaric, A., & Pinto, L. (2022). Don\u2019t change the algorithm, change the data: Exploratory data for offline reinforcement learning. CoRR arxiv:2201.13425."},{"key":"6564_CR58","unstructured":"Zanette, A. (2020). Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL. CoRR arxiv:2012.08005."},{"key":"6564_CR59","unstructured":"Zhang, S., Yao, H., & Whiteson, S. (2021). Breaking the deadly triad with a target network. CoRR arxiv:2101.08862."}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-024-06564-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-024-06564-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-024-06564-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,7]],"date-time":"2024-08-07T17:22:22Z","timestamp":1723051342000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-024-06564-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,7]]},"references-count":59,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2024,9]]}},"alternative-id":["6564"],"URL":"https:\/\/doi.org\/10.1007\/s10994-024-06564-5","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"value":"0885-6125","type":"print"},{"value":"1573-0565","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,6,7]]},"assertion":[{"value":"29 November 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 March 2024","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 April 2024","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"7 June 2024","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no relevant financial or non-financial interests to disclose.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"The developed code is publicly available .","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Code availability"}}]}}