{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,4]],"date-time":"2026-05-04T18:56:32Z","timestamp":1777920992966,"version":"3.51.4"},"reference-count":35,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T00:00:00Z","timestamp":1736380800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Non-stationary multi-armed bandit (MAB) problems have recently attracted extensive attention. We focus on the abruptly changing scenario where reward distributions remain constant for a certain period and change at unknown time steps. Although Thompson sampling (TS) has shown success in non-stationary settings, there is currently no regret bound analysis for TS with uninformative priors. To address this, we propose two algorithms, discounted TS and sliding-window TS, designed for sub-Gaussian reward distributions. For these algorithms, we establish an upper bound for the expected regret by bounding the expected number of times a suboptimal arm is played. We show that the regret upper bounds of both algorithms are O~(TBT), where T is the time horizon and BT is the number of breakpoints. This upper bound matches the lower bound for abruptly changing problems up to a logarithmic factor. Empirical comparisons with other non-stationary bandit algorithms highlight the competitive performance of our proposed methods.<\/jats:p>","DOI":"10.3390\/e27010051","type":"journal-article","created":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T09:54:27Z","timestamp":1736416467000},"page":"51","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Thompson Sampling for Non-Stationary Bandit Problems"],"prefix":"10.3390","volume":"27","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-7684-9882","authenticated-orcid":false,"given":"Han","family":"Qi","sequence":"first","affiliation":[{"name":"School of Software Engineering, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-5209-2309","authenticated-orcid":false,"given":"Fei","family":"Guo","sequence":"additional","affiliation":[{"name":"School of Software Engineering, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2136-3196","authenticated-orcid":false,"given":"Li","family":"Zhu","sequence":"additional","affiliation":[{"name":"School of Software Engineering, Xi\u2019an Jiaotong University, Xi\u2019an 710049, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,1,9]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"527","DOI":"10.1090\/S0002-9904-1952-09620-8","article-title":"Some aspects of the sequential design of experiments","volume":"58","author":"Robbins","year":"1952","journal-title":"Bull. Am. Math. Soc."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Li, L., Chu, W., Langford, J., and Wang, X. (2011, January 9\u201312). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. Proceedings of the fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China.","DOI":"10.1145\/1935826.1935878"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Bouneffouf, D., Bouzeghoub, A., and Ganarski, A.L. (2012). A contextual-bandit algorithm for mobile context-aware recommender system. Neural Information Processing, Proceedings of the International Conference, ICONIP 2012, Doha, Qatar, 12\u201315 November 2012, Springer.","DOI":"10.1007\/978-3-642-34487-9_40"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Li, S., Karatzoglou, A., and Gentile, C. (2016, January 17\u201321). Collaborative filtering bandits. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy.","DOI":"10.1145\/2911451.2911548"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"500","DOI":"10.1287\/mksc.2016.1023","article-title":"Customer acquisition via display advertising using multi-armed bandit experiments","volume":"36","author":"Schwartz","year":"2017","journal-title":"Mark. Sci."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Wu, Q., Iyer, N., and Wang, H. (2018, January 8\u201312). Learning contextual bandits in a non-stationary environment. Proceedings of the The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA.","DOI":"10.1145\/3209978.3210051"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Liu, F., Lee, J., and Shroff, N. (2018, January 2\u20137). A change-detection based framework for piecewise-stationary multi-armed bandit problem. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11746"},{"key":"ref_8","unstructured":"Cao, Y., Wen, Z., Kveton, B., and Xie, Y. (2019, January 16\u201318). Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Okinawa, Japan."},{"key":"ref_9","unstructured":"Auer, P., Gajane, P., and Ortner, R. (2019, January 25\u201328). Adaptively tracking the best bandit arm with an unknown number of distribution changes. Proceedings of the Conference on Learning Theory, Phoenix, AZ, USA."},{"key":"ref_10","unstructured":"Chen, Y., Lee, C.W., Luo, H., and Wei, C.Y. (2019, January 25\u201328). A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. Proceedings of the Conference on Learning Theory, Phoenix, AZ, USA."},{"key":"ref_11","first-page":"1","article-title":"Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits","volume":"23","author":"Besson","year":"2022","journal-title":"J. Mach. Learn. Res."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Garivier, A., and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. Algorithmic Learning Theory, Proceedings of the 22nd International Conference, ALT 2011, Espoo, Finland, 5\u20137 October 2011, Springer.","DOI":"10.1007\/978-3-642-24412-4_16"},{"key":"ref_13","unstructured":"Raj, V., and Kalyani, S. (2017). Taming non-stationary bandits: A Bayesian approach. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"311","DOI":"10.1613\/jair.1.11407","article-title":"Sliding-window thompson sampling for non-stationary settings","volume":"68","author":"Trovo","year":"2020","journal-title":"J. Artif. Intell. Res."},{"key":"ref_15","unstructured":"Baudry, D., Russac, Y., and Capp\u00e9, O. (2021, January 18\u201324). On Limited-Memory Subsampling Strategies for Bandits. Proceedings of the International Conference on Machine Learning, Online."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1670","DOI":"10.1109\/TC.2020.3022634","article-title":"A change-detection-based Thompson sampling framework for non-stationary bandits","volume":"70","author":"Ghatak","year":"2020","journal-title":"IEEE Trans. Comput."},{"key":"ref_17","unstructured":"Alami, R., and Azizi, O. (2020, January 6\u201312). Ts-glr: An adaptive thompson sampling for the switching multi-armed bandit problem. Proceedings of the NeurIPS 2020 Challenges of Real World Reinforcement Learning Workshop, Virtual."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Viappiani, P. (2013). Thompson sampling for bayesian bandits with resets. Algorithmic Decision Theory, Proceedings of the Third International Conference, ADT 2013, Bruxelles, Belgium, 12\u201314 November 2013, Springer. Proceedings 3.","DOI":"10.1007\/978-3-642-41575-3_31"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Gupta, N., Granmo, O.C., and Agrawala, A. (2011, January 18\u201321). Thompson sampling for dynamic multi-armed bandits. Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops, Honolulu, HI, USA.","DOI":"10.1109\/ICMLA.2011.144"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Cavenaghi, E., Sottocornola, G., Stella, F., and Zanker, M. (2021). Non stationary multi-armed bandit: Empirical evaluation of a new concept drift-aware algorithm. Entropy, 23.","DOI":"10.3390\/e23030380"},{"key":"ref_21","unstructured":"Liu, Y., Van Roy, B., and Xu, K. (2023, January 25\u201327). Nonstationary bandit learning via predictive sampling. Proceedings of the International Conference on Artificial Intelligence and Statistics, Valencia, Spain."},{"key":"ref_22","unstructured":"Fiandri, M., Metelli, A.M., and Trov\u00f2, F. (2024). Sliding-Window Thompson Sampling for Non-Stationary Settings. arXiv."},{"key":"ref_23","unstructured":"Qi, H., Wang, Y., and Zhu, L. (2023). Discounted thompson sampling for non-stationary bandit problems. arXiv."},{"key":"ref_24","unstructured":"Kocsis, L., and Szepesv\u00e1ri, C. (2006, January 10\u201312). Discounted ucb. Proceedings of the 2nd PASCAL Challenges Workshop, Venice, Italy."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1137\/S0097539701398375","article-title":"The nonstochastic multiarmed bandit problem","volume":"32","author":"Auer","year":"2002","journal-title":"SIAM J. Comput."},{"key":"ref_26","unstructured":"Besbes, O., Gur, Y., and Zeevi, A. (2014, January 8\u201313). Stochastic multi-armed-bandit problem with non-stationary rewards. Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada."},{"key":"ref_27","unstructured":"Combes, R., and Proutiere, A. (2014, January 21\u201326). Unimodal bandits: Regret lower bounds and optimal algorithms. Proceedings of the International Conference on Machine Learning, Beijing, China."},{"key":"ref_28","unstructured":"Mellor, J., and Shapiro, J. (May, January 29). Thompson sampling in switching environments with Bayesian online change detection. Proceedings of the Artificial Intelligence and Statistics, Scottsdale, AZ, USA."},{"key":"ref_29","unstructured":"Suk, J., and Kpotufe, S. (2022, January 2\u20135). Tracking Most Significant Arm Switches in Bandits. Proceedings of the Conference on Learning Theory, London, UK."},{"key":"ref_30","unstructured":"Qin, Y., Menara, T., Oymak, S., Ching, S., and Pasqualetti, F. (2021, January 18-24). Non-stationary representation learning in sequential multi-armed bandits. Proceedings of the ICML Workshop on Reinforcement Learning Theory, Virtual."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1109\/OJCSYS.2022.3178540","article-title":"Non-stationary representation learning in sequential linear bandits","volume":"1","author":"Qin","year":"2022","journal-title":"IEEE Open J. Control Syst."},{"key":"ref_32","unstructured":"Agrawal, S., and Goyal, N. (May, January 29). Further optimal regret bounds for thompson sampling. Proceedings of the Artificial Intelligence and Statistics, Scottsdale, AZ, USA."},{"key":"ref_33","unstructured":"Jin, T., Xu, P., Shi, J., Xiao, X., and Gu, Q. (2021, January 18\u201324). Mots: Minimax optimal thompson sampling. Proceedings of the International Conference on Machine Learning, Online."},{"key":"ref_34","unstructured":"Jin, T., Xu, P., Xiao, X., and Anandkumar, A. (December, January 28). Finite-time regret of thompson sampling algorithms for exponential family multi-armed bandits. Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA."},{"key":"ref_35","unstructured":"Abramowitz, M., and Stegun, I.A. (1964). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/27\/1\/51\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,8]],"date-time":"2025-10-08T10:25:50Z","timestamp":1759919150000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/27\/1\/51"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,9]]},"references-count":35,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,1]]}},"alternative-id":["e27010051"],"URL":"https:\/\/doi.org\/10.3390\/e27010051","relation":{},"ISSN":["1099-4300"],"issn-type":[{"value":"1099-4300","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,1,9]]}}}