{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T07:19:46Z","timestamp":1740122386411,"version":"3.37.3"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2024,12,18]],"date-time":"2024-12-18T00:00:00Z","timestamp":1734480000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,12,18]],"date-time":"2024-12-18T00:00:00Z","timestamp":1734480000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001858","name":"VINNOVA","doi-asserted-by":"publisher","award":["2017-04617"],"award-info":[{"award-number":["2017-04617"]}],"id":[{"id":"10.13039\/501100001858","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100019290","name":"Halmstad University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100019290","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Data Min Knowl Disc"],"published-print":{"date-parts":[[2025,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>We propose a new sequential decision-making setting, combining key aspects of two established online learning problems with bandit feedback. The optimal action to play at any given moment is contingent on an underlying changing state that is not directly observable by the agent. Each state is associated with a context distribution, possibly corrupted, allowing the agent to identify the state. Furthermore, states evolve in a Markovian fashion, providing useful information to estimate the current state via state history. In the proposed problem setting, we tackle the challenge of deciding on which of the two sources of information the agent should base its action selection. We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit. We capture the time-correlation of states through iteratively learning the action-reward transition model, allowing for efficient exploration of actions. Our setting is motivated by adaptive mobile health (mHealth) interventions. Users transition through different, time-correlated, but only partially observable internal states, determining their current needs. The side information associated with each internal state might not always be reliable, and standard approaches solely rely on the context risk of incurring high regret. Similarly, some users might exhibit weaker correlations between subsequent states, leading to approaches that solely rely on state transitions risking the same. 
We analyze our setting and algorithm in terms of regret lower and upper bounds and evaluate our method on simulated medication adherence intervention data and several real-world data sets, showing improved empirical performance compared to several popular algorithms.<\/jats:p>","DOI":"10.1007\/s10618-024-01082-3","type":"journal-article","created":{"date-parts":[[2024,12,18]],"date-time":"2024-12-18T12:51:43Z","timestamp":1734526303000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["A new bandit setting balancing information from state evolution and corrupted context"],"prefix":"10.1007","volume":"39","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7453-9186","authenticated-orcid":false,"given":"Alexander","family":"Galozy","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7796-5201","authenticated-orcid":false,"given":"S\u0142awomir","family":"Nowaczyk","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1145-4297","authenticated-orcid":false,"given":"Mattias","family":"Ohlsson","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,12,18]]},"reference":[{"issue":"3","key":"1082_CR1","doi-asserted-by":"publisher","first-page":"323","DOI":"10.1080\/0144341940140306","volume":"14","author":"R Abouserie","year":"1994","unstructured":"Abouserie R (1994) Sources and levels of stress in relation to locus of control and self esteem in university students. Educ Psychol 14(3):323\u2013330. https:\/\/doi.org\/10.1080\/0144341940140306","journal-title":"Educ Psychol"},{"key":"1082_CR2","unstructured":"Agarwal A, Luo H, Neyshabur B, et\u00a0al (2017) Corralling a band of bandit algorithms. In: Kale S, Shamir O (eds) Proceedings of the 2017 Conference on Learning Theory, Proceedings of Machine Learning Research, vol\u00a065. PMLR, pp 12\u201338, https:\/\/proceedings.mlr.press\/v65\/agarwal17b.html"},{"key":"1082_CR3","unstructured":"Agrawal S, Goyal N (2013a) Further optimal regret bounds for Thompson sampling. In: Carvalho CM, Ravikumar P (eds) Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol\u00a031. PMLR, Scottsdale, Arizona, USA, pp 99\u2013107, https:\/\/proceedings.mlr.press\/v31\/agrawal13a.html"},{"key":"1082_CR4","unstructured":"Agrawal S, Goyal N (2013b) Thompson sampling for contextual bandits with linear payoffs. In: Proceedings of Machine Learning Research, vol\u00a028. PMLR, Atlanta, Georgia, USA, pp 127\u2013135, http:\/\/proceedings.mlr.press\/v28\/agrawal13.html"},{"key":"1082_CR5","volume-title":"Neural Inf Process","author":"R Allesiardo","year":"2014","unstructured":"Allesiardo R, F\u00e9raud R, Bouneffouf D (2014) A neural networks committee for the contextual bandit problem. In: Loo CK, Yap KS, Wong KW et al (eds) Neural Inf Process. Springer, Cham"},{"key":"1082_CR6","doi-asserted-by":"publisher","first-page":"397","DOI":"10.5555\/944919.944941","volume":"3","author":"P Auer","year":"2003","unstructured":"Auer P (2003) Using confidence bounds for exploitation-exploration trade-offs. J Mach Learn Res 3:397\u2013422.
https:\/\/doi.org\/10.5555\/944919.944941","journal-title":"J Mach Learn Res"},{"key":"1082_CR7","doi-asserted-by":"publisher","first-page":"235","DOI":"10.1023\/A:1013689704352","volume":"47","author":"P Auer","year":"2002","unstructured":"Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Mach Learn 47:235\u2013256. https:\/\/doi.org\/10.1023\/A:1013689704352","journal-title":"Mach Learn"},{"issue":"1","key":"1082_CR8","doi-asserted-by":"publisher","first-page":"48","DOI":"10.1137\/S0097539701398375","volume":"32","author":"P Auer","year":"2003","unstructured":"Auer P, Cesa-Bianchi N, Freund Y et al (2003) The nonstochastic multiarmed bandit problem. SIAM J Comput 32(1):48\u201377. https:\/\/doi.org\/10.1137\/S0097539701398375","journal-title":"SIAM J Comput"},{"key":"1082_CR9","unstructured":"Auer P, Gajane P, Ortner R (2019) Adaptively tracking the best bandit arm with an unknown number of distribution changes. In: Beygelzimer A, Hsu D (eds) Proceedings of the Thirty-Second Conference on Learning Theory, Proceedings of Machine Learning Research, vol\u00a099. PMLR, Phoenix, USA, pp 138\u2013158, http:\/\/proceedings.mlr.press\/v99\/auer19a.html"},{"key":"1082_CR10","unstructured":"Baltrunas L, Church K, Karatzoglou A, et\u00a0al (2015) Frappe: Understanding the usage and perception of mobile app recommendations in-the-wild. CoRR abs\/1505.03014. http:\/\/arxiv.org\/abs\/1505.03014, arXiv:1505.03014"},{"issue":"1","key":"1082_CR11","doi-asserted-by":"publisher","first-page":"276","DOI":"10.1287\/opre.2019.1902","volume":"68","author":"H Bastani","year":"2020","unstructured":"Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper Res 68(1):276\u2013294. https:\/\/doi.org\/10.1287\/opre.2019.1902","journal-title":"Oper Res"},{"issue":"2","key":"1082_CR12","doi-asserted-by":"publisher","first-page":"1","DOI":"10.3390\/a11020013","volume":"11","author":"S Boldrini","year":"2018","unstructured":"Boldrini S, De Nardis L, Caso G et al (2018) muMAB: a multi-armed bandit model for wireless network selection. Algorithms 11(2):1. https:\/\/doi.org\/10.3390\/a11020013","journal-title":"Algorithms"},{"issue":"2","key":"1082_CR13","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/2652481","volume":"47","author":"G Bonnin","year":"2014","unstructured":"Bonnin G, Jannach D (2014) Automated generation of music playlists: survey and experiments. ACM Comput Surv 47(2):1. https:\/\/doi.org\/10.1145\/2652481","journal-title":"ACM Comput Surv"},{"key":"1082_CR14","doi-asserted-by":"publisher","unstructured":"Bouneffouf D (2021) Corrupted contextual bandits: Online learning with corrupted context. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 3145\u20133149, https:\/\/doi.org\/10.1109\/ICASSP39728.2021.9414300","DOI":"10.1109\/ICASSP39728.2021.9414300"},{"key":"1082_CR15","doi-asserted-by":"publisher","unstructured":"Bouneffouf D, Rish I, Cecchi G, et\u00a0al (2017) Context attentive bandits: contextual bandit with restricted context. In: International Joint Conference on Artificial Intelligence, IJCAI-17, pp 1468\u20131475, https:\/\/doi.org\/10.24963\/ijcai.2017\/203","DOI":"10.24963\/ijcai.2017\/203"},{"key":"1082_CR16","volume-title":"Neural Information Processing","author":"O Chapelle","year":"2011","unstructured":"Chapelle O, Li L (2011) An Empirical Evaluation of Thompson Sampling. In: Shawe-Taylor J, Zemel RS, Bartlett PL et al (eds) Neural Information Processing.
Curran Associates Inc"},{"key":"1082_CR17","unstructured":"Chu W, Li L, Reyzin L, et\u00a0al (2011) Contextual bandits with linear payoff functions. Machine Learning Research, vol\u00a015. JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL, USA, pp 208\u2013214, http:\/\/proceedings.mlr.press\/v15\/chu11a.html"},{"key":"1082_CR18","doi-asserted-by":"crossref","unstructured":"Garivier A, Moulines E (2011) On upper-confidence bound policies for switching bandit problems. In: Proceedings of the 22nd International Conference on Algorithmic Learning Theory. Springer-Verlag, Berlin, Heidelberg, ALT\u201911, p 174-188, https:\/\/dl.acm.org\/doi\/10.5555\/2050345.2050365","DOI":"10.1007\/978-3-642-24412-4_16"},{"key":"1082_CR19","doi-asserted-by":"publisher","first-page":"148","DOI":"10.1111\/j.2517-6161.1979.tb01068.x","volume":"41","author":"J Gittins","year":"1979","unstructured":"Gittins J (1979) Bandit processes and dynamic allocation indices. J royal stat soc ser b-methodol 41:148\u2013164. https:\/\/doi.org\/10.1111\/j.2517-6161.1979.tb01068.x","journal-title":"J royal stat soc ser b-methodol"},{"key":"1082_CR20","doi-asserted-by":"publisher","unstructured":"Hariri N, Mobasher B, Burke R (2012) Context-aware music recommendation based on latenttopic sequential patterns. In: ACM Conference on Recommender Systems. Association for Computing Machinery, New York, NY, USA, RecSys \u201912, p 131-13https:\/\/doi.org\/10.1145\/2365952.2365979","DOI":"10.1145\/2365952.2365979"},{"issue":"11","key":"1082_CR21","doi-asserted-by":"publisher","DOI":"10.1098\/rsos.171377","volume":"4","author":"X Huo","year":"2017","unstructured":"Huo X, Fu F (2017) Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Soci Open Sci 4(11):171377. https:\/\/doi.org\/10.1098\/rsos.171377","journal-title":"Royal Soci Open Sci"},{"key":"1082_CR22","doi-asserted-by":"publisher","unstructured":"Jannach D, Lerche L, Kamehkhosh I (2015) Beyond \"hitting the hits\": Generating coherent music playlist continuations with the right tracks. In: ACM Conference on Recommender Systems. Association for Computing Machinery, New York, NY, USA, RecSys \u201915, p 187-19https:\/\/doi.org\/10.1145\/2792838.2800182","DOI":"10.1145\/2792838.2800182"},{"key":"1082_CR23","volume-title":"Algorithmic Learning Theory","author":"E Kaufmann","year":"2012","unstructured":"Kaufmann E, Korda N, Munos R (2012) Thompson sampling: an asymptotically optimal finite-time analysis. In: Bshouty NH, Stoltz G, Vayatis N et al (eds) Algorithmic Learning Theory. Springer, Berlin"},{"key":"1082_CR24","doi-asserted-by":"publisher","unstructured":"Kerkouche R, Alami R, F\u00e9raud R, et\u00a0al (2018) Node-based optimization of lora transmissions with multi-armed bandit algorithms. In: 2018 25th International Conference on Telecommunications (ICT), pp 521\u2013526https:\/\/doi.org\/10.1109\/ICT.2018.8464949","DOI":"10.1109\/ICT.2018.8464949"},{"key":"1082_CR25","unstructured":"Kocsis L, Szepesv\u00e1ri C (2006) Discounted ucb. 2nd PASCAL Challenges Workshop https:\/\/www.lri.fr\/~sebag\/Slides\/Venice\/Kocsis.pdf"},{"issue":"1","key":"1082_CR26","doi-asserted-by":"publisher","first-page":"4","DOI":"10.1016\/0196-8858(85)90002-8","volume":"6","author":"T Lai","year":"1985","unstructured":"Lai T, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv Appl Math 6(1):4\u201322. 
https:\/\/doi.org\/10.1016\/0196-8858(85)90002-8","journal-title":"Adv Appl Math"},{"key":"1082_CR27","unstructured":"Langford J, Zhang T (2008) The epoch-greedy algorithm for multi-armed bandits with side information. In: Platt J, Koller D, Singer Y, et\u00a0al (eds) Neural Information Processing, vol\u00a020. Curran Associates, Inc., pp 817\u2013824, https:\/\/proceedings.neurips.cc\/paper\/2007\/file\/4b04a686b0ad13dce35fa99fa4161c65-Paper.pdf"},{"key":"1082_CR28","doi-asserted-by":"crossref","unstructured":"Li L, Chu W, Langford J et al (2010) A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th International Conference on World Wide Web, WWW \u201910, pp 661\u2013670. https:\/\/doi.org\/10.1145\/1772690.1772758","DOI":"10.1145\/1772690.1772758"},{"key":"1082_CR29","doi-asserted-by":"publisher","unstructured":"Mei J, Xiao C, Szepesv\u00e1ri C, et\u00a0al (2020) On the global convergence rates of softmax policy gradient methods. In: Proceedings of the 37th International Conference on Machine Learning. JMLR.org, ICML\u201920, https:\/\/doi.org\/10.5555\/3524938.3525571","DOI":"10.5555\/3524938.3525571"},{"key":"1082_CR30","unstructured":"Mobasher B, Nakagawa M (2002) Impact of site characteristics on recommendation models based on association rules and sequential patterns. In: IEEE International Conference on Data Mining (ICDM\u20192002), Maebashi City, Japan, Citeseer"},{"key":"1082_CR31","doi-asserted-by":"publisher","unstructured":"Mobasher B, Dai H, Luo T, et\u00a0al (2002) Using sequential and non-sequential patterns in predictive web usage mining tasks. In: IEEE International Conference on Data Mining, pp 669\u2013672, https:\/\/doi.org\/10.1109\/ICDM.2002.1184025","DOI":"10.1109\/ICDM.2002.1184025"},{"key":"1082_CR32","doi-asserted-by":"publisher","unstructured":"Natarajan N, Shin D, Dhillon IS (2013) Which app will you use next? Collaborative filtering with interactional context. In: ACM Conference on Recommender Systems. Association for Computing Machinery, New York, NY, USA, RecSys \u201913, pp 201\u2013208, https:\/\/doi.org\/10.1145\/2507157.2507186","DOI":"10.1145\/2507157.2507186"},{"key":"1082_CR33","doi-asserted-by":"publisher","first-page":"62","DOI":"10.1016\/j.tcs.2014.09.026","volume":"558","author":"R Ortner","year":"2014","unstructured":"Ortner R, Ryabko D, Auer P et al (2014) Regret bounds for restless Markov bandits. Theor Comput Sci 558:62\u201376. https:\/\/doi.org\/10.1016\/j.tcs.2014.09.026. (Algorithmic Learning Theory)","journal-title":"Theor Comput Sci"},{"key":"1082_CR34","doi-asserted-by":"publisher","unstructured":"Prochaska JO, Redding CA, Evers KE et al (2015) The transtheoretical model and stages of change. In: Health behavior: Theory, research, and practice, p 97. https:\/\/doi.org\/10.4278\/0890-1171-12.1.38","DOI":"10.4278\/0890-1171-12.1.38"},{"issue":"4","key":"1082_CR35","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3190616","volume":"51","author":"M Quadrana","year":"2018","unstructured":"Quadrana M, Cremonesi P, Jannach D (2018) Sequence-aware recommender systems. ACM Comput Surv 51(4):1","journal-title":"ACM Comput Surv"},{"key":"1082_CR36","doi-asserted-by":"publisher","first-page":"500","DOI":"10.1287\/mksc.2016.1023","volume":"36","author":"E Schwartz","year":"2017","unstructured":"Schwartz E, Bradlow E, Fader P (2017) Customer acquisition via display advertising using multi-armed bandit experiments.
Mark Sci 36:500","journal-title":"Mark Sci"},{"key":"1082_CR37","doi-asserted-by":"publisher","unstructured":"Shen W, Wang J, Jiang YG, et\u00a0al (2015) Portfolio choices with orthogonal bandit learning. In: Proceedings of the 24th International Conference on Artificial Intelligence. AAAI Press, IJCAI\u201915, pp 974\u2013980, https:\/\/doi.org\/10.5555\/2832249.2832384","DOI":"10.5555\/2832249.2832384"},{"key":"1082_CR38","volume-title":"Reinforcement Learning: An Introduction","author":"RS Sutton","year":"2018","unstructured":"Sutton RS, Barto AG (2018) Reinforcement Learning: An Introduction. A Bradford Book, Cambridge"},{"issue":"3\/4","key":"1082_CR39","doi-asserted-by":"publisher","first-page":"285","DOI":"10.1093\/biomet\/25.3-4.285","volume":"25","author":"WR Thompson","year":"1933","unstructured":"Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3\/4):285\u2013294. https:\/\/doi.org\/10.1093\/biomet\/25.3-4.285","journal-title":"Biometrika"},{"key":"1082_CR40","doi-asserted-by":"publisher","first-page":"199","DOI":"10.1214\/14-STS504","volume":"30","author":"S Villar","year":"2015","unstructured":"Villar S, Bowden J, Wason J (2015) Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges. Stat Sci 30:199. https:\/\/doi.org\/10.1214\/14-STS504","journal-title":"Stat Sci"},{"issue":"8","key":"1082_CR41","doi-asserted-by":"publisher","first-page":"1569","DOI":"10.1109\/TKDE.2018.2866041","volume":"31","author":"Q Wang","year":"2019","unstructured":"Wang Q, Zeng C, Zhou W et al (2019) Online interactive collaborative filtering using multi-armed bandit with dependent arms. IEEE Trans Knowl Data Eng 31(8):1569\u20131580. https:\/\/doi.org\/10.1109\/TKDE.2018.2866041","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"1082_CR42","unstructured":"Wen Z, Kveton B, Valko M, et\u00a0al (2017) Online influence maximization under independent cascade model with semi-bandit feedback. In: Guyon I, Luxburg UV, Bengio S, et\u00a0al (eds) Advances in Neural Information Processing Systems, vol\u00a030. Curran Associates, Inc., pp 3022\u20133032, https:\/\/proceedings.neurips.cc\/paper\/2017\/file\/7137debd45ae4d0ab9aa953017286b20-Paper.pdf"},{"key":"1082_CR43","doi-asserted-by":"publisher","first-page":"287","DOI":"10.2307\/3214163","volume":"25","author":"P Whittle","year":"1988","unstructured":"Whittle P (1988) Restless bandits: activity allocation in a changing world. J Appl Probab 25:287. https:\/\/doi.org\/10.2307\/3214163","journal-title":"J Appl Probab"},{"key":"1082_CR44","doi-asserted-by":"publisher","unstructured":"Yu JY, Mannor S (2009) Piecewise-stationary bandit problems with side observations. In: Proceedings of the 26th Annual International Conference on Machine Learning. Association for Computing Machinery, New York, NY, USA, ICML \u201909, pp 1177\u20131184, https:\/\/doi.org\/10.1145\/1553374.1553524","DOI":"10.1145\/1553374.1553524"},{"key":"1082_CR45","doi-asserted-by":"publisher","unstructured":"Yun S, Nam JH, Mo S et al (2017) Contextual multi-armed bandits under feature uncertainty. https:\/\/doi.org\/10.2172\/1345927","DOI":"10.2172\/1345927"},{"key":"1082_CR46","unstructured":"Zhou X, Xiong Y, Chen N, et\u00a0al (2021) Regime switching bandits. In: Ranzato M, Beygelzimer A, Dauphin Y, et\u00a0al (eds) Advances in Neural Information Processing Systems, vol\u00a034.
Curran Associates, Inc., pp 4542\u20134554, https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2021\/file\/240ac9371ec2671ae99847c3ae2e6384-Paper.pdf"}],"container-title":["Data Mining and Knowledge Discovery"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10618-024-01082-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10618-024-01082-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10618-024-01082-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,16]],"date-time":"2025-01-16T11:45:56Z","timestamp":1737027956000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10618-024-01082-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,18]]},"references-count":46,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1]]}},"alternative-id":["1082"],"URL":"https:\/\/doi.org\/10.1007\/s10618-024-01082-3","relation":{},"ISSN":["1384-5810","1573-756X"],"issn-type":[{"type":"print","value":"1384-5810"},{"type":"electronic","value":"1573-756X"}],"subject":[],"published":{"date-parts":[[2024,12,18]]},"assertion":[{"value":"8 November 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 November 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 December 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"9"}}
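The abstract in this record describes the mechanism only in prose: a referee that dynamically arbitrates between a context-driven policy and a policy driven by the Markovian state evolution. As a rough illustration of that idea, and not the authors' algorithm, the following minimal Python sketch mixes two base policies with an Exp3-style referee on a toy hidden-Markov bandit with corrupted context. Every name and parameter here (P, corrupt, eta, gamma, the epsilon-greedy base policies, and the simplification that the transition matrix is given rather than learned online, as it is in the paper) is an assumption made for the sketch.

```python
# Minimal, illustrative sketch -- NOT the paper's implementation. It mimics
# the mechanism the abstract describes: a Hedge/Exp3-style "referee" that
# adaptively mixes two base policies, one trusting the (possibly corrupted)
# context and one trusting the Markovian state evolution.
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_states, T = 3, 3, 5000
P = np.array([[0.8, 0.1, 0.1],          # latent-state transition matrix
              [0.1, 0.8, 0.1],          # (assumed known here; the paper
              [0.1, 0.1, 0.8]])         # learns such structure online)
reward_p = np.eye(n_states) * 0.90 + 0.05   # arm k pays best in state k
corrupt = 0.4                               # prob. the context is pure noise

eta, gamma = 0.1, 0.1                   # referee step size, exploration floor
w = np.ones(2)                          # weights: [context policy, transition policy]
Q_ctx = np.zeros((n_states, n_arms)); N_ctx = np.zeros_like(Q_ctx)
Q_trn = np.zeros((n_states, n_arms)); N_trn = np.zeros_like(Q_trn)
state, belief, total = 0, 0, 0.0

for _ in range(T):
    state = rng.choice(n_states, p=P[state])          # hidden state evolves
    ctx = int(state if rng.random() > corrupt else rng.integers(n_states))

    # Base policy 1: trust the observed context (epsilon-greedy stand-in).
    a_ctx = int(Q_ctx[ctx].argmax()) if rng.random() > 0.05 else int(rng.integers(n_arms))
    # Base policy 2: trust state evolution -- propagate the previous belief.
    s_hat = int(P[belief].argmax())
    a_trn = int(Q_trn[s_hat].argmax()) if rng.random() > 0.05 else int(rng.integers(n_arms))

    # Referee: Exp3-style draw between the two base policies.
    p_ctx = (1 - gamma) * w[0] / w.sum() + gamma / 2
    use_ctx = rng.random() < p_ctx
    a = a_ctx if use_ctx else a_trn
    r = float(rng.random() < reward_p[state, a])      # Bernoulli bandit feedback
    total += r

    # Importance-weighted update of whichever policy was followed.
    i, p_i = (0, p_ctx) if use_ctx else (1, 1 - p_ctx)
    w[i] *= np.exp(eta * r / p_i)
    w /= w.sum()                                      # renormalize for stability

    # Both value tables learn from the observed reward.
    for Q, N, s in ((Q_ctx, N_ctx, ctx), (Q_trn, N_trn, s_hat)):
        N[s, a] += 1
        Q[s, a] += (r - Q[s, a]) / N[s, a]
    belief = ctx if use_ctx else s_hat                # carry the state estimate

print(f"average reward over {T} rounds: {total / T:.3f}")
```

Under these assumptions, the referee's weight drifts toward the context policy when corrupt is small and toward the transition policy when the context becomes unreliable, which mirrors the trade-off between the two information sources that the abstract motivates.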