{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T18:34:09Z","timestamp":1771612449158,"version":"3.50.1"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2015,7,15]],"date-time":"2015-07-15T00:00:00Z","timestamp":1436918400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["60745002,61175057"],"award-info":[{"award-number":["60745002,61175057"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Intell. Syst. Technol."],"published-print":{"date-parts":[[2015,8,13]]},"abstract":"<jats:p>Markov decision processes (MDPs) provide a rich framework for planning under uncertainty. However, exactly solving a large MDP is usually intractable due to the \u201ccurse of dimensionality\u201d\u2014 the state space grows exponentially with the number of state variables. Online algorithms tackle this problem by avoiding computing a policy for the entire state space. On the other hand, since online algorithm has to find a near-optimal action online in almost real time, the computation time is often very limited. In the context of reinforcement learning, MAXQ is a value function decomposition method that exploits the underlying structure of the original MDP and decomposes it into a combination of smaller subproblems arranged over a task hierarchy. In this article, we present MAXQ-OP\u2014a novel online planning algorithm for large MDPs that utilizes MAXQ hierarchical decomposition in online settings. Compared to traditional online planning algorithms, MAXQ-OP is able to reach much more deeper states in the search tree with relatively less computation time by exploiting MAXQ hierarchical decomposition online. We empirically evaluate our algorithm in the standard Taxi domain\u2014a common benchmark for MDPs\u2014to show the effectiveness of our approach. We have also conducted a long-term case study in a highly complex simulated soccer domain and developed a team named WrightEagle that has won five world champions and five runners-up in the recent 10 years of RoboCup Soccer Simulation 2D annual competitions. The results in the RoboCup domain confirm the scalability of MAXQ-OP to very large domains.<\/jats:p>","DOI":"10.1145\/2717316","type":"journal-article","created":{"date-parts":[[2015,7,17]],"date-time":"2015-07-17T13:21:25Z","timestamp":1437139285000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["Online Planning for Large Markov Decision Processes with Hierarchical Decomposition"],"prefix":"10.1145","volume":"6","author":[{"given":"Aijun","family":"Bai","sequence":"first","affiliation":[{"name":"University of Science and Technology of China, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Feng","family":"Wu","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiaoping","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2015,7,15]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Russell","author":"Andre David","year":"2002","unstructured":"David Andre and Stuart J . Russell . 2002 . State Abstraction for Programmable Reinforcement Learning Agents. Technical Report. University of California at Berkeley. David Andre and Stuart J. Russell. 2002. State Abstraction for Programmable Reinforcement Learning Agents. Technical Report. University of California at Berkeley."},{"key":"e_1_2_2_2_1","volume-title":"Proceedings of the FLAIRS Conference. 509--514","author":"Asadi Mehran","year":"2004","unstructured":"Mehran Asadi and Manfred Huber . 2004 . State space reduction for hierarchical reinforcement learning . In Proceedings of the FLAIRS Conference. 509--514 . Mehran Asadi and Manfred Huber. 2004. State space reduction for hierarchical reinforcement learning. In Proceedings of the FLAIRS Conference. 509--514."},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1013689704352"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/2343896.2343928"},{"key":"e_1_2_2_5_1","unstructured":"Aijun Bai Feng Wu and Xiaoping Chen. 2013a. Bayesian mixture modelling and inference based Thompson sampling in Monte-Carlo tree search. In Advances in Neural Information Processing Systems 26. 1646--1654.  Aijun Bai Feng Wu and Xiaoping Chen. 2013a. Bayesian mixture modelling and inference based Thompson sampling in Monte-Carlo tree search. In Advances in Neural Information Processing Systems 26. 1646--1654."},{"key":"e_1_2_2_6_1","series-title":"Lecture Notes in Computer Science","volume-title":"RoboCup 2012: Robot Soccer World Cup XVI","author":"Bai Aijun","unstructured":"Aijun Bai , Feng Wu , and Xiaoping Chen . 2013b. Towards a principled solution to simulated robot soccer . In RoboCup 2012: Robot Soccer World Cup XVI . Lecture Notes in Computer Science , Vol. 7500 . Springer , 141--153. Aijun Bai, Feng Wu, and Xiaoping Chen. 2013b. Towards a principled solution to simulated robot soccer. In RoboCup 2012: Robot Soccer World Cup XVI. Lecture Notes in Computer Science, Vol. 7500. Springer, 141--153."},{"key":"e_1_2_2_7_1","volume-title":"Proceedings of the 8th Conference on Intelligent Autonomous Systems. 438--445","author":"Bakker Bram","year":"2004","unstructured":"Bram Bakker and J\u00fcrgen Schmidhuber . 2004 . Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization . In Proceedings of the 8th Conference on Intelligent Autonomous Systems. 438--445 . Bram Bakker and J\u00fcrgen Schmidhuber. 2004. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proceedings of the 8th Conference on Intelligent Autonomous Systems. 438--445."},{"key":"e_1_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2005.1545548"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/2283696.2283724"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1016\/0004-3702(94)00011-O"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1025696116075"},{"key":"e_1_2_2_13_1","volume-title":"Dynamic Programming","author":"Bellman Richard","unstructured":"Richard Bellman . 1957. Dynamic Programming . Princeton University Press , Princeton, NJ . Richard Bellman. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ."},{"key":"e_1_2_2_14_1","volume-title":"Dynamic Programming and Optimal Control","author":"Bertsekas Dimitri P.","unstructured":"Dimitri P. Bertsekas . 1996. Dynamic Programming and Optimal Control . Athena Scientific . Dimitri P. Bertsekas. 1996. Dynamic Programming and Optimal Control. Athena Scientific."},{"key":"e_1_2_2_15_1","volume-title":"Proceedings of the 13th International Conference on Automated Planning and Scheduling.","author":"Bonet Blai","year":"2003","unstructured":"Blai Bonet and Hector Geffner . 2003 . Labeled RTDP: Improving the convergence of real-time dynamic programming . In Proceedings of the 13th International Conference on Automated Planning and Scheduling. Blai Bonet and Hector Geffner. 2003. Labeled RTDP: Improving the convergence of real-time dynamic programming. In Proceedings of the 13th International Conference on Automated Planning and Scheduling."},{"key":"e_1_2_2_16_1","volume-title":"UCT. In Proceedings of the AAAI Conference on Artificial Intelligence. 1749--1755","author":"Bonet Blai","year":"2012","unstructured":"Blai Bonet and Hector Geffner . 2012 . Action selection for MDPs: Anytime AO&ast; vs . UCT. In Proceedings of the AAAI Conference on Artificial Intelligence. 1749--1755 . Blai Bonet and Hector Geffner. 2012. Action selection for MDPs: Anytime AO&ast; vs. UCT. In Proceedings of the AAAI Conference on Artificial Intelligence. 1749--1755."},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCIAIG.2012.2186810"},{"key":"e_1_2_2_18_1","volume-title":"Proceedings of the IEEE International Conference on Robotics and Automation","volume":"2","author":"Dellaert Frank","year":"2001","unstructured":"Frank Dellaert , Dieter Fox , Wolfram Burgard , and Sebastian Thrun . 2001 . Monte Carlo localization for mobile robots . In Proceedings of the IEEE International Conference on Robotics and Automation , Vol. 2 . IEEE, Los Alamitos, CA, 1322--1328. Frank Dellaert, Dieter Fox, Wolfram Burgard, and Sebastian Thrun. 2001. Monte Carlo localization for mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation, Vol. 2. IEEE, Los Alamitos, CA, 1322--1328."},{"key":"e_1_2_2_19_1","first-page":"1","article-title":"Hierarchical reinforcement learning with the MAXQ value function decomposition","volume":"13","author":"Dietterich Thomas G.","year":"1999","unstructured":"Thomas G. Dietterich . 1999 a. Hierarchical reinforcement learning with the MAXQ value function decomposition . Journal of Machine Learning Research 13 , 1 , 63. Thomas G. Dietterich. 1999a. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Machine Learning Research 13, 1, 63.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_2_20_1","doi-asserted-by":"crossref","unstructured":"Thomas G. Dietterich. 1999b. State abstraction in MAXQ hierarchical reinforcement learning. arXiv preprint cs\/9905015.  Thomas G. Dietterich. 1999b. State abstraction in MAXQ hierarchical reinforcement learning. arXiv preprint cs\/9905015.","DOI":"10.1007\/3-540-44914-0_2"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1160633.1160686"},{"key":"e_1_2_2_22_1","unstructured":"Zohar Feldman and Carmel Domshlak. 2012. Simple regret optimization in online planning for Markov decision processes. arXiv preprint 1206.3382.  Zohar Feldman and Carmel Domshlak. 2012. Simple regret optimization in online planning for Markov decision processes. arXiv preprint 1206.3382."},{"key":"e_1_2_2_23_1","volume-title":"Proceedings of the 18th National Conference on Artificial Intelligence (AAAI\u201902)","author":"Feng Zhengzhu","unstructured":"Zhengzhu Feng and Eric A. Hansen . 2002. Symbolic heuristic search for factored Markov decision processes . In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI\u201902) . 455--460. Zhengzhu Feng and Eric A. Hansen. 2002. Symbolic heuristic search for factored Markov decision processes. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI\u201902). 455--460."},{"key":"e_1_2_2_24_1","series-title":"Lecture Notes in Computer Science","volume-title":"RoboCup 2010: Robot Soccer World Cup XIV","author":"Gabel Thomas","unstructured":"Thomas Gabel and Martin Riedmiller . 2011. On progress in RoboCup: The simulation league showcase . In RoboCup 2010: Robot Soccer World Cup XIV . Lecture Notes in Computer Science , Vol. 6556 . Springer , 36--47. Thomas Gabel and Martin Riedmiller. 2011. On progress in RoboCup: The simulation league showcase. In RoboCup 2010: Robot Soccer World Cup XIV. Lecture Notes in Computer Science, Vol. 6556. Springer, 36--47."},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.artint.2011.03.007"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0004-3702(01)00106-0"},{"key":"e_1_2_2_27_1","volume-title":"Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence. 220--229","author":"Hauskrecht Milos","year":"1998","unstructured":"Milos Hauskrecht , Nicolas Meuleau , Leslie Pack Kaelbling , Thomas Dean , and Craig Boutilier . 1998 . Hierarchical solution of Markov decision processes using macro-actions . In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence. 220--229 . Milos Hauskrecht, Nicolas Meuleau, Leslie Pack Kaelbling, Thomas Dean, and Craig Boutilier. 1998. Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence. 220--229."},{"key":"e_1_2_2_28_1","volume-title":"Proceedings of the 19th International Conference on Machine Learning (ICML\u201902)","volume":"2","author":"Hengst Bernhard","year":"2002","unstructured":"Bernhard Hengst . 2002 . Discovering hierarchy in reinforcement learning with HEXQ . In Proceedings of the 19th International Conference on Machine Learning (ICML\u201902) , Vol. 2 . 243--250. Bernhard Hengst. 2002. Discovering hierarchy in reinforcement learning with HEXQ. In Proceedings of the 19th International Conference on Machine Learning (ICML\u201902), Vol. 2. 243--250."},{"key":"e_1_2_2_29_1","volume-title":"Machine Learning: ECML","author":"Hengst Bernhard","year":"2004","unstructured":"Bernhard Hengst . 2004. Model approximation for HEXQ hierarchical reinforcement learning . In Machine Learning: ECML 2004 . Lecture Notes in Computer Science, Vol. 3201 . Springer , 144--155. Bernhard Hengst. 2004. Model approximation for HEXQ hierarchical reinforcement learning. In Machine Learning: ECML 2004. Lecture Notes in Computer Science, Vol. 3201. Springer, 144--155."},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.5555\/1781238.1781248"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390156.1390211"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.5555\/1248547.1248628"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0004-3702(98)00023-X"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74024-7_7"},{"key":"e_1_2_2_35_1","volume-title":"Proceedings of the 16th International Joint Conference on Artificial Intelligence","volume":"2","author":"Kearns Michael","unstructured":"Michael Kearns , Yishay Mansour , and Andrew Y. Ng . 1999. A sparse sampling algorithm for near-optimal planning in large Markov decision processes . In Proceedings of the 16th International Joint Conference on Artificial Intelligence , Vol. 2 . 1324--1331. Michael Kearns, Yishay Mansour, and Andrew Y. Ng. 1999. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, Vol. 2. 1324--1331."},{"key":"e_1_2_2_36_1","volume-title":"Proceedings of the 23rd International Conference on Automated Planning and Scheduling (ICAPS\u201913)","author":"Keller Thomas","year":"2013","unstructured":"Thomas Keller and Malte Helmert . 2013 . Trial-based heuristic tree search for finite horizon MDPs . In Proceedings of the 23rd International Conference on Automated Planning and Scheduling (ICAPS\u201913) . 135--143. Thomas Keller and Malte Helmert. 2013. Trial-based heuristic tree search for finite horizon MDPs. In Proceedings of the 23rd International Conference on Automated Planning and Scheduling (ICAPS\u201913). 135--143."},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/11871842_29"},{"key":"e_1_2_2_38_1","volume-title":"Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics (ISAIM\u201906)","author":"Li Lihong","unstructured":"Lihong Li , Thomas J. Walsh , and Michael L. Littman . 2006. Towards a unified theory of state abstraction for MDPs . In Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics (ISAIM\u201906) . Lihong Li, Thomas J. Walsh, and Michael L. Littman. 2006. Towards a unified theory of state abstraction for MDPs. In Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics (ISAIM\u201906)."},{"key":"e_1_2_2_39_1","volume-title":"Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence. 394--402","author":"Littman Michael L.","unstructured":"Michael L. Littman , Thomas L. Dean , and Leslie P. Kaelbling . 1995. On the complexity of solving Markov decision problems . In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence. 394--402 . Michael L. Littman, Thomas L. Dean, and Leslie P. Kaelbling. 1995. On the complexity of solving Markov decision problems. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence. 394--402."},{"key":"e_1_2_2_40_1","volume-title":"Proceedings of the ICML 2005 Workshop on Rich Representations for Reinforcement Learning. 39--44","author":"Manfredi Victoria","year":"2005","unstructured":"Victoria Manfredi and Sridhar Mahadevan . 2005 . Hierarchical reinforcement learning using graphical models . In Proceedings of the ICML 2005 Workshop on Rich Representations for Reinforcement Learning. 39--44 . Victoria Manfredi and Sridhar Mahadevan. 2005. Hierarchical reinforcement learning using graphical models. In Proceedings of the ICML 2005 Workshop on Rich Representations for Reinforcement Learning. 39--44."},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1102351.1102423"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390156.1390238"},{"key":"e_1_2_2_43_1","doi-asserted-by":"publisher","DOI":"10.1609\/aimag.v32i1.2342"},{"key":"e_1_2_2_44_1","series-title":"Lecture Notes in Computer Science","volume-title":"Reasoning, Action and Interaction in AI Theories and Systems","author":"Nardi Daniele","unstructured":"Daniele Nardi and Luca Iocchi . 2006. Artificial intelligence in RoboCup . In Reasoning, Action and Interaction in AI Theories and Systems . Lecture Notes in Computer Science , Vol. 4155 . Springer , 193--211. Daniele Nardi and Luca Iocchi. 2006. Artificial intelligence in RoboCup. In Reasoning, Action and Interaction in AI Theories and Systems. Lecture Notes in Computer Science, Vol. 4155. Springer, 193--211."},{"key":"e_1_2_2_45_1","volume-title":"Principles of Artificial Intelligence","author":"Nilsson Nils J.","unstructured":"Nils J. Nilsson . 1982. Principles of Artificial Intelligence . Springer . Nils J. Nilsson. 1982. Principles of Artificial Intelligence. Springer."},{"key":"e_1_2_2_46_1","volume-title":"Markov Decision Processes: Discrete Stochastic Dynamic Programming","author":"Puterman Martin L.","unstructured":"Martin L. Puterman . 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming . John Wiley & Sons . Martin L. Puterman. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons."},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10514-009-9120-4"},{"key":"e_1_2_2_48_1","volume-title":"Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI\u201909)","author":"Sanner Scott","year":"2009","unstructured":"Scott Sanner , Robby Goetschalckx , Kurt Driessens , and Guy Shani . 2009 . Bayesian real-time dynamic programming . In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI\u201909) . 1784--1789. Scott Sanner, Robby Goetschalckx, Kurt Driessens, and Guy Shani. 2009. Bayesian real-time dynamic programming. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI\u201909). 1784--1789."},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/1102351.1102454"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.5555\/518662"},{"key":"e_1_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1177\/105971230501300301"},{"key":"e_1_2_2_53_1","volume-title":"Barto","author":"Sutton Richard S.","year":"1998","unstructured":"Richard S. Sutton and Andrew G . Barto . 1998 . Reinforcement Learning : An Introduction, Vol. 116 . Cambridge University Press . Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction, Vol. 116. Cambridge University Press."},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0004-3702(99)00052-1"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.5555\/1577069.1755839"}],"container-title":["ACM Transactions on Intelligent Systems and Technology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2717316","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2717316","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T07:00:44Z","timestamp":1750230044000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2717316"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,7,15]]},"references-count":53,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2015,8,13]]}},"alternative-id":["10.1145\/2717316"],"URL":"https:\/\/doi.org\/10.1145\/2717316","relation":{},"ISSN":["2157-6904","2157-6912"],"issn-type":[{"value":"2157-6904","type":"print"},{"value":"2157-6912","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,7,15]]},"assertion":[{"value":"2014-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-07-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}