{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T07:02:36Z","timestamp":1760598156476,"version":"build-2065373602"},"reference-count":28,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2022,5,30]],"date-time":"2022-05-30T00:00:00Z","timestamp":1653868800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Poker has been considered a challenging problem in both artificial intelligence and game theory because poker is characterized by imperfect information and uncertainty, which are similar to many realistic problems like auctioning, pricing, cyber security, and operations. However, it is not clear that playing an equilibrium policy in multi-player games would be wise so far, and it is infeasible to theoretically validate whether a policy is optimal. Therefore, designing an effective optimal policy learning method has more realistic significance. This paper proposes an optimal policy learning method for multi-player poker games based on Actor-Critic reinforcement learning. Firstly, this paper builds the Actor network to make decisions with imperfect information and the Critic network to evaluate policies with perfect information. Secondly, this paper proposes a novel multi-player poker policy update method: asynchronous policy update algorithm (APU) and dual-network asynchronous policy update algorithm (Dual-APU) for multi-player multi-policy scenarios and multi-player sharing-policy scenarios, respectively. Finally, this paper takes the most popular six-player Texas hold \u2019em poker to validate the performance of the proposed optimal policy learning method. 
The experiments demonstrate the policies learned by the proposed methods perform well and gain steadily compared with the existing approaches. In sum, the policy learning methods of imperfect information games based on Actor-Critic reinforcement learning perform well on poker and can be transferred to other imperfect information games. Such training with perfect information and testing with imperfect information models show an effective and explainable approach to learning an approximately optimal policy.<\/jats:p>","DOI":"10.3390\/e24060774","type":"journal-article","created":{"date-parts":[[2022,5,30]],"date-time":"2022-05-30T10:05:14Z","timestamp":1653905114000},"page":"774","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Optimal Policy of Multiplayer Poker via Actor-Critic Reinforcement Learning"],"prefix":"10.3390","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8327-555X","authenticated-orcid":false,"given":"Daming","family":"Shi","sequence":"first","affiliation":[{"name":"Department of Automation, Tsinghua University, Beijing 100084, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9607-2679","authenticated-orcid":false,"given":"Xudong","family":"Guo","sequence":"additional","affiliation":[{"name":"Department of Automation, Tsinghua University, Beijing 100084, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yi","family":"Liu","sequence":"additional","affiliation":[{"name":"Department of Automation, Tsinghua University, Beijing 100084, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenhui","family":"Fan","sequence":"additional","affiliation":[{"name":"Department of Automation, Tsinghua University, Beijing 100084, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2022,5,30]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Schaeffer, J. (1997). One Jump Ahead: Challenging Human Supremacy in Checkers. ICGA J., 20.","DOI":"10.3233\/ICG-1997-20207"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1016\/S0004-3702(01)00129-1","article-title":"Deep Blue","volume":"134","author":"Campbell","year":"2002","journal-title":"Artif. Intell."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"484","DOI":"10.1038\/nature16961","article-title":"Mastering the game of Go with deep neural networks and tree search","volume":"529","author":"Silver","year":"2016","journal-title":"Nature"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"354","DOI":"10.1038\/nature24270","article-title":"Mastering the game of Go without human knowledge","volume":"550","author":"Silver","year":"2017","journal-title":"Nature"},{"key":"ref_5","unstructured":"Rubin, J., and Watson, I. (2011). Computer Poker: A Review, Elsevier Science Publishers Ltd."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1126\/science.1259433","article-title":"Heads-up limit hold \u2019em poker is solved","volume":"347","author":"Bowling","year":"2015","journal-title":"Science"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"508","DOI":"10.1126\/science.aam6960","article-title":"DeepStack: Expert-level artificial intelligence in heads-up no-limit poker","volume":"356","author":"Schmid","year":"2017","journal-title":"Science"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"418","DOI":"10.1126\/science.aao1733","article-title":"Superhuman AI for heads-up no-limit poker: Libratus beats top professionals","volume":"359","author":"Brown","year":"2017","journal-title":"Science"},{"key":"ref_9","unstructured":"Heinrich, J., and Silver, D. (2016). 
Deep reinforcement learning from self-play in imperfect-information games. arXiv."},{"key":"ref_10","unstructured":"Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., P\u00e9rolat, J., Silver, D., and Graepel, T. (2017, January 4\u20139). A unified game-theoretic approach to multiagent reinforcement learning. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_11","unstructured":"Srinivasan, S., Lanctot, M., Zambaldi, V., P\u00e9rolat, J., Tuyls, K., Munos, R., and Bowling, M. (2018, January 3\u20138). Actor-Critic Policy Optimization in Partially Observable Multiagent Environments. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Zhang, J., and Liu, H. (2018, January 25\u201329). Reinforcement Learning with Monte Carlo Sampling in Imperfect Information Problems. Lecture Notes in Computer Science. Proceedings of the ICCC 2018, Salamanca, Spain.","DOI":"10.1007\/978-3-319-94307-7_5"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Yao, J., Zhang, Z., Xia, L., Yang, J., and Zhao, Q. (2020, January 20\u201322). Solving Imperfect Information Poker Games Using Monte Carlo Search and POMDP Models. 
Proceedings of the 2020 IEEE 9th Data Driven Control and Learning Systems Conference (DDCLS), Liuzhou, China.","DOI":"10.1109\/DDCLS49620.2020.9275053"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"eaay2400","DOI":"10.1126\/science.aay2400","article-title":"Superhuman AI for multiplayer poker","volume":"365","author":"Brown","year":"2019","journal-title":"Science"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1145\/1978721.1978730","article-title":"The lemonade stand game competition","volume":"10","author":"Zinkevich","year":"2011","journal-title":"ACM SIGecom Exch."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"156","DOI":"10.1109\/TSMCC.2007.913919","article-title":"A comprehensive survey of multi-agent reinforcement learning","volume":"38","author":"Busoniu","year":"2008","journal-title":"IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews"},{"key":"ref_17","unstructured":"Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv."},{"key":"ref_18","unstructured":"Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. (2016, January 19\u201324). Dueling Network Architectures for Deep Reinforcement Learning. Proceedings of the International Conference on Machine Learning, New York, NY, USA."},{"key":"ref_19","unstructured":"Hasselt, H.V., Guez, A., and Silver, D. (2016, January 12\u201317). Deep reinforcement learning with double Q-learning. Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA."},{"key":"ref_20","unstructured":"Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015, January 6\u201311). Trust Region Policy Optimization. 
Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_21","unstructured":"Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv."},{"key":"ref_22","unstructured":"Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"529","DOI":"10.1038\/nature14236","article-title":"Human-level control through deep reinforcement learning","volume":"518","author":"Mnih","year":"2015","journal-title":"Nature"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Gupta, J.K., Egorov, M., and Kochenderfer, M. (2017, January 8\u201312). Cooperative Multi-agent Control Using Deep Reinforcement Learning. Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, Sao Paulo, Brazil.","DOI":"10.1007\/978-3-319-71682-4_5"},{"key":"ref_25","unstructured":"Heinrich, J., Lanctot, M., and Silver, D. (2015, January 6\u201311). Fictitious Self-Play in Extensive-Form Games. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_26","unstructured":"Sklansky, D., and Miller, E. (2006). No Limit Hold \u2019em: Theory and Practice, Two Plus Two Publishing LLC."},{"key":"ref_27","unstructured":"Krieger, L. (2009). Hold \u2019em Excellence-From Beginner to Winner, ConJelCo LLC. Chapter 5."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Coulom, R. (2007, January 29\u201331). Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. Computers and Games. 
Proceedings of the 5th International Conference, CG 2006, Turin, Italy.","DOI":"10.1007\/978-3-540-75538-8_7"}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/24\/6\/774\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T23:22:11Z","timestamp":1760138531000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/24\/6\/774"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,30]]},"references-count":28,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2022,6]]}},"alternative-id":["e24060774"],"URL":"https:\/\/doi.org\/10.3390\/e24060774","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2022,5,30]]}}}