{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,20]],"date-time":"2026-01-20T02:53:31Z","timestamp":1768877611249,"version":"3.49.0"},"reference-count":27,"publisher":"Association for Computing Machinery (ACM)","issue":"6","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62436010 and U23B2052"],"award-info":[{"award-number":["62436010 and U23B2052"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Beijing University of Posts and Telecommunications-China Mobile Research Institute Joint Innovation Center"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Intell. Syst. Technol."],"published-print":{"date-parts":[[2025,12,31]]},"abstract":"<jats:p>\n                    The rapid evolution of AI has driven advancements across numerous sectors. In the domain of government affairs, large language models (LLMs) hold significant potential for applications such as policy analysis, data processing, and decision support. However, their adoption in government settings faces considerable challenges, including data accessibility issues, the absence of standardized evaluation criteria, and concerns regarding model accuracy, reliability, and security. To address these challenges, we propose a comprehensive evaluation framework specifically designed for LLMs in government affairs. Built on modular principles, this framework ensures adaptability across various industries. Additionally, we introduce the Multi-Scenario Government Affairs Benchmark (MSGABench\n                    <jats:xref ref-type=\"fn\">\n                      <jats:sup>1<\/jats:sup>\n                    <\/jats:xref>\n                    ) dataset, a Chinese-language dataset specifically crafted to meet the practical needs of government professionals. Employing the proposed framework and the MSGA dataset, we conducted an empirical evaluation of 15 prominent LLMs, revealing critical insights: (1)\u2009Performance: Many models demonstrated low accuracy and reliability, particularly under minor input variations, with some dropping below 35% accuracy, whereas GPT-4 achieved above 95% reliability; (2) Security and Compliance: Significant concerns were identified, including privacy vulnerabilities, legal compliance risks, and persistent biases, which may hinder secure deployments in government contexts; (3)\u2009Task Avoidance: Certain models exhibited excessive caution, often avoiding responses to basic tasks like document classification and government-related inquiries, which restricts their usability. 
These findings highlight essential limitations and opportunities for improvement, contributing to the safe and effective application of LLMs in the government sector.\n                  <\/jats:p>","DOI":"10.1145\/3716854","type":"journal-article","created":{"date-parts":[[2025,2,13]],"date-time":"2025-02-13T13:42:53Z","timestamp":1739454173000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["The Evaluation Framework and Benchmark for Large Language Models in the Government Affairs Domain"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0009-0008-6838-7965","authenticated-orcid":false,"given":"Shuo","family":"Liu","sequence":"first","affiliation":[{"name":"China Mobile Research Institute, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0424-9965","authenticated-orcid":false,"given":"Lin","family":"Zhang","sequence":"additional","affiliation":[{"name":"Beijing Big Data Centre, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-1796-1397","authenticated-orcid":false,"given":"Weidong","family":"Liu","sequence":"additional","affiliation":[{"name":"China Mobile Research Institute, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-5813-0157","authenticated-orcid":false,"given":"Jianfeng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Beijing Big Data Centre, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-4147-999X","authenticated-orcid":false,"given":"Donghui","family":"Gao","sequence":"additional","affiliation":[{"name":"China Mobile Research Institute, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3159-2785","authenticated-orcid":false,"given":"Xiaofeng","family":"Jia","sequence":"additional","affiliation":[{"name":"Beijing Big Data Centre, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2025,11,24]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"crossref","unstructured":"D. F. Engstrom D. E. Ho C. M. Sharkey and M. F. Cu\u00e9llar. 2020. Government by algorithm: Artificial intelligence in federal administrative agencies. NYU School of Law Public Law Research Paper No. 20\u201354.","DOI":"10.2139\/ssrn.3551505"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.giq.2021.101577"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2946204"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1093\/ijlit\/eaz001"},{"key":"e_1_3_2_6_2","unstructured":"S. Bubeck V. Chandrasekaran R. Eldan J. Gehrke E. Horvitz E. Kamar P. Lee Y. T. Lee Y. Li S. Lundberg et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv:2303.12712. Retrieved from https:\/\/arxiv.org\/abs\/2303.12712"},{"key":"e_1_3_2_7_2","unstructured":"L. Huang W. Yu W. Ma W. Zhong Z. Feng H. Wang Q. Chen W. Peng X. Feng B. Qin et al. 2023. A survey on hallucination in large language models: Principles taxonomy challenges and open questions. arXiv:2311.05232. Retrieved from https:\/\/arxiv.org\/abs\/2311.05232"},{"key":"e_1_3_2_8_2","doi-asserted-by":"crossref","unstructured":"X. Li M. Liu S. Gao and W. Buntine. 2023. A survey on out-of-distribution evaluation of neural nlp models. arXiv:2306.15261. Retrieved from\u00a0https:\/\/arxiv.org\/abs\/2306.15261","DOI":"10.24963\/ijcai.2023\/749"},{"key":"e_1_3_2_9_2","unstructured":"H. Sun Z. Zhang J. Deng J. Cheng and M. Huang. 2023. Safety assessment of Chinese large language models. 
      {"key": "e_1_3_2_10_2", "unstructured": "J. Bulian M. S. Sch\u00e4fer A. Amini H. Lam M. Ciaramita B. Gaiarin M. C. H\u00fcbscher C. Buck N. G. Mede M. Leippold et al. 2023. Assessing large language models on climate information. arXiv:2310.02932. Retrieved from https://arxiv.org/abs/2310.02932"},
      {"key": "e_1_3_2_11_2", "unstructured": "P. Islam A. Kannappan D. Kiela R. Qian N. Scherrer and B. Vidgen. 2023. FinanceBench: A new benchmark for financial question answering. arXiv:2311.11944. Retrieved from https://arxiv.org/abs/2311.11944"},
      {"key": "e_1_3_2_12_2", "unstructured": "X. Wang G. H. Chen D. Song Z. Zhang Z. Chen Q. Xiao F. Jiang J. Li X. Wan B. Wang et al. 2023. Cmb: A comprehensive medical benchmark in Chinese. arXiv:2308.08833. Retrieved from https://arxiv.org/abs/2308.08833"},
      {"key": "e_1_3_2_13_2", "doi-asserted-by": "publisher", "DOI": "10.1145/364128"},
      {"key": "e_1_3_2_14_2", "doi-asserted-by": "publisher", "DOI": "10.1016/j.ijinfomgt.2021.102401"},
      {"key": "e_1_3_2_15_2", "doi-asserted-by": "publisher", "DOI": "10.3390/joitmc7010071"},
      {"key": "e_1_3_2_16_2", "doi-asserted-by": "publisher", "DOI": "10.1016/j.asej.2019.05.002"},
      {"key": "e_1_3_2_17_2", "doi-asserted-by": "publisher", "DOI": "10.1016/j.techsoc.2020.101283"},
      {"key": "e_1_3_2_18_2", "doi-asserted-by": "publisher", "DOI": "10.1109/ACCESS.2024.3349969"},
      {"key": "e_1_3_2_19_2", "unstructured": "State Council of the People\u2019s Republic of China. 2017. Guidelines for government website development (State Office No. 47). Retrieved May 15 2017 from https://www.gov.cn/zhengce/content/2017-06/08/content_5200760.htm"},
      {"key": "e_1_3_2_20_2", "unstructured": "State Council of the People\u2019s Republic of China. Regulations of the People\u2019s Republic of China on government information disclosure. Retrieved from https://www.chinatax.gov.cn/chinatax/n810214/n810641/n810687/c4347891/content.html"},
      {"key": "e_1_3_2_21_2", "unstructured": "Standardization Administration of China. 2024. Guidelines for the construction of knowledge base for government services hotline (GB/T 44191-2024). Retrieved from https://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno=B7EE27983C0F8A6372BC5C9978CCB408"},
      {"key": "e_1_3_2_22_2", "unstructured": "Standardization Administration of China. 2024. General requirements for online-offline integration of national government services platform (GB/T 44193-2024). Retrieved from https://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno=9E9459D195AAAE08D3F588F8B332E74F"},
      {"key": "e_1_3_2_23_2", "unstructured": "Standardization Administration of China. 2020. Information security technology\u2014Personal information security specification (GB/T 35273-2020). Retrieved from https://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno=4568F276E0F8346EB0FBA097AA0CE05E"},
      {"key": "e_1_3_2_24_2", "unstructured": "A. Yang B. Xiao B. Wang B. Zhang C. Bian C. Yin C. Lv D. Pan D. Wang D. Yan et al. 2023. Baichuan 2: Open large-scale language models. arXiv:2309.10305. Retrieved from https://arxiv.org/abs/2309.10305"},
      {"key": "e_1_3_2_25_2", "unstructured": "X. Wang X. Zhang Z. Luo Q. Sun Y. Cui J. Wang F. Zhang Y. Wang Z. Li Q. Yu et al. 2024. Emu3: Next-token prediction is all you need. arXiv:2409.18869. Retrieved from https://arxiv.org/abs/2409.18869"},
      {"key": "e_1_3_2_26_2", "doi-asserted-by": "publisher", "DOI": "10.1016/j.giq.2022.101679"},
      {"key": "e_1_3_2_27_2", "unstructured": "S. Zhang Q. Fang Z. Zhang Z. Ma Y. Zhou L. Huang M. Bu S. Gui Y. Chen X. Chen et al. 2023. Bayling: Bridging cross-lingual alignment and instruction following through interactive translation for large language models. arXiv:2306.10968. Retrieved from https://arxiv.org/abs/2306.10968"},
      {"key": "e_1_3_2_28_2", "unstructured": "Team GLM A. Zeng B. Xu B. Wang C. Zhang D. Yin D. Wang D. Rojas G. Feng H. Zhao H. Lai et al. 2024. ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools. arXiv:2406.12793. Retrieved from https://arxiv.org/abs/2406.12793"}
    ],
    "container-title": ["ACM Transactions on Intelligent Systems and Technology"],
    "original-title": [],
    "language": "en",
    "link": [{"URL": "https://dl.acm.org/doi/pdf/10.1145/3716854", "content-type": "unspecified", "content-version": "vor", "intended-application": "similarity-checking"}],
    "deposited": {"date-parts": [[2025,11,24]], "date-time": "2025-11-24T15:12:37Z", "timestamp": 1763997157000},
    "score": 1,
    "resource": {"primary": {"URL": "https://dl.acm.org/doi/10.1145/3716854"}},
    "subtitle": [],
    "short-title": [],
    "issued": {"date-parts": [[2025,11,24]]},
    "references-count": 27,
    "journal-issue": {"issue": "6", "published-print": {"date-parts": [[2025,12,31]]}},
    "alternative-id": ["10.1145/3716854"],
    "URL": "https://doi.org/10.1145/3716854",
    "relation": {},
    "ISSN": ["2157-6904", "2157-6912"],
    "issn-type": [{"value": "2157-6904", "type": "print"}, {"value": "2157-6912", "type": "electronic"}],
    "subject": [],
    "published": {"date-parts": [[2025,11,24]]},
    "assertion": [
      {"value": "2024-02-28", "order": 0, "name": "received", "label": "Received", "group": {"name": "publication_history", "label": "Publication History"}},
      {"value": "2024-12-26", "order": 2, "name": "accepted", "label": "Accepted", "group": {"name": "publication_history", "label": "Publication History"}},
      {"value": "2025-11-24", "order": 3, "name": "published", "label": "Published", "group": {"name": "publication_history", "label": "Publication History"}}
    ]
  }
}