{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T12:35:42Z","timestamp":1772195742431,"version":"3.50.1"},"reference-count":38,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T00:00:00Z","timestamp":1772150400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"PIONIER AI\u2014Robot Autonom Pentru Afaceri","award":["338328"],"award-info":[{"award-number":["338328"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Systems"],"abstract":"<jats:p>Large Language Models (LLMs) are increasingly being adopted in business settings; however, there remains a shortage of evaluation tools that account for country-specific regulations, particularly for Romania\u2019s taxation and financial accounting requirements. RO-FIN-LLM is a benchmark designed to test how well LLMs handle Romania-specific regulatory question answering in taxation (including VAT regimes, income\/profit tax, microenterprise rules, and other obligations) and financial accounting (including journal entries\/monographs, amortization, provisions, and foreign exchange transactions). The benchmark contains questions curated by experts, each including the applicable regulatory time frames and the legal sources for the answers. Evaluation is performed in two protocols: closed-book and open-book with Retrieval Augmented Generation (RAG), using Tavily Search API. Evaluation metrics are represented by rubrics, namely correctness, legal citation quality, and clarity\/structure. A subset of answers produced by three models was additionally evaluated by 12 specialists in the financial-accounting domain. In this revision, we also describe a public release plan for the question schema, prompts, and evaluation scripts to support independent reproducibility.<\/jats:p>","DOI":"10.3390\/systems14030244","type":"journal-article","created":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T10:41:02Z","timestamp":1772188862000},"page":"244","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["RO-FIN-LLM: A Benchmark with LLM-as-a-Judge and Human Evaluators for Romanian Tax and Accounting"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-0364-6661","authenticated-orcid":false,"given":"Maria-Ecaterina","family":"Olariu","sequence":"first","affiliation":[{"name":"Faculty of Computer Science, University Alexandru Ioan Cuza, 700506 Iasi, Romania"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-9898-1982","authenticated-orcid":false,"given":"Vlad-Gabriel","family":"Buinceanu","sequence":"additional","affiliation":[{"name":"Nexus Media Romania, 700285 Iasi, Romania"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1274-0914","authenticated-orcid":false,"given":"Cristian","family":"Simionescu","sequence":"additional","affiliation":[{"name":"Faculty of Computer Science, University Alexandru Ioan Cuza, 700506 Iasi, Romania"},{"name":"Nexus Media Romania, 700285 Iasi, Romania"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5403-8050","authenticated-orcid":false,"given":"Octavian","family":"Dospinescu","sequence":"additional","affiliation":[{"name":"Faculty of Economics and Business Administration, University Alexandru Ioan Cuza, 700506 Iasi, Romania"}]},{"given":"R\u0103zvan","family":"Georgescu","sequence":"additional","affiliation":[{"name":"Nexus Media Romania, 700285 Iasi, Romania"}]},{"given":"Cezar","family":"Tudor","sequence":"additional","affiliation":[{"name":"Nexus Media Romania, 700285 Iasi, Romania"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3564-8440","authenticated-orcid":false,"given":"Adrian","family":"Iftene","sequence":"additional","affiliation":[{"name":"Faculty of Computer Science, University Alexandru Ioan Cuza, 700506 Iasi, Romania"}]},{"given":"Ana-Maria","family":"Bores","sequence":"additional","affiliation":[{"name":"Faculty of Economics and Business Administration, \u0218tefan cel Mare Suceava University, 720229 Suceava, Romania"}]}],"member":"1968","published-online":{"date-parts":[[2026,2,27]]},"reference":[{"key":"ref_1","unstructured":"Yang, X., Zang, S., Ren, Y., Peng, D., and Wen, Z. (2024). Evaluating Large Language Models on Financial Report Summarization: An Empirical Study. arXiv."},{"key":"ref_2","unstructured":"Organisation for Economic Co-Operation and Development (2025). Tax Administration, OECD."},{"key":"ref_3","unstructured":"Zhao, H., Liu, Z., Wu, Z., Li, Y., Yang, T., Shu, P., Xu, S., Dai, H., Zhao, L., and Mai, G. (2024). Revolutionizing finance with llms: An overview of applications and insights. arXiv."},{"key":"ref_4","unstructured":"Yang, C., Xu, C., and Qi, Y. (2024). Financial knowledge large language model. arXiv."},{"key":"ref_5","unstructured":"John, I. (2025). Cloud-Based ERP and AI Integration for Cost Optimization, ResearchGate."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"104","DOI":"10.34218\/IJCET_16_04_007","article-title":"Embedding AI in Erp Workflows: A New Paradigm for Intelligent Decision Support","volume":"16","author":"Ashok","year":"2025","journal-title":"Int. J. Comput. Eng. Technol."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"73342","DOI":"10.1109\/ACCESS.2025.3564133","article-title":"Implementing Generative AI Into ERP Software","volume":"13","author":"Sarferaz","year":"2025","journal-title":"IEEE Access"},{"key":"ref_8","first-page":"447","article-title":"The Rise of Generative AI Agents in Finance: Operational Disruption and Strategic Evolution","volume":"9","author":"Hettiarachchi","year":"2025","journal-title":"Int. J. Eng. Technol. Res. Manag."},{"key":"ref_9","unstructured":"Shahid, I. (2025). AI Business Strategies, ResearchGate."},{"key":"ref_10","unstructured":"Ipeirotis, P., and Zheng, H. (2025). Natural Language Interfaces for Databases: What Do Users Think?. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Darmawati, D., Jaafar, N.I., HS, R., Baja, H.K., Purisamya, A.J., Yolanda, A.M.W., Amir, B., and Juanda, M.R.P. (2025). The Role of Artificial Intelligence in Improving the Efficiency and Accuracy of Local Government Financial Reporting: A Systematic Literature Review. J. Risk Financ. Manag., 18.","DOI":"10.3390\/jrfm18110601"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Machucho, R., and Ortiz, D. (2025). The Impacts of Artificial Intelligence on Business Innovation: A Comprehensive Review of Applications, Organizational Challenges, and Ethical Considerations. Systems, 13.","DOI":"10.3390\/systems13040264"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. (2023). G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv.","DOI":"10.18653\/v1\/2023.emnlp-main.153"},{"key":"ref_14","unstructured":"Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., and Liu, Y. (2024). Llms-as-judges: A comprehensive survey on llm-based evaluation methods. arXiv."},{"key":"ref_15","unstructured":"Ho, X., Huang, J., Boudin, F., and Aizawa, A. (2025). LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA. arXiv."},{"key":"ref_16","unstructured":"Wu, S., Irsoy, O., Lu, S., Dabravolski, V., Dredze, M., Gehrmann, S., Kambadur, P., Rosenberg, D., and Mann, G. (2023). Bloomberggpt: A large language model for finance. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Chen, Z., Li, S., Smiley, C., Ma, Z., Shah, S., and Wang, W.Y. (2022). Convfinqa: Exploring the chain of numerical reasoning in conversational finance question answering. arXiv.","DOI":"10.18653\/v1\/2022.emnlp-main.421"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Hashemi, H., Eisner, J., Rosset, C., Van Durme, B., and Kedzie, C. (2024). LLM-rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts. arXiv.","DOI":"10.18653\/v1\/2024.acl-long.745"},{"key":"ref_19","first-page":"46595","article-title":"Judging llm-as-a-judge with mt-bench and chatbot arena","volume":"36","author":"Zheng","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Srivastava, P., Malik, M., Gupta, V., Ganu, T., and Roth, D. (2024). Evaluating LLMs\u2019 Mathematical Reasoning in Financial Document Question Answering. arXiv.","DOI":"10.18653\/v1\/2024.findings-acl.231"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"D\u2019Souza, J., Giglou, H.B., and M\u00fcnch, Q. (2025). YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering. arXiv.","DOI":"10.18653\/v1\/2025.acl-long.675"},{"key":"ref_22","unstructured":"Tandan, P. (2025). AI and Business Administration, ResearchGate."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"782","DOI":"10.1002\/asi.23062","article-title":"Good Debt or Bad Debt","volume":"65","author":"Malo","year":"2014","journal-title":"J. Assoc. Inf. Sci. Technol."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Shah, R., Chawla, K., Eidnani, D., Shah, A., Du, W., Chava, S., Raman, N., Smiley, C., Chen, J., and Yang, D. (2022). When FLUE Meets FLANG: Benchmarks and Large Pretrained Language Model for Financial Domain. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics.","DOI":"10.18653\/v1\/2022.emnlp-main.148"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Reddy, V., Koncel-Kedziorski, R., Lai, V.D., Krumdick, M., Lovering, C., and Tanner, C. (2024). Docfinqa: A long-context financial reasoning dataset. arXiv.","DOI":"10.18653\/v1\/2024.acl-short.42"},{"key":"ref_26","unstructured":"Bigeard, A., Nashold, L., Krishnan, R., and Wu, S. (2025). Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks. arXiv."},{"key":"ref_27","unstructured":"Son, G., Yoon, D., Suk, J., Aula-Blasco, J., Aslan, M., Kim, V.T., Islam, S.B., Prats-Cristi\u00e0, J., Tormo-Ba\u00f1uelos, L., and Kim, S. (2024). MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models. arXiv."},{"key":"ref_28","unstructured":"(2025, October 10). TaxEval (v2). Available online: https:\/\/www.vals.ai\/benchmarks\/tax_eval_v2-09-29-2025."},{"key":"ref_29","unstructured":"(2025, October 10). Finance Agent. Available online: https:\/\/www.vals.ai\/benchmarks\/finance_agent-09-29-2025."},{"key":"ref_30","unstructured":"(2025, October 10). Tavily Documentation. Available online: https:\/\/docs.tavily.com\/welcome."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wang, Y., and Xu, H. (2024). SRSA: A Cost-Efficient Strategy-Router Search Agent for Real-world Human-Machine Interactions. Proceedings of the 2024 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE.","DOI":"10.1109\/ICDMW65004.2024.00046"},{"key":"ref_32","unstructured":"Mialon, G., Fourrier, C., Wolf, T., LeCun, Y., and Scialom, T. (2023, January 1\u20135). Gaia: A benchmark for general ai assistants. Proceedings of the Twelfth International Conference on Learning Representations, Kigali, Rwanda."},{"key":"ref_33","unstructured":"Soni, A.B., Li, B., Wang, X., Chen, V., and Neubig, G. (2025). Coding Agents with Multimodal Browsing are Generalist Problem Solvers. arXiv."},{"key":"ref_34","unstructured":"(2025, October 10). CFA\u00ae Program Level II Exam. Available online: https:\/\/www.cfainstitute.org\/programs\/cfa-program\/candidate-resources\/level-ii-exam."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"276","DOI":"10.11613\/BM.2012.031","article-title":"Interrater reliability: The kappa statistic","volume":"22","author":"McHugh","year":"2012","journal-title":"Biochem. Med."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"155","DOI":"10.1016\/j.jcm.2016.02.012","article-title":"A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research","volume":"15","author":"Koo","year":"2016","journal-title":"J. Chiropr. Med."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"420","DOI":"10.1037\/0033-2909.86.2.420","article-title":"Intraclass correlations: Uses in assessing rater reliability","volume":"86","author":"Shrout","year":"1979","journal-title":"Psychol. Bull."},{"key":"ref_38","first-page":"15","article-title":"Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation","volume":"36","author":"Aroyo","year":"2015","journal-title":"AI Mag."}],"container-title":["Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2079-8954\/14\/3\/244\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T11:12:43Z","timestamp":1772190763000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2079-8954\/14\/3\/244"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,27]]},"references-count":38,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2026,3]]}},"alternative-id":["systems14030244"],"URL":"https:\/\/doi.org\/10.3390\/systems14030244","relation":{},"ISSN":["2079-8954"],"issn-type":[{"value":"2079-8954","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,27]]}}}