{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T17:13:49Z","timestamp":1763486029349,"version":"3.45.0"},"reference-count":30,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T00:00:00Z","timestamp":1763424000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ministry of Economic Development of the Russian Federation","award":["IGK 000000C313925P4C0002","139-15-2025-010"],"award-info":[{"award-number":["IGK 000000C313925P4C0002","139-15-2025-010"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>The growing number of domain-specific machine learning benchmarks has driven methodological progress, yet real-world deployments require a different evaluation approach. Model-aware synthetic benchmarks, designed to emphasize failure modes of existing models, are proposed to address this need. However, evaluating already well-performing models presents a significant challenge, as the limited number of high-quality data points where they exhibit errors makes it difficult to obtain statistically reliable estimates. To address this gap, we proposed a two-step benchmark construction process: (i) using a genetic algorithm to augment the data points where data-driven models exhibit poor prediction quality; (ii) using a generative model to approximate the distribution of these points. We established a general formulation for such benchmark construction, which can be adapted to non-classical machine learning models. Our experimental study demonstrates that our approach enables the accurate evaluation of data-driven models for both regression and classification problems.<\/jats:p>","DOI":"10.3390\/make7040148","type":"journal-article","created":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T16:49:51Z","timestamp":1763484591000},"page":"148","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Model-Aware Automatic Benchmark Generation with Self-Error Instructions for Data-Driven Models"],"prefix":"10.3390","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5774-4076","authenticated-orcid":false,"given":"Kirill","family":"Zakharov","sequence":"first","affiliation":[{"name":"Research Center \u201cStrong Artificial Intelligence in Industry\u201d, ITMO University, Birzhevaya Liniya 14, Saint Petersburg 199034, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1588-8164","authenticated-orcid":false,"given":"Alexander","family":"Boukhanovsky","sequence":"additional","affiliation":[{"name":"Research Center \u201cStrong Artificial Intelligence in Industry\u201d, ITMO University, Birzhevaya Liniya 14, Saint Petersburg 199034, Russia"},{"name":"Netherlands Institute of Advanced Studies, Korte Spinhuissteeg 3, 1012 CG Amsterdam, The Netherlands"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,18]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"9058","DOI":"10.1038\/s41467-024-52900-7","article-title":"Benchmarking machine learning methods for synthetic lethality prediction in cancer","volume":"15","author":"Feng","year":"2024","journal-title":"Nat. Commun."},{"key":"ref_2","unstructured":"Maheshwari, G., Ivanov, D., and Haddad, K.E. (2024). Efficacy of synthetic data as a benchmark. arXiv."},{"key":"ref_3","unstructured":"Liu, Y., Khandagale, S., White, C., and Neiswanger, W. (2021). Synthetic benchmarks for scientific research in explainable machine learning. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"109","DOI":"10.1016\/j.future.2021.08.022","article-title":"Automated evolutionary approach for the design of composite machine learning pipelines","volume":"127","author":"Nikitin","year":"2022","journal-title":"Future Gener. Comput. Syst."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"106622","DOI":"10.1016\/j.knosys.2020.106622","article-title":"AutoML: A survey of the state-of-the-art","volume":"212","author":"He","year":"2021","journal-title":"Knowl.-Based Syst."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Truong, A., Walters, A., Goodsitt, J., Hines, K., Bruss, C.B., and Farivar, R. (2019, January 4\u20136). Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools. Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA.","DOI":"10.1109\/ICTAI.2019.00209"},{"key":"ref_7","first-page":"86435","article-title":"Benchmark data repositories for better benchmarking","volume":"37","author":"Longjohn","year":"2024","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1016\/j.wasman.2024.02.017","article-title":"CODD: A benchmark dataset for the automated sorting of construction and demolition waste","volume":"178","author":"Demetriou","year":"2024","journal-title":"Waste Manag."},{"key":"ref_9","unstructured":"Xia, C.S., Deng, Y., and Zhang, L. (2024). Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM. arXiv."},{"key":"ref_10","unstructured":"Mahdavi, H., Hashemi, A., Daliri, M., Mohammadipour, P., Farhadi, A., Malek, S., Yazdanifard, Y., Khasahmadi, A., and Honavar, V. (2025). Brains vs. bytes: Evaluating llm proficiency in olympiad mathematics. arXiv."},{"key":"ref_11","unstructured":"Shi, Q., Tang, M., Narasimhan, K., and Yao, S. (2024). Can Language Models Solve Olympiad Programming?. arXiv."},{"key":"ref_12","unstructured":"Zheng, Z., Cheng, Z., Shen, Z., Zhou, S., Liu, K., He, H., Li, D., Wei, S., Hao, H., and Yao, J. (2025). LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Ang, Y., Huang, Q., Bao, Y., Tung, A.K.H., and Huang, Z. (2023). TSGBench: Time Series Generation Benchmark. arXiv.","DOI":"10.14778\/3632093.3632097"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Yao, B.M., Wang, Q., and Huang, L. (2025). Error-driven Data-efficient Large Multimodal Model Tuning. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics.","DOI":"10.18653\/v1\/2025.acl-long.992"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Mendes, P., Romano, P., and Garlan, D. (2024). Error-driven uncertainty aware training. arXiv.","DOI":"10.3233\/FAIA240683"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"417","DOI":"10.1007\/s10852-005-9020-3","article-title":"Ensemble Learning Using Multi-Objective Evolutionary Algorithms","volume":"5","author":"Chandra","year":"2006","journal-title":"J. Math. Model. Algorithms"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1016\/j.inffus.2004.04.004","article-title":"Diversity creation methods: A survey and categorisation","volume":"6","author":"Brown","year":"2005","journal-title":"Inf. Fusion"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1007\/s42044-022-00100-1","article-title":"Hybrid stacked ensemble combined with genetic algorithms for diabetes prediction","volume":"5","author":"Abdollahi","year":"2022","journal-title":"Iran J. Comput. Sci."},{"key":"ref_19","unstructured":"Shojaee, P., Nguyen, N.H., Meidani, K., Farimani, A.B., Doan, K.D., and Reddy, C.K. (2025). LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models. arXiv."},{"key":"ref_20","unstructured":"Rafiei Oskooei, A., Babacan, M., Ya\u011fci, E., Alptekin, C., and Bu\u011fday, A. (2024, January 19\u201321). Beyond Synthetic Benchmarks: Assessing Recent LLMs for Code Generation. Proceedings of the 14th International Workshop on Computer Science and Engineering (WCSE 2024), Phuket Island, Thailand."},{"key":"ref_21","unstructured":"Ronchetti, E.M., and Huber, P.J. (2009). Robust Statistics, John Wiley & Sons."},{"key":"ref_22","first-page":"2825","article-title":"Scikit-learn: Machine Learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_23","unstructured":"Syswerda, G. (1989, January 1). Uniform crossover in genetic algorithms. Proceedings of the Third International Conference on Genetic Algorithms and Their Application, Morgan Kaufmann, San Mateo, CA, USA."},{"key":"ref_24","first-page":"188","article-title":"Tournament selection","volume":"1","author":"Blickle","year":"2000","journal-title":"Evol. Comput."},{"key":"ref_25","first-page":"2171","article-title":"DEAP: Evolutionary Algorithms Made Easy","volume":"13","author":"Fortin","year":"2012","journal-title":"J. Mach. Learn. Res."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. (2018, January 2\u20137). Film: Visual reasoning with a general conditioning layer. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.11671"},{"key":"ref_27","unstructured":"Loshchilov, I., and Hutter, F. (2017). Decoupled weight decay regularization. arXiv."},{"key":"ref_28","first-page":"1372","article-title":"Synthetic financial time series generation with regime clustering","volume":"14","author":"Zakharov","year":"2023","journal-title":"J. Adv. Inf. Technol"},{"key":"ref_29","unstructured":"Yuan, X., and Qiao, Y. (2024). Diffusion-ts: Interpretable diffusion for general time series generation. arXiv."},{"key":"ref_30","unstructured":"Liao, S., Ni, H., Szpruch, L., Wiese, M., Sabate-Vidales, M., and Xiao, B. (2020). Conditional sig-wasserstein gans for time series generation. arXiv."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/4\/148\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T17:11:19Z","timestamp":1763485879000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/4\/148"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,18]]},"references-count":30,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["make7040148"],"URL":"https:\/\/doi.org\/10.3390\/make7040148","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,18]]}}}