{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T04:34:10Z","timestamp":1780634050938,"version":"3.54.1"},"reference-count":47,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2025,1,27]],"date-time":"2025-01-27T00:00:00Z","timestamp":1737936000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,1,27]],"date-time":"2025-01-27T00:00:00Z","timestamp":1737936000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Nat Mach Intell"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Large language models (LLMs) have become increasingly capable, but their development often requires substantial computational resources. Although model merging has emerged as a cost-effective promising approach for creating new models by combining existing ones, it currently relies on human intuition and domain knowledge, limiting its potential. Here we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models such as a Japanese LLM with math reasoning capabilities. Surprisingly, our Japanese math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with substantially more parameters, despite not being explicitly trained for such tasks. 
Furthermore, a culturally aware Japanese vision\u2013language model generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese vision\u2013language models. This work not only contributes new state-of-the-art models back to the open-source community but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.<\/jats:p>","DOI":"10.1038\/s42256-024-00975-8","type":"journal-article","created":{"date-parts":[[2025,1,27]],"date-time":"2025-01-27T10:12:30Z","timestamp":1737972750000},"page":"195-204","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":37,"title":["Evolutionary optimization of model merging recipes"],"prefix":"10.1038","volume":"7","author":[{"given":"Takuya","family":"Akiba","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Makoto","family":"Shing","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yujin","family":"Tang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-9875-0906","authenticated-orcid":false,"given":"Qi","family":"Sun","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8765-8574","authenticated-orcid":false,"given":"David","family":"Ha","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2025,1,27]]},"reference":[{"key":"975_CR1","unstructured":"Goddard, C. O. mergekit. GitHub https:\/\/github.com\/arcee-ai\/mergekit (2024)."},{"key":"975_CR2","unstructured":"Labonne, M. Merge large language models with mergekit. 
Hugging Face Blog https:\/\/huggingface.co\/blog\/mlabonne\/merge-models (2024)."},{"key":"975_CR3","unstructured":"HuggingFace. Open llm leaderboard. Hugging Face Blog https:\/\/huggingface.co\/spaces\/HuggingFaceH4\/open_llm_leaderboard (2023)."},{"key":"975_CR4","unstructured":"Wortsman, M. et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning Vol. 162 (eds Chaudhuri, K. et al.) 23965\u201323998 (PMLR, 2022); https:\/\/proceedings.mlr.press\/v162\/wortsman22a.html"},{"key":"975_CR5","unstructured":"Ilharco, G. et al. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations (OpenReview.net, 2023); https:\/\/openreview.net\/forum?id=6t0Kwf8-jrj"},{"key":"975_CR6","unstructured":"Yadav, P., Tam, D., Choshen, L., Raffel, C. A. & Bansal, M. Ties-merging: resolving interference when merging models. In Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 7093\u20137115 (Curran Associates, 2023)."},{"key":"975_CR7","unstructured":"Yu, L., Yu, B., Yu, H., Huang, F. & Li, Y. Language models are Super Mario: absorbing abilities from homologous models as a free lunch. In International Conference on Machine Learning Vol. 235 (eds Salakhutdinov, R. et al.) 57755\u201357775 (PMLR, 2024); https:\/\/proceedings.mlr.press\/v235\/yu24p.html"},{"key":"975_CR8","unstructured":"Ainsworth, S. K., Hayase, J. & Srinivasa, S. S. Git re-basin: merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations (OpenReview.net, 2023)."},{"key":"975_CR9","first-page":"17703","volume":"35","author":"MS Matena","year":"2022","unstructured":"Matena, M. S. & Raffel, C. A. Merging models with fisher-weighted averaging. Adv. Neural Inf. Process. Syst. 35, 17703\u201317716 (2022).","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"975_CR10","unstructured":"Hansen, N. in Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms (eds Lozano, J. A. et al.) 75\u2013102 (Springer, 2006)."},{"key":"975_CR11","doi-asserted-by":"crossref","unstructured":"Geva, M., Caciularu, A., Wang, K. R. & Goldberg, Y. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing (eds Goldberg, Y., Kozareva, Z. & Zhang, Y.) 30\u201345 (Association for Computational Linguistics, 2022).","DOI":"10.18653\/v1\/2022.emnlp-main.3"},{"key":"975_CR12","unstructured":"nostalgebraist. Interpreting gpt: the logit lens. LessWrong https:\/\/www.lesswrong.com\/posts\/AcKRB8wDpdaN6v6ru\/interpreting-gpt-the-logit-lens (2021)."},{"key":"975_CR13","first-page":"17359","volume":"35","author":"K Meng","year":"2022","unstructured":"Meng, K., Bau, D., Andonian, A. & Belinkov, Y. Locating and editing factual associations in gpt. Adv. Neural Inf. Process. Syst. 35, 17359\u201317372 (2022).","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"975_CR14","unstructured":"Sun, Q., Pickett, M., Nain, A. K. & Jones, L. Transformer layers as painters. Preprint at https:\/\/arxiv.org\/abs\/2407.09298 (2024)."},{"key":"975_CR15","doi-asserted-by":"publisher","first-page":"131","DOI":"10.1162\/neco.1992.4.1.131","volume":"4","author":"J Schmidhuber","year":"1992","unstructured":"Schmidhuber, J. Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Comput. 4, 131\u2013139 (1992).","journal-title":"Neural Comput."},{"key":"975_CR16","unstructured":"Ha, D., Dai, A. & Le, Q. V. Hypernetworks. 
In International Conference on Learning Representations (OpenReview.net, 2017); https:\/\/openreview.net\/forum?id=rkpACe1lx"},{"key":"975_CR17","doi-asserted-by":"publisher","first-page":"182","DOI":"10.1109\/4235.996017","volume":"6","author":"K Deb","year":"2002","unstructured":"Deb, K., Pratap, A., Agarwal, S. & Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comp. 6, 182\u2013197 (2002).","journal-title":"IEEE Trans. Evol. Comp."},{"key":"975_CR18","unstructured":"Shi, F. et al. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations (OpenReview.net, 2023); https:\/\/openreview.net\/pdf?id=fR3wGCk-IXp"},{"key":"975_CR19","unstructured":"Cobbe, K. et al. Training verifiers to solve math word problems. Preprint at https:\/\/arxiv.org\/abs\/2110.14168 (2021)."},{"key":"975_CR20","unstructured":"Li, J., Li, D., Savarese, S. & Hoi, S. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning Vol. 202 (eds Krause, A. et al.) 19730\u201319742 (PMLR, 2023); https:\/\/proceedings.mlr.press\/v202\/li23q.html"},{"key":"975_CR21","unstructured":"Dai, W. et al. Instructblip: towards general-purpose vision\u2013language models with instruction tuning. In Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 49250\u201349267 (Curran Associates, 2023)."},{"key":"975_CR22","unstructured":"Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. In Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 34891\u201334916 (Curran Associates, 2023)."},{"key":"975_CR23","doi-asserted-by":"crossref","unstructured":"Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved baselines with visual instruction tuning. In Proc. 
IEEE\/CVF Conference on Computer Vision and Pattern Recognition 26296\u201326306 (2024).","DOI":"10.1109\/CVPR52733.2024.02484"},{"key":"975_CR24","unstructured":"Bai, J. et al. Qwen-vl: a versatile vision\u2013language model for understanding, localization, text reading, and beyond. Preprint at https:\/\/arxiv.org\/abs\/2308.12966 (2023)."},{"key":"975_CR25","unstructured":"Labonne, M. Automerger experiment. Twitter https:\/\/twitter.com\/maximelabonne\/status\/1767124527551549860 (2024)."},{"key":"975_CR26","unstructured":"White, T. Sampling generative networks. Preprint at https:\/\/arxiv.org\/abs\/1609.04468 (2016)."},{"key":"975_CR27","unstructured":"AI, S. Evosdxl-jp-v1. sakana.ai https:\/\/sakana.ai\/evosdxl-jp\/ (2024)."},{"key":"975_CR28","unstructured":"Lin, S., Wang, A. & Yang, X. Sdxl-lightning: progressive adversarial diffusion distillation. Preprint at https:\/\/arxiv.org\/abs\/2402.13929 (2024)."},{"key":"975_CR29","unstructured":"AI, S. Evovlm-jp-v2. sakana.ai https:\/\/sakana.ai\/evovlm-jp\/ (2024)."},{"key":"975_CR30","unstructured":"AI, S. Evoukiyoe. sakana.ai https:\/\/sakana.ai\/evo-ukiyoe\/ (2024)."},{"key":"975_CR31","doi-asserted-by":"publisher","unstructured":"Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: a next-generation hyperparameter optimization framework. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2623\u20132631 (Association for Computing Machinery, 2019); https:\/\/doi.org\/10.1145\/3292500.3330701","DOI":"10.1145\/3292500.3330701"},{"key":"975_CR32","unstructured":"augmxnt. shisa-gamma-7b. Hugging Face https:\/\/hf.co\/augmxnt\/shisa-gamma-7b-v1 (2023)."},{"key":"975_CR33","unstructured":"Luo, H. et al. Wizardmath: empowering mathematical reasoning for large language models via reinforced evol-instruct. Preprint at https:\/\/arxiv.org\/abs\/2308.09583 (2023)."},{"key":"975_CR34","unstructured":"Chern, E. et al. Generative ai for math: Abel. 
GitHub https:\/\/github.com\/GAIR-NLP\/abel (2023)."},{"key":"975_CR35","unstructured":"Jiang, A. Q. et al. Mistral 7b. Preprint at https:\/\/arxiv.org\/abs\/2310.06825 (2023)."},{"key":"975_CR36","doi-asserted-by":"crossref","unstructured":"Joulin, A., Grave, E., Bojanowski, P. & Mikolov, T. Bag of tricks for efficient text classification. In Proc. 15th Conference of the European Chapter of the Association for Computational Linguistics (eds Lapata, M., Blunsom, P. & Koller, A.) 427\u2013431 (Association for Computational Linguistics, 2017).","DOI":"10.18653\/v1\/E17-2068"},{"key":"975_CR37","unstructured":"Joulin, A. et al. Fasttext.zip: compressing text classification models. Preprint at https:\/\/arxiv.org\/abs\/1612.03651 (2016)."},{"key":"975_CR38","unstructured":"AI, S. Jp language model evaluation harness. GitHub https:\/\/github.com\/Stability-AI\/lm-evaluation-harness\/tree\/jp-stable (2024)."},{"key":"975_CR39","doi-asserted-by":"publisher","unstructured":"Gao, L. et al. A framework for few-shot language model evaluation. Zenodo https:\/\/doi.org\/10.5281\/zenodo.14506035 (2023).","DOI":"10.5281\/zenodo.14506035"},{"key":"975_CR40","unstructured":"AI, S. Japanese stable lm beta. stability.ai https:\/\/ja.stability.ai\/blog\/japanese-stable-lm-beta (2024)."},{"key":"975_CR41","unstructured":"rinna. Lm benchmark. GitHub https:\/\/rinnakk.github.io\/research\/benchmarks\/lm\/index.html (2024)."},{"key":"975_CR42","doi-asserted-by":"crossref","unstructured":"Tang, Y., Tian, Y. & Ha, D. EvoJAX: hardware-accelerated neuroevolution. In Proc. the Genetic and Evolutionary Computation Conference Companion 308\u2013311 (Association for Computing Machinery, 2022).","DOI":"10.1145\/3520304.3528770"},{"key":"975_CR43","unstructured":"Liu, H. et al. Llava-next: improved reasoning, ocr, and world knowledge. LLaVA https:\/\/llava-vl.github.io\/blog\/2024-01-30-llava-next\/ (2024)."},{"key":"975_CR44","unstructured":"Shimizu, N., Rong, N. & Miyazaki, T. 
Visual question answering dataset for bilingual image understanding: a study of cross-lingual transfer using attention maps. In Proc. 27th International Conference on Computational Linguistics, 1918\u20131928 (Association for Computational Linguistics, 2018); http:\/\/aclweb.org\/anthology\/C18-1163"},{"key":"975_CR45","unstructured":"OpenAI. Gpt-4v(ision) system card. https:\/\/cdn.openai.com\/papers\/GPTV_System_Card.pdf (OpenAI, 2023)."},{"key":"975_CR46","unstructured":"Shing, M. & Akiba, T. Japanese stable vlm. Hugging Face https:\/\/huggingface.co\/stabilityai\/japanese-stable-vlm (2023)."},{"key":"975_CR47","doi-asserted-by":"publisher","unstructured":"Akiba, T., Shing, M., Tang, Y., Sun, Q. & Ha, D. Sakanaai\/evolutionary-model-merge: v0.1.0 Zenodo https:\/\/doi.org\/10.5281\/zenodo.14241914 (2024).","DOI":"10.5281\/zenodo.14241914"}],"container-title":["Nature Machine Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00975-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00975-8","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00975-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,23]],"date-time":"2025-02-23T23:03:40Z","timestamp":1740351820000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/s42256-024-00975-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,27]]},"references-count":47,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2025,2]]}},"alternative-id":["975"],"URL":"https:\/\/doi.org\/10.1038\/s42256-024-00975-8","relation":{},"ISSN":["2522-5839"],"issn-type":[{"value":"2522-5839","type":"electronic"}],"subject":[],"published
":{"date-parts":[[2025,1,27]]},"assertion":[{"value":"22 April 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 December 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 January 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}