{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,5]],"date-time":"2026-05-05T18:56:14Z","timestamp":1778007374576,"version":"3.51.4"},"reference-count":17,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2025,7,5]],"date-time":"2025-07-05T00:00:00Z","timestamp":1751673600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Australia Linkage Project","award":["LP220200746"],"award-info":[{"award-number":["LP220200746"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Instruction tuning plays a pivotal role in aligning large language models with diverse tasks, yet its effectiveness hinges on the interplay of data quality, domain composition, and training strategies. This study moves beyond qualitative assessment to systematically quantify these factors through extensive experiments on data selection, data mixture, and training protocols. By quantifying performance trade-offs, we demonstrate that the implicit method SuperFiltering achieves an optimal balance, whereas explicit filters can induce capability conflicts. A fine-grained analysis of cross-domain interactions quantifies a near-linear competition between code and math, while showing that tool use data exhibits minimal interference. To mitigate these measured conflicts, we compare multi-task, sequential, and multi-stage training strategies, revealing that multi-stage training significantly reduces Conflict Rates while preserving domain expertise. Our findings culminate in a unified framework for optimizing instruction tuning, offering actionable, data-driven guidelines for balancing multi-domain performance and enhancing model generalization, thus advancing the field by providing a methodology to move from intuition to systematic optimization.<\/jats:p>","DOI":"10.3390\/computers14070264","type":"journal-article","created":{"date-parts":[[2025,7,7]],"date-time":"2025-07-07T04:43:37Z","timestamp":1751863417000},"page":"264","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["A Comprehensive Approach to Instruction Tuning for Qwen2.5: Data Selection, Domain Interaction, and Training Protocols"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1454-9304","authenticated-orcid":false,"given":"Xungang","family":"Gu","sequence":"first","affiliation":[{"name":"School of Information Technology, Deakin University, Melbourne, VIC 3125, Australia"}]},{"given":"Mengqi","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Information Technology, Deakin University, Melbourne, VIC 3125, Australia"}]},{"given":"Yangjie","family":"Tian","sequence":"additional","affiliation":[{"name":"Kexin Technology, Beijing 100012, China"}]},{"given":"Ning","family":"Li","sequence":"additional","affiliation":[{"name":"Kexin Technology, Beijing 100012, China"}]},{"given":"Jiaze","family":"Sun","sequence":"additional","affiliation":[{"name":"Kexin Technology, Beijing 100012, China"}]},{"given":"Jingfang","family":"Xu","sequence":"additional","affiliation":[{"name":"Kexin Technology, Beijing 100012, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2812-2192","authenticated-orcid":false,"given":"He","family":"Zhang","sequence":"additional","affiliation":[{"name":"Kexin Technology, Beijing 100012, China"}]},{"given":"Ruohua","family":"Xu","sequence":"additional","affiliation":[{"name":"Kexin Technology, Beijing 100012, China"}]},{"given":"Ming","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Information Technology, Deakin University, Melbourne, VIC 3125, Australia"}]}],"member":"1968","published-online":{"date-parts":[[2025,7,5]]},"reference":[{"key":"ref_1","first-page":"1","article-title":"Lima: Less is More for Alignment","volume":"Volume 36","author":"Zhou","year":"2023","journal-title":"Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Li, M., Zhang, Y., Li, Z., Chen, J., Chen, L., Cheng, N., Wang, J., Zhou, T., and Xiao, J. (2023). From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning. arXiv.","DOI":"10.18653\/v1\/2024.naacl-long.421"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Dong, G., Yuan, H., Lu, K., Li, C., Xue, M., Liu, D., Wang, W., Yuan, Z., Zhou, C., and Zhou, J. (2023). How Abilities in Large Language Models Are Affected by Supervised Fine-tuning Data Composition. arXiv.","DOI":"10.18653\/v1\/2024.acl-long.12"},{"key":"ref_4","unstructured":"Chen, J., Chen, Z., Wang, J., Zhou, K., Zhu, Y., Jiang, J., Min, Y., Zhao, W.X., Dou, Z., and Mao, J. (2024). Towards Effective and Efficient Continual Pre-training of Large Language Models. arXiv."},{"key":"ref_5","unstructured":"Lu, K., Yuan, H., Yuan, Z., Lin, R., Lin, J., Tan, C., Zhou, C., and Zhou, J. (2024, January 7\u201311). #instag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models. Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria."},{"key":"ref_6","unstructured":"Liu, W., Zeng, W., He, K., Jiang, Y., and He, J. (2023). What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Li, M., Zhang, Y., He, S., Li, Z., Zhao, H., Wang, J., Cheng, N., and Zhou, T. (2024). Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning. arXiv.","DOI":"10.18653\/v1\/2024.acl-long.769"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Gu, J., Yang, Z., Ding, C., Zhao, R., and Tan, F. (2024). CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models. arXiv.","DOI":"10.18653\/v1\/2024.emnlp-main.903"},{"key":"ref_9","unstructured":"Cao, Y., Kang, Y., Wang, C., and Sun, L. (2023). Instruction Mining: Instruction Data Selection for Tuning Large Language Models. arXiv."},{"key":"ref_10","unstructured":"Chen, L., Li, S., Yan, J., Wang, H., Gunaratna, K., Yadav, V., Tang, Z., Srinivasan, V., Zhou, T., and Huang, H. (2023). Alpagasus: Training a Better Alpaca with Fewer Data. arXiv."},{"key":"ref_11","unstructured":"Du, Q., Zong, C., and Zhang, J. (2023). Mods: Model-oriented Data Selection for Instruction Tuning. arXiv."},{"key":"ref_12","unstructured":"Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T.L., and Raja, A. (2021). Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv."},{"key":"ref_13","unstructured":"Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., and Le, Q.V. (2021). Finetuned Language Models Are Zero-Shot Learners. arXiv."},{"key":"ref_14","unstructured":"Xia, M., Malladi, S., Gururangan, S., Arora, S., and Chen, D. (2024). Less: Selecting Influential Data for Targeted Instruction Tuning. arXiv."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Ivison, H., Smith, N.A., Hajishirzi, H., and Dasigi, P. (2022). Data-efficient Finetuning Using Cross-task Nearest Neighbors. arXiv.","DOI":"10.18653\/v1\/2023.findings-acl.576"},{"key":"ref_16","unstructured":"Jang, J., Kim, S., Ye, S., Kim, D., Logeswaran, L., Lee, M., Lee, K., and Seo, M. (July, January 30). Exploring the Benefits of Training Expert Language Models over Instruction Tuning. Proceedings of the International Conference on Machine Learning, PMLR, Edmonton, AB, Canada."},{"key":"ref_17","first-page":"74764","article-title":"How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources","volume":"Volume 36","author":"Wang","year":"2023","journal-title":"Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/7\/264\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:05:11Z","timestamp":1760033111000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/7\/264"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,5]]},"references-count":17,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2025,7]]}},"alternative-id":["computers14070264"],"URL":"https:\/\/doi.org\/10.3390\/computers14070264","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,7,5]]}}}