{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T18:23:02Z","timestamp":1772302982186,"version":"3.50.1"},"reference-count":20,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2024,11,12]],"date-time":"2024-11-12T00:00:00Z","timestamp":1731369600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"State Scholarships Foundation (IKY), Greece"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Recent studies have shown that, due to redundancy, some heads of the Transformer model can be pruned without diminishing the efficiency of the model. In this paper, we propose a constrained optimization algorithm based on Hebbian learning, which trains specific layers in the Transformer architecture in order to enforce diversification between the different heads in the multi-head attention module. The diversification of the heads is achieved through a single-layer feed-forward neural network that is added to the Transformer architecture and is trained with the proposed algorithm. We utilize the algorithm in three different architectural variations of the baseline Transformer model. In addition to the diversification of the heads, the proposed methodology can be used to prune the heads that capture redundant information. Experiments on diverse NLP tasks, including machine translation, text summarization, question answering and large language modeling, show that our proposed approach consistently improves the performance of baseline Transformer models.<\/jats:p>","DOI":"10.3390\/make6040126","type":"journal-article","created":{"date-parts":[[2024,11,12]],"date-time":"2024-11-12T09:36:51Z","timestamp":1731404211000},"page":"2618-2638","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Diversifying Multi-Head Attention in the Transformer Model"],"prefix":"10.3390","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7238-4731","authenticated-orcid":false,"given":"Nicholas","family":"Ampazis","sequence":"first","affiliation":[{"name":"Department of Financial and Management Engineering, University of the Aegean, 82100 Chios, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6737-1661","authenticated-orcid":false,"given":"Flora","family":"Sakketou","sequence":"additional","affiliation":[{"name":"Department of Financial and Management Engineering, University of the Aegean, 82100 Chios, Greece"}]}],"member":"1968","published-online":{"date-parts":[[2024,11,12]]},"reference":[{"key":"ref_1","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2023). Attention Is All You Need. arXiv."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Raganato, A., and Tiedemann, J. (2018, January 1). An Analysis of Encoder Representations in Transformer-Based Machine Translation. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium.","DOI":"10.18653\/v1\/W18-5431"},{"key":"ref_3","unstructured":"Michel, P., Levy, O., and Neubig, G. (2019). Are Sixteen Heads Really Better than One?. arXiv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. (2019). 
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. arXiv.","DOI":"10.18653\/v1\/P19-1580"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Raganato, A., Scherrer, Y., and Tiedemann, J. (2020). Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation. arXiv.","DOI":"10.18653\/v1\/2020.findings-emnlp.49"},{"key":"ref_6","unstructured":"Tay, Y., Bahri, D., Metzler, D., Juan, D.C., Zhao, Z., and Zheng, C. (2020). Synthesizer: Rethinking Self-Attention in Transformer Models. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Peng, H., Schwartz, R., Li, D., and Smith, N.A. (2020, January 5\u201310). A Mixture of h - 1 Heads is Better than h Heads. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online.","DOI":"10.18653\/v1\/2020.acl-main.587"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1162\/neco.1991.3.1.79","article-title":"Adaptive Mixtures of Local Experts","volume":"3","author":"Jacobs","year":"1991","journal-title":"Neural Comput."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1016\/S0893-6080(98)00103-8","article-title":"Dynamics of Multilayer Networks in the Vicinity of Temporary Minima","volume":"12","author":"Ampazis","year":"1999","journal-title":"Neural Netw."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"459","DOI":"10.1016\/0893-6080(89)90044-0","article-title":"Optimal Unsupervised Learning in a Single-Layer Linear Feedforward Neural Network","volume":"2","author":"Sanger","year":"1989","journal-title":"Neural Netw."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"927","DOI":"10.1016\/S0893-6080(05)80089-9","article-title":"Principal Components, Minor Components, and Linear Neural Networks","volume":"5","author":"Oja","year":"1992","journal-title":"Neural Netw."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1420","DOI":"10.1109\/72.471365","article-title":"An Efficient Constrained Training Algorithm for Feedforward Networks","volume":"6","author":"Karras","year":"1995","journal-title":"IEEE Trans. Neural Netw."},{"key":"ref_13","unstructured":"Ioffe, S., and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"James, G., Witten, D., Hastie, T., and Tibshirani, R. (2014). An Introduction to Statistical Learning: With Applications in R, Springer.","DOI":"10.1007\/978-1-4614-7138-7"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Elliott, D., Frank, S., Sima\u2019an, K., and Specia, L. (2016). Multi30K: Multilingual English-German Image Descriptions. arXiv.","DOI":"10.18653\/v1\/W16-3210"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Narayan, S., Cohen, S.B., and Lapata, M. (November, January 31). Don\u2019t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.","DOI":"10.18653\/v1\/D18-1206"},{"key":"ref_17","unstructured":"Su, J., Duh, K., and Carreras, X. (2016, January 1\u20135). SQuAD: 100,000+ Questions for Machine Comprehension of Text. 
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA."},{"key":"ref_18","unstructured":"Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J.R., Hestness, J., and Dey, N. (2023, June 09). SlimPajama: A 627B Token Cleaned and Deduplicated Version of RedPajama. Available online: https:\/\/cerebras.ai\/blog\/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama."},{"key":"ref_19","unstructured":"Ostapenko, O., Lesort, T., Rodriguez, P., Arefin, M.R., Douillard, A., Rish, I., and Charlin, L. (2022, January 22\u201324). Continual learning with foundation models: An empirical study of latent replay. Proceedings of the Conference on Lifelong Learning Agents, PMLR, Montr\u00e9al, QC, Canada."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., and Phang, J. (2022). GPT-NeoX-20B: An Open-Source Autoregressive Language Model. arXiv.","DOI":"10.18653\/v1\/2022.bigscience-1.9"}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/6\/4\/126\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:30:52Z","timestamp":1760113852000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/6\/4\/126"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,12]]},"references-count":20,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,12]]}},"alternative-id":["make6040126"],"URL":"https:\/\/doi.org\/10.3390\/make6040126","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,12]]}}}
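
Note: the abstract in the record above describes adding a single-layer feed-forward network to the Transformer and training it with a Hebbian-learning-based constrained optimization rule so that attention heads become decorrelated, with redundant heads then prunable. The paper's actual algorithm is not reproduced here; as a rough, non-authoritative illustration of the kind of Hebbian decorrelation the record's references point to (Sanger's generalized Hebbian rule, ref_10; Oja's analysis, ref_11), the NumPy sketch below trains a small linear layer on simulated concatenated head outputs and shows its rows becoming roughly orthogonal. All dimensions, names, and the simulated data are assumptions made purely for this example.

# Illustrative sketch only: Sanger's generalized Hebbian rule applied to a
# single linear layer placed on top of (simulated) concatenated attention-head
# outputs. Not the paper's constrained optimization algorithm; every shape and
# name here is an assumption for the example.
import numpy as np

rng = np.random.default_rng(0)

n_heads, d_head = 4, 8
d_model = n_heads * d_head      # size of the concatenated head outputs
n_components = 8                # rows of the added single-layer network

def sample_head_outputs(batch):
    # Simulated head outputs; the last head copies the first plus noise,
    # mimicking the redundancy between heads discussed in the abstract.
    h = rng.normal(size=(batch, n_heads, d_head))
    h[:, -1] = h[:, 0] + 0.05 * rng.normal(size=(batch, d_head))
    return h.reshape(batch, d_model)

W = 0.01 * rng.normal(size=(n_components, d_model))  # weights of the added layer
eta = 1e-3

for step in range(2000):
    x = sample_head_outputs(batch=64)
    x = x - x.mean(axis=0, keepdims=True)             # zero-mean inputs
    for xi in x:                                      # per-sample Hebbian update
        y = W @ xi
        # Sanger's rule: Hebbian term minus a lower-triangular decorrelation
        # term, which drives the rows of W toward orthonormal principal
        # directions (i.e., mutually "diversified" projections).
        W += eta * (np.outer(y, xi) - np.tril(np.outer(y, y)) @ W)

# After training, the Gram matrix W W^T should be close to the identity:
# small off-diagonal entries indicate decorrelated (diverse) output units.
gram = W @ W.T
off_diag = gram - np.diag(np.diag(gram))
print("max |off-diagonal| of W W^T:", float(np.abs(off_diag).max()))
print("row norms:", np.round(np.linalg.norm(W, axis=1), 3))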