{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,27]],"date-time":"2025-06-27T09:10:01Z","timestamp":1751015401648,"version":"3.41.0"},"reference-count":51,"publisher":"Springer Science and Business Media LLC","issue":"19","license":[{"start":{"date-parts":[[2025,4,29]],"date-time":"2025-04-29T00:00:00Z","timestamp":1745884800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,4,29]],"date-time":"2025-04-29T00:00:00Z","timestamp":1745884800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100015732","name":"Bahcesehir University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100015732","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Comput &amp; Applic"],"published-print":{"date-parts":[[2025,7]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>The attention mechanism is the primary component of the transformer architecture; it has led to significant advancements in deep learning spanning many domains and covering multiple tasks. In computer vision, the attention mechanism was first incorporated in the vision transformer (ViT), and then its usage has expanded into many tasks in the vision domain, such as classification, segmentation, object detection, and image generation. While the attention mechanism is very expressive and capable, it comes with the disadvantage of being computationally expensive and requiring datasets of considerable size for effective optimization. To address these shortcomings, many designs have been proposed in the literature to reduce the computational burden and alleviate the data size requirements. 
Examples of such attempts in the vision domain are the MLP-Mixer, the Conv-Mixer, the Perceiver-IO, and many more attempts with different sets of advantages and disadvantages. This paper introduces a new computational block as an alternative to the standard ViT block. The newly proposed block reduces the computational requirements by replacing the normal attention layers with a network in network structure, therefore enhancing the static approach of the MLP-Mixer with a dynamic learning of element-wise gating function generated by a token-mixing process. Extensive experimentation shows that the proposed design provides better performance than the baseline architectures on multiple datasets applied in the image classification task of the vision domain. <\/jats:p>","DOI":"10.1007\/s00521-025-11226-1","type":"journal-article","created":{"date-parts":[[2025,4,29]],"date-time":"2025-04-29T15:36:20Z","timestamp":1745940980000},"page":"13411-13428","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["NiNformer: a network in network transformer with token mixing\u00a0generated gating function"],"prefix":"10.1007","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1757-0785","authenticated-orcid":false,"given":"Abdullah Nazhat","family":"Abdullah","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2018-405X","authenticated-orcid":false,"given":"Tarkan","family":"Aydin","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,4,29]]},"reference":[{"key":"11226_CR1","first-page":"5998","volume":"30","author":"A Vaswani","year":"2017","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. 
Adv Neural Inf Process Syst 30:5998\u20136008","journal-title":"Adv Neural Inf Process Syst"},{"key":"11226_CR2","first-page":"1877","volume":"33","author":"TB Brown","year":"2020","unstructured":"Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Gretchen Krueger TJ, Henighan RC, Ramesh A, Ziegler DM, Jeff W, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D (2020) Language Models are Few-Shot Learners. Adv Neural Inf Process Syst 33:1877\u20131901","journal-title":"Adv Neural Inf Process Syst"},{"key":"11226_CR3","unstructured":"Radford A, Narasimhan K (2018) Improving language understanding by generative pre-training"},{"key":"11226_CR4","unstructured":"Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, Rozi\u00e8re B, Goyal N, Hambro E, Azhar F, Rodriguez A (2023) LLaMA: open and efficient foundation language models. arXiv:abs\/2302.13971"},{"key":"11226_CR5","unstructured":"Penedo G, Malartic Q, Hesslow D, Cojocaru R, Cappelli A, Alobeidli H, Pannier B, Almazrouei E, Launay J (2023) The RefinedWeb dataset for falcon LLM: outperforming curated corpora with web data, and web data only. arXiv:abs\/2306.01116"},{"key":"11226_CR6","unstructured":"Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, Casas DL, Bressand F, Lengyel G, Lample G, Saulnier L, Lavaud LR, Lachaux MA, Stock P, Scao TL, Lavril T, Wang T, Lacroix T, Sayed William El (2023) Mistral 7B. arXiv:abs\/2310.06825"},{"key":"11226_CR7","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J (2020) An image is worth 16x16 words: transformers for image recognition at scale. 
arXiv:abs\/2010.11929"},{"key":"11226_CR8","unstructured":"Tolstikhin IO, Houlsby N, Kolesnikov A, Beyer L, Zhai X, Unterthiner T, Yung J, Steiner A, Keysers D, Uszkoreit J, Lucic M (2021) MLP-mixer: an all-MLP architecture for vision. Proceedings of the 35th international conference on neural information processing systems, pp 24261-24272"},{"key":"11226_CR9","unstructured":"Trockman A, Kolter JZ (2022) Patches Are All You Need?, Trans Mach Learn Res 2023"},{"key":"11226_CR10","doi-asserted-by":"crossref","unstructured":"Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin Transformer: hierarchical Vision Transformer using Shifted Windows. 2021 IEEE\/CVF international conference on computer vision (ICCV), pp 9992-10002","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"11226_CR11","doi-asserted-by":"crossref","unstructured":"Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. ECCV 2020, pp 213-229","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"11226_CR12","unstructured":"Jaegle A, Borgeaud S, Alayrac JB, Doersch C, Ionescu C, Ding D, Koppula S, Brock A, Shelhamer E, H\u00e9naff OJ, Botvinick MM, Andrew Z, Vinyals O, Carreira J (2021) Perceiver IO: a general architecture for structured inputs & outputs (ICLR) arXiv:abs\/2107.14795"},{"key":"11226_CR13","unstructured":"Lu J, Clark C, Zellers R, Mottaghi R, Kembhavi A (2022) Unified-IO: a unified model for vision, language, and multi-modal tasks (ICLR) arXiv:abs\/2206.08916"},{"key":"11226_CR14","unstructured":"Zhang H, Li F, Liu S, Zhang L, Su H, Zhu J, Ni LM, Shum HY (2022) DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv: abs\/2203.03605"},{"key":"11226_CR15","doi-asserted-by":"crossref","unstructured":"Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo WY, Doll\u00e1r P (2023) Segment Anything. 
2023 IEEE\/CVF international conference on computer vision (ICCV), pp 4015-4026","DOI":"10.1109\/ICCV51070.2023.00371"},{"key":"11226_CR16","unstructured":"Wang S, Li BZ, Khabsa M, Fang H, Ma H (2020) Linformer: self-attention with linear complexity. arXiv:abs\/2006.04768"},{"key":"11226_CR17","doi-asserted-by":"crossref","unstructured":"Lee-Thorp J, Ainslie J, Eckstein I, Ontanon S (2021) FNet: mixing tokens with fourier transforms. Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: human language technologies, pp 4296-4313","DOI":"10.18653\/v1\/2022.naacl-main.319"},{"key":"11226_CR18","doi-asserted-by":"crossref","unstructured":"Li Y, Zhang K, Cao J, Timofte R, Magno M, Benini L, Van Gool L (2021) LocalViT: analyzing locality in vision transformers. 2023 IEEE\/RSJ international conference on intelligent robots and systems (IROS), pp 9598-9605","DOI":"10.1109\/IROS55552.2023.10342025"},{"key":"11226_CR19","doi-asserted-by":"crossref","unstructured":"Tu Z, Talebi H, Zhang H, Yang F, Milanfar P, Bovik A, Li Y (2022) Maxvit: multi-axis vision transformer, European conference on computer vision, pp 459-479","DOI":"10.1007\/978-3-031-20053-3_27"},{"issue":"16","key":"11226_CR20","doi-asserted-by":"publisher","first-page":"14138","DOI":"10.1609\/aaai.v35i16.17664","volume":"35","author":"X Yunyang","year":"2021","unstructured":"Yunyang X, Zhanpeng Z, Rudrasis C, Mingxing T, Moo FG, Yin L, Vikas S (2021) Nystr\u00f6mformer: A Nystr\u00f6m-Based Algorithm for Approximating Self-Attention. Proceedings of the AAAI Conference on Artificial Intelligence 35(16):14138\u201314148","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"11226_CR21","unstructured":"Keles FD, Wijewardena PM, Hegde C (2022) On the computational complexity of self-attention. 
International conference on algorithmic learning theory, pp 597-619"},{"key":"11226_CR22","unstructured":"Lin M, Chen Q, Yan S (2013) Network in network. arXiv:abs\/1312.4400"},{"key":"11226_CR23","doi-asserted-by":"publisher","first-page":"111","DOI":"10.1016\/j.aiopen.2022.10.001","volume":"3","author":"T Lin","year":"2021","unstructured":"Lin T, Wang Y, Liu X, Qiu X (2021) A Survey of Transformers. AI Open 3:111\u2013132","journal-title":"AI Open"},{"key":"11226_CR24","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3530811","volume":"55","author":"Y Tay","year":"2020","unstructured":"Tay Y, Dehghani M, Bahri D, Metzler D (2020) Efficient transformers: a survey. ACM Comput Surv 55:1\u201328","journal-title":"ACM Comput Surv"},{"key":"11226_CR25","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3586074","volume":"55","author":"Q Fournier","year":"2021","unstructured":"Fournier Q, Caron GM, Aloise D (2021) A Practical Survey on Faster and Lighter Transformers. ACM Comput Surv 55:1\u201340","journal-title":"ACM Comput Surv"},{"key":"11226_CR26","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3505244","volume":"54","author":"S Khan","year":"2021","unstructured":"Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M (2021) Transformers in Vision: A Survey. ACM Computing Surveys (CSUR) 54:1\u201341","journal-title":"ACM Computing Surveys (CSUR)"},{"key":"11226_CR27","doi-asserted-by":"crossref","unstructured":"Guo Q, Qiu X, Liu P, Shao Y, Xue X, Zhang Z (2019) Star-transformer. Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies 1, pp 1315-1325","DOI":"10.18653\/v1\/N19-1133"},{"key":"11226_CR28","unstructured":"Beltagy I, Peters ME, Cohan A (2020) Longformer: the long-document transformer. arXiv:abs\/2004.05150"},{"key":"11226_CR29","unstructured":"Kitaev N, Kaiser \u0141, Levskaya A (2020) Reformer: the efficient transformer. 
arXiv: abs\/2001.04451"},{"key":"11226_CR30","unstructured":"Zaheer M, Guruganesh G, Dubey KA, Ainslie J, Alberti C, Ontanon S, Pham P, Ravula A, Wang Q, Yang L, Ahmed A (2020) Big Bird: transformers for longer sequences. Proceedings of the 34th international conference on neural information processing systems, pp 17283-17297"},{"key":"11226_CR31","unstructured":"Katharopoulos A, Vyas A, Pappas N, Fleuret F (2020) Transformers are RNNs: Fast autoregressive transformers with linear attention. International Conference on Machine Learning, pp 5156-5165"},{"key":"11226_CR32","unstructured":"Choromanski K, Likhosherstov V, Dohan D, Song X, Gane A, Sarl\u00f3s T, Hawkins P, Davis J, Mohiuddin A, Kaiser L, Belanger D, Colwell LJ, Weller A (2020) Rethinking attention with performers. arXiv:abs\/2009.14794"},{"key":"11226_CR33","unstructured":"Tay Y, Bahri D, Yang L, Metzler D, Juan DC (2020) Sparse sinkhorn attention. International conference on machine learning, pp 9438-9447"},{"key":"11226_CR34","unstructured":"Wang C, Ye Z, Zhang A, Zhang Z, Smola A (2020) Transformer on a Diet. arXiv: abs\/2002.06170"},{"key":"11226_CR35","unstructured":"Li S, Jin X, Xuan Y, Zhou X, Chen W, Wang YX, Yan X (2019) Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Proceedings of the 33rd international conference on neural information processing systems, pp 5243-5253"},{"key":"11226_CR36","first-page":"2555","volume":"2020","author":"J Qiu","year":"2019","unstructured":"Qiu J, Ma H, Levy O, Yih S, Wang S, Tang J (2019) Blockwise self-attention for long document understanding. 
Findings of the Association for Computational Linguistics: EMNLP 2020:2555\u20132565","journal-title":"Findings of the Association for Computational Linguistics: EMNLP"},{"key":"11226_CR37","doi-asserted-by":"crossref","unstructured":"Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R (2019) Transformer-XL: attentive language models beyond a fixed-length context. Proceedings of the 57th annual meeting of the association for computational linguistics, pp 2978-2988","DOI":"10.18653\/v1\/P19-1285"},{"key":"11226_CR38","unstructured":"Vyas A, Katharopoulos A, Fleuret F (2020) Fast transformers with clustered attention. Proceedings of the 34th international conference on neural information processing systems, pp 21665-21674"},{"key":"11226_CR39","unstructured":"Zhang H, Gong Y, Shen Y, Li W, Lv J, Duan N, Chen W (2021) Poolingformer: long document modeling with pooling attention. arXiv:abs\/2105.04371"},{"key":"11226_CR40","unstructured":"Liu PJ, Saleh M, Pot E, Goodrich B, Sepassi R, Kaiser L, Shazeer N (2018) Generating wikipedia by summarizing long sequences. arXiv:abs\/1801.10198"},{"key":"11226_CR41","unstructured":"Dai Z, Lai G, Yang Y, Le QV (2020) Funnel-transformer: filtering out sequential redundancy for efficient language processing. Proceedings of the 34th international conference on neural information processing systems, pp 4271-4282"},{"key":"11226_CR42","unstructured":"Ho J, Kalchbrenner N, Weissenborn D, Salimans T (2019) Axial attention in multidimensional transformers"},{"key":"11226_CR43","first-page":"9204","volume":"34","author":"H Liu","year":"2021","unstructured":"Liu H, Dai Z, So D, Quoc VL (2021) Pay attention to mlps. Adv Neural Inf Process Syst 34:9204\u20139215","journal-title":"Adv Neural Inf Process Syst"},{"key":"11226_CR44","unstructured":"Tay Y, Bahri D, Metzler D, Juan DC, Zhao Z, Zheng C (2020) Synthesizer: rethinking self-attention for transformer models. 
International conference on machine learning, pp 10183-10192"},{"key":"11226_CR45","first-page":"15908","volume":"34","author":"K Han","year":"2021","unstructured":"Han K, Xiao A, Enhua W, Guo J, Chunjing X, Wang Y (2021) Transformer in transformer. Adv Neural Inf Process Syst 34:15908\u201315919","journal-title":"Adv Neural Inf Process Syst"},{"key":"11226_CR46","unstructured":"De S, Smith SL, Fernando A, Botev A, Cristian-Muraru G, Gu A, Haroun R, Berrada L, Chen Y, Srinivasan S, Desjardins G, Doucet A, Budden D, Teh YW, Pascanu R, de Freitas N, Gulcehre C (2024) Griffin: mixing gated linear recurrences with local attention for efficient language models. arXiv:abs\/2402.19427"},{"key":"11226_CR47","first-page":"933","volume":"70","author":"Y Dauphin","year":"2016","unstructured":"Dauphin Y, Fan A, Auli M, Grangier D (2016) Language Modeling with Gated Convolutional Networks. International Conference on Machine Learning 70:933\u2013941","journal-title":"International Conference on Machine Learning"},{"key":"11226_CR48","unstructured":"Krizhevsky A, Nair V, Hinton G (2009) Cifar-10 and cifar-100 datasets. https:\/\/www.cs.toronto.edu\/~kriz\/cifar.html. 6(1)"},{"key":"11226_CR49","doi-asserted-by":"publisher","first-page":"2278","DOI":"10.1109\/5.726791","volume":"86","author":"Y LeCun","year":"1998","unstructured":"LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86:2278\u20132324","journal-title":"Proc IEEE"},{"key":"11226_CR50","unstructured":"Steiner A, Kolesnikov A, Zhai X, Wightman R, Uszkoreit J, Beyer L (2021) How to train your vit? data, augmentation, and regularization in vision transformers arXiv:abs\/2106.10270"},{"key":"11226_CR51","doi-asserted-by":"crossref","unstructured":"Lee S, Phanishayee A, Mahajan D (2024) Forecasting GPU performance for deep learning training and inference. 
International conference on architectural support for programming languages and operating systems, pp 493-508","DOI":"10.1145\/3669940.3707265"}],"container-title":["Neural Computing and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-025-11226-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00521-025-11226-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00521-025-11226-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,27]],"date-time":"2025-06-27T08:28:20Z","timestamp":1751012900000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00521-025-11226-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,29]]},"references-count":51,"journal-issue":{"issue":"19","published-print":{"date-parts":[[2025,7]]}},"alternative-id":["11226"],"URL":"https:\/\/doi.org\/10.1007\/s00521-025-11226-1","relation":{},"ISSN":["0941-0643","1433-3058"],"issn-type":[{"type":"print","value":"0941-0643"},{"type":"electronic","value":"1433-3058"}],"subject":[],"published":{"date-parts":[[2025,4,29]]},"assertion":[{"value":"15 March 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 March 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 April 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that there is no conflict 
of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}