{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T15:17:31Z","timestamp":1777562251910,"version":"3.51.4"},"reference-count":42,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2024,2,16]],"date-time":"2024-02-16T00:00:00Z","timestamp":1708041600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,2,16]],"date-time":"2024-02-16T00:00:00Z","timestamp":1708041600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["52275091"],"award-info":[{"award-number":["52275091"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100005047","name":"Natural Science Foundation of Liaoning Province","doi-asserted-by":"publisher","award":["2022-MS-125"],"award-info":[{"award-number":["2022-MS-125"]}],"id":[{"id":"10.13039\/501100005047","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"publisher","award":["N2303011"],"award-info":[{"award-number":["N2303011"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Neural Process Lett"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>In the transformer architecture, as self-attention reads entire image patches at once, the context of the sequence between patches is omitted. Therefore, the position embedding method is employed to assist the self-attention layers in computing the ordering information of tokens. 
While many papers simply add the position vector to the corresponding token vector rather than concatenating them, few papers offer a thorough explanation and comparison beyond dimension reduction. However, the addition method is not meaningful because token vectors and position vectors are different physical quantities that cannot be directly combined through addition. Hence, we investigate the disparity in learnable absolute position information between the two embedding methods (concatenation and addition) and compare their performance on models. Experiments demonstrate that the concatenation method can learn more spatial information (such as horizontal, vertical, and angle) than the addition method. Furthermore, it reduces the attention distance in the final few layers. Moreover, the concatenation method exhibits greater robustness and leads to a performance gain of 0.1\u20130.5% for existing models without additional computation overhead.<\/jats:p>","DOI":"10.1007\/s11063-024-11539-7","type":"journal-article","created":{"date-parts":[[2024,2,16]],"date-time":"2024-02-16T00:02:12Z","timestamp":1708041732000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["Rethinking Position Embedding Methods in the Transformer Architecture"],"prefix":"10.1007","volume":"56","author":[{"given":"Xin","family":"Zhou","sequence":"first","affiliation":[]},{"given":"Zhaohui","family":"Ren","sequence":"additional","affiliation":[]},{"given":"Shihua","family":"Zhou","sequence":"additional","affiliation":[]},{"given":"Zeyu","family":"Jiang","sequence":"additional","affiliation":[]},{"given":"TianZhuang","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Hengfa","family":"Luo","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,2,16]]},"reference":[{"key":"11539_CR1","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, 
Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems (NeurIPS), vol 30"},{"key":"11539_CR2","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp 4171\u20134186"},{"key":"11539_CR3","doi-asserted-by":"crossref","unstructured":"Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R (2019) Transformer-xl: attentive language models beyond a fixed-length context. In: ACL, vol 1","DOI":"10.18653\/v1\/P19-1285"},{"key":"11539_CR4","unstructured":"Yan H, Deng B, Li X, Qiu X (2019) TENER: adapting transformer encoder for named entity recognition. arXiv:1911.04474"},{"key":"11539_CR5","unstructured":"Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Houlsby N (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929"},{"key":"11539_CR6","doi-asserted-by":"crossref","unstructured":"Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision (ECCV), pp 213\u2013229","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"11539_CR7","doi-asserted-by":"crossref","unstructured":"Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 10012\u201310022","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"11539_CR8","doi-asserted-by":"crossref","unstructured":"Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, Zhang L (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. 
In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 6881\u20136890","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"11539_CR9","doi-asserted-by":"crossref","unstructured":"Wang W, Xie E, Li X, Fan DP, Song K, Liang D, Shao L (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE\/CVF international conference on computer vision (ICCV), pp 568\u2013578","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"11539_CR10","unstructured":"Chu X, Tian Z, Wang Y, Zhang B, Ren H, Wei X, Shen C (2021) Twins: revisiting the design of spatial attention in vision transformers. In: Advances in neural information processing systems (NeurIPS), vol 34, pp 9355\u20139366"},{"key":"11539_CR11","doi-asserted-by":"crossref","unstructured":"Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, Zhang L (2021) Cvt: introducing convolutions to vision transformers. In: Proceedings of the IEEE\/CVF international conference on computer vision (ICCV), pp 22\u201331","DOI":"10.1109\/ICCV48922.2021.00009"},{"key":"11539_CR12","unstructured":"Ke G, He D, Liu TY (2020) Rethinking positional encoding in language pre-training. arXiv:2006.15595"},{"key":"11539_CR13","unstructured":"Su J, Lu Y, Pan S, Wen B, Liu Y (2021) RoFormer: enhanced transformer with rotary position embedding. arXiv:2104.09864"},{"key":"11539_CR14","doi-asserted-by":"crossref","unstructured":"Shaw P, Uszkoreit J, Vaswani A (2018) Self-attention with relative position representations. arXiv:1803.02155","DOI":"10.18653\/v1\/N18-2074"},{"key":"11539_CR15","doi-asserted-by":"crossref","unstructured":"Wu K, Peng H, Chen M, Fu J, Chao H (2021) Rethinking and improving relative position encoding for vision transformer. 
In: Proceedings of the IEEE\/CVF international conference on computer vision (ICCV), pp 10033\u201310041","DOI":"10.1109\/ICCV48922.2021.00988"},{"key":"11539_CR16","doi-asserted-by":"crossref","unstructured":"Bowers BJ, Schatzman L (2021) Dimensional analysis. In: Developing grounded theory. Routledge, New York, pp 111\u2013129","DOI":"10.4324\/9781315169170-10"},{"key":"11539_CR17","doi-asserted-by":"crossref","unstructured":"Meng D, Chen X, Fan Z, Zeng G, Li H, Yuan Y, Sun L, Wang J (2021) Conditional detr for fast training convergence. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 3651\u20133660","DOI":"10.1109\/ICCV48922.2021.00363"},{"key":"11539_CR18","first-page":"6531","volume":"35","author":"S Shi","year":"2022","unstructured":"Shi S, Jiang L, Dai D, Schiele B (2022) Motion transformer with global intention localization and local movement refinement. Adv Neural Inf Process Syst 35:6531\u20136543","journal-title":"Adv Neural Inf Process Syst"},{"key":"11539_CR19","doi-asserted-by":"crossref","unstructured":"Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. IEEE Comput Vis Pattern Recognit (CVPR) 248\u2013255","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"11539_CR20","doi-asserted-by":"crossref","unstructured":"Sun C, Shrivastava A, Singh S, Gupta A (2017) Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE international conference on computer vision, pp 843\u2013852","DOI":"10.1109\/ICCV.2017.97"},{"key":"11539_CR21","doi-asserted-by":"crossref","unstructured":"Kolesnikov A, Beyer L, Zhai X, Puigcerver J, Yung J, Gelly S, Houlsby N (2020) Big transfer (bit): general visual representation learning. 
In: European conference on computer vision (ECCV), pp 491\u2013507","DOI":"10.1007\/978-3-030-58558-7_29"},{"key":"11539_CR22","doi-asserted-by":"crossref","unstructured":"Xie Q, Luong MT, Hovy E, Le QV (2020) Self-training with noisy student improves imagenet classification. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 10687\u201310698","DOI":"10.1109\/CVPR42600.2020.01070"},{"key":"11539_CR23","unstructured":"Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, J\u00e9gou H (2021) Training data-efficient image transformers and distillation through attention. In: International conference on machine learning (PMLR), pp 10347\u201310357"},{"key":"11539_CR24","doi-asserted-by":"crossref","unstructured":"Touvron H, Cord M, Sablayrolles A, Synnaeve G, J\u00e9gou H (2021) Going deeper with image transformers. In: Proceedings of the IEEE\/CVF international conference on computer vision, pp 32\u201342","DOI":"10.1109\/ICCV48922.2021.00010"},{"key":"11539_CR25","doi-asserted-by":"crossref","unstructured":"d\u2019Ascoli S, Touvron H, Leavitt ML, Morcos AS, Biroli G, Sagun L (2021) Convit: Improving vision transformers with soft convolutional inductive biases. In: International conference on machine learning, pp 2286\u20132296","DOI":"10.1088\/1742-5468\/ac9830"},{"key":"11539_CR26","doi-asserted-by":"crossref","unstructured":"Tu Z, Talebi H, Zhang H, Yang F, Milanfar P, Bovik A, Li Y (2022) Maxvit: multi-axis vision transformer. In: European conference on computer vision, pp 459\u2013479","DOI":"10.1007\/978-3-031-20053-3_27"},{"key":"11539_CR27","doi-asserted-by":"crossref","unstructured":"Lin TY, Doll\u00e1r P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. 
In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition (CVPR), pp 2117\u20132125","DOI":"10.1109\/CVPR.2017.106"},{"key":"11539_CR28","unstructured":"Li Y, Zhang K, Cao J, Timofte R, Van\u00a0Gool L (2021) Localvit: Bringing locality to vision transformers . arXiv:2104.05707"},{"key":"11539_CR29","doi-asserted-by":"crossref","unstructured":"Dong X, Bao J, Chen D, Zhang W, Yu N, Yuan L, Chen D, Guo B (2022) Cswin transformer: A general vision transformer backbone with cross-shaped windows. In: Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition, pp 12124\u201312134","DOI":"10.1109\/CVPR52688.2022.01181"},{"key":"11539_CR30","unstructured":"Lin T, Wang Y, Liu X, Qiu X (2021) A survey of transformers. arXiv:2106.04554"},{"key":"11539_CR31","doi-asserted-by":"publisher","DOI":"10.1017\/9781108231596","volume-title":"High-dimensional probability: an introduction with applications in data science","author":"R Vershynin","year":"2018","unstructured":"Vershynin R (2018) High-dimensional probability: an introduction with applications in data science. Cambridge University Press, Cambridge"},{"key":"11539_CR32","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"11539_CR33","doi-asserted-by":"crossref","unstructured":"Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 84\u201390","DOI":"10.1145\/3065386"},{"key":"11539_CR34","unstructured":"Recht B, Roelofs R, Schmidt L, Shankar V (2018) Do cifar-10 classifiers generalize to cifar-10?. arXiv:1806.00451"},{"key":"11539_CR35","doi-asserted-by":"crossref","unstructured":"Nilsback ME, Zisserman A (2008) Automated flower classification over a large number of classes. 
In: 2008 Sixth Indian conference on computer vision, graphics and image processing, pp 722\u2013729","DOI":"10.1109\/ICVGIP.2008.47"},{"key":"11539_CR36","doi-asserted-by":"crossref","unstructured":"Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV (2012) Cats and dogs. In: 2012 IEEE conference on computer vision and pattern recognition, pp 3498\u20133505","DOI":"10.1109\/CVPR.2012.6248092"},{"key":"11539_CR37","doi-asserted-by":"crossref","unstructured":"Krause J, Stark M, Deng J, Fei-Fei L (2013) 3d object representations for fine-grained categorization. In: Proceedings of the IEEE international conference on computer vision workshops, pp 554\u2013561","DOI":"10.1109\/ICCVW.2013.77"},{"key":"11539_CR38","first-page":"6575","volume":"45","author":"L Yuan","year":"2022","unstructured":"Yuan L, Hou Q, Jiang Z, Feng J, Yan S (2022) Volo: Vision outlooker for visual recognition. IEEE Trans Pattern Anal Mach Intell 45:6575\u20136586","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"11539_CR39","first-page":"15908","volume":"34","author":"K Han","year":"2021","unstructured":"Han K, Xiao A, Wu E, Guo J, Xu C, Wang Y (2021) Transformer in transformer. Adv Neural Inf Process Syst 34:15908\u201315919","journal-title":"Adv Neural Inf Process Syst"},{"key":"11539_CR40","unstructured":"Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. arXiv:1711.05101"},{"key":"11539_CR41","unstructured":"Loshchilov I, Hutter F (2016) SGDR: stochastic gradient descent with warm restarts. arXiv:1608.03983"},{"key":"11539_CR42","doi-asserted-by":"crossref","unstructured":"Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M (2022) Transformers in vision: a survey. 
ACM Comput Surv (CSUR) 1\u201341","DOI":"10.1145\/3505244"}],"container-title":["Neural Processing Letters"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11539-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11063-024-11539-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11063-024-11539-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,16]],"date-time":"2024-05-16T16:18:49Z","timestamp":1715876329000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11063-024-11539-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,16]]},"references-count":42,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2024,4]]}},"alternative-id":["11539"],"URL":"https:\/\/doi.org\/10.1007\/s11063-024-11539-7","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-2525471\/v1","asserted-by":"object"}]},"ISSN":["1573-773X"],"issn-type":[{"value":"1573-773X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,16]]},"assertion":[{"value":"9 January 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 February 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"No conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}],"article-number":"41"}}