{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,23]],"date-time":"2025-11-23T19:08:09Z","timestamp":1763924889020,"version":"3.40.3"},"publisher-location":"Cham","reference-count":12,"publisher":"Springer Nature Switzerland","isbn-type":[{"type":"print","value":"9783031697654"},{"type":"electronic","value":"9783031697661"}],"license":[{"start":{"date-parts":[[2024,1,1]],"date-time":"2024-01-01T00:00:00Z","timestamp":1704067200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,8,26]],"date-time":"2024-08-26T00:00:00Z","timestamp":1724630400000},"content-version":"vor","delay-in-days":238,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We delve into the performance of transformer encoder inference on low-power multi-core processors from two perspectives: First, we conduct a detailed profile of the inference process for two members of the\u00a0BERT family on a modern multi-core processor, identifying the main bottlenecks and opportunities for improvement. Second, we propose a number of accumulative optimisations for their primary building blocks. For that, we elaborate our own implementation of the general matrix multiplication (GEMM), which dynamically tunes several key parameters yielding relevant performance gains for transformer encoders. 
Additionally, we introduce a number of strategies to also improve the parallel execution of the transformer block.<\/jats:p><jats:p>Our implementations for ARMv8a and RISC-V multi-core processors with SIMD units, taking as a reference state-of-the-art GEMM implementations (BLIS for ARM and OpenBLAS for RISC-V), reveal accelerations of up to <jats:inline-formula><jats:alternatives><jats:tex-math>$$2.5\\times $$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>2.5<\/mml:mn>\n                    <mml:mo>\u00d7<\/mml:mo>\n                  <\/mml:mrow>\n                <\/mml:math><\/jats:alternatives><\/jats:inline-formula> for natural language processing tasks.<\/jats:p>","DOI":"10.1007\/978-3-031-69766-1_26","type":"book-chapter","created":{"date-parts":[[2024,8,25]],"date-time":"2024-08-25T19:02:05Z","timestamp":1724612525000},"page":"377-392","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Inference with\u00a0Transformer Encoders on\u00a0ARM and\u00a0RISC-V Multicore Processors"],"prefix":"10.1007","author":[{"given":"H\u00e9ctor","family":"Mart\u00ednez","sequence":"first","affiliation":[]},{"given":"Francisco D.","family":"Igual","sequence":"additional","affiliation":[]},{"given":"Rafael","family":"Rodr\u00edguez-S\u00e1nchez","sequence":"additional","affiliation":[]},{"given":"Sandra","family":"Catal\u00e1n","sequence":"additional","affiliation":[]},{"given":"Adri\u00e1n","family":"Castell\u00f3","sequence":"additional","affiliation":[]},{"given":"Enrique S.","family":"Quintana-Ort\u00ed","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,8,26]]},"reference":[{"doi-asserted-by":"crossref","unstructured":"Alaejos, G., et\u00a0al.: Algorithm 1039: automatic generators for a family of matrix multiplication routines with Apache TVM. ACM Trans. Math. 
Softw. 50(1), 6:1\u20136:34 (2024)","key":"26_CR1","DOI":"10.1145\/3638532"},{"key":"26_CR2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2023.102990","volume":"144","author":"KT Chitty-Venkata","year":"2023","unstructured":"Chitty-Venkata, K.T., et al.: A survey of techniques for optimizing transformer inference. J. Syst. Arch. 144, 102990 (2023)","journal-title":"J. Syst. Arch."},{"unstructured":"Devlin, J., et\u00a0al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 Conference North American Chapter Association Computational Linguistics: Human Language Technology, pp. 4171\u20134186 (2019)","key":"26_CR3"},{"unstructured":"Dice, D., Kogan, A.: Optimizing inference performance of transformers on CPUs. arXiv arxiv:2102.06621 (2021)","key":"26_CR4"},{"doi-asserted-by":"crossref","unstructured":"Goto, K., van\u00a0de Geijn, R.A.: Anatomy of a high-performance matrix multiplication. ACM Trans. Math. Softw. 34(3), 12:1\u201312:25 (2008)","key":"26_CR5","DOI":"10.1145\/1356052.1356053"},{"issue":"2","key":"26_CR6","doi-asserted-by":"publisher","first-page":"48","DOI":"10.1145\/3282307","volume":"62","author":"JL Hennessy","year":"2019","unstructured":"Hennessy, J.L., Patterson, D.A.: A new golden age for computer architecture. Commun. ACM 62(2), 48\u201360 (2019)","journal-title":"Commun. ACM"},{"issue":"7","key":"26_CR7","doi-asserted-by":"publisher","first-page":"2221","DOI":"10.1109\/TPDS.2023.3280805","volume":"34","author":"J Jiang","year":"2023","unstructured":"Jiang, J., et al.: Full-stack optimizing transformer inference on ARM many-core CPU. IEEE Trans. Parallel Distrib. Syst. 34(7), 2221\u20132235 (2023)","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"unstructured":"Kim, S., et\u00a0al.: Full stack optimization of transformer inference: a survey. 
arXiv arxiv:2302.14017 (2023)","key":"26_CR8"},{"doi-asserted-by":"crossref","unstructured":"Low, T.M., et\u00a0al.: Analytical modeling is enough for high-performance BLIS. ACM Trans. Math. Softw. 43(2), 12:1\u201312:18 (2016)","key":"26_CR9","DOI":"10.1145\/2925987"},{"unstructured":"Silvano, C., et\u00a0al.: A survey on deep learning hardware accelerators for heterogeneous HPC platforms. arXiv arxiv:2306.15552 (2023)","key":"26_CR10"},{"unstructured":"Smith, T.M., van\u00a0de Geijn, R.A.: The MOMMS family of matrix multiplication algorithms. arXiv arxiv:1904.05717 (2019)","key":"26_CR11"},{"doi-asserted-by":"crossref","unstructured":"Van\u00a0Zee, F.G., van\u00a0de\u00a0Geijn, R.A.: BLIS: a framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Softw. 41(3), 14:1\u201314:33 (2015)","key":"26_CR12","DOI":"10.1145\/2764454"}],"container-title":["Lecture Notes in Computer Science","Euro-Par 2024: Parallel Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-031-69766-1_26","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,8,25]],"date-time":"2024-08-25T19:12:19Z","timestamp":1724613139000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-031-69766-1_26"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"ISBN":["9783031697654","9783031697661"],"references-count":12,"URL":"https:\/\/doi.org\/10.1007\/978-3-031-69766-1_26","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"type":"print","value":"0302-9743"},{"type":"electronic","value":"1611-3349"}],"subject":[],"published":{"date-parts":[[2024]]},"assertion":[{"value":"26 August 2024","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"Euro-Par","order":1,"name":"conference_acronym","label":"Conference 
Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"European Conference on Parallel Processing","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Madrid","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Spain","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2024","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"26 August 2024","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"30 August 2024","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"30","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"europar2024","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/2024.euro-par.org\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}