{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,15]],"date-time":"2026-01-15T09:00:24Z","timestamp":1768467624299,"version":"3.49.0"},"reference-count":29,"publisher":"Springer Science and Business Media LLC","issue":"10","license":[{"start":{"date-parts":[[2024,3,11]],"date-time":"2024-03-11T00:00:00Z","timestamp":1710115200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,3,11]],"date-time":"2024-03-11T00:00:00Z","timestamp":1710115200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Supercomput"],"published-print":{"date-parts":[[2024,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>In the past few years, Transformer-based large language models (LLM) have become the dominant technology in a series of applications. To scale up the sequence length of the Transformer, FlashAttention is proposed to compute exact attention with reduced memory requirements and faster execution. However, implementing the FlashAttention algorithm on the new generation Sunway Supercomputer faces many constraints such as the unique heterogeneous architecture and the limited memory bandwidth. This work proposes SWattention, a highly efficient method for computing the exact attention on the SW26010pro processor. To fully utilize the 6 core groups (CG) and 64 cores per CG on the processor, we design a two-level parallel task partition strategy. Asynchronous memory access is employed to ensure that memory access overlaps with computation. Additionally, a tiling strategy is introduced to determine optimal SRAM block sizes. Compared with the standard attention, SWattention achieves around 2.0x speedup for FP32 training and 2.5x speedup for mixed-precision training. 
The sequence lengths range from 1k to 8k and scale up to 16k without being out of memory. As for the end-to-end performance, SWattention achieves up to 1.26x speedup for training GPT-style models, which demonstrates that SWattention enables longer sequence length for LLM training.<\/jats:p>","DOI":"10.1007\/s11227-024-05890-8","type":"journal-article","created":{"date-parts":[[2024,3,11]],"date-time":"2024-03-11T14:12:20Z","timestamp":1710166340000},"page":"13657-13680","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["SWattention: designing fast and memory-efficient attention for a new Sunway Supercomputer"],"prefix":"10.1007","volume":"80","author":[{"given":"Ruohan","family":"Wu","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xianyu","family":"Zhu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Junshi","family":"Chen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sha","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tianyu","family":"Zheng","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xin","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hong","family":"An","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,3,11]]},"reference":[{"key":"5890_CR1","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. 
arXiv preprint arXiv:1810.04805"},{"key":"5890_CR2","first-page":"1877","volume":"33","author":"T Brown","year":"2020","unstructured":"Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A et al (2020) Language models are few-shot learners. Adv Neural Inf Process Syst 33:1877\u20131901","journal-title":"Adv Neural Inf Process Syst"},{"key":"5890_CR3","unstructured":"OpenAI R (2023) GPT-4 technical report. arXiv:2303.08774"},{"key":"5890_CR4","unstructured":"Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, Rozi\u00e8re B, Goyal N, Hambro E, Azhar F et al (2023) Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971"},{"key":"5890_CR5","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser \u0141, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems 30"},{"key":"5890_CR6","unstructured":"Tay Y, Dehghani M, Abnar S, Shen Y, Bahri D, Pham P, Rao J, Yang L, Ruder S, Metzler D (2020) Long range arena: a benchmark for efficient transformers. arXiv preprint arXiv:2011.04006"},{"key":"5890_CR7","first-page":"711","volume":"3","author":"A Ivanov","year":"2021","unstructured":"Ivanov A, Dryden N, Ben-Nun T, Li S, Hoefler T (2021) Data movement is all you need: A case study on optimizing transformers. Proc Mach Learn Syst 3:711\u2013732","journal-title":"Proc Mach Learn Syst"},{"key":"5890_CR8","first-page":"16344","volume":"35","author":"T Dao","year":"2022","unstructured":"Dao T, Fu D, Ermon S, Rudra A, R\u00e9 C (2022) Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv Neural Inf Process Syst 35:16344\u201316359","journal-title":"Adv Neural Inf Process Syst"},{"key":"5890_CR9","unstructured":"Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, Min Y, Zhang B, Zhang J, Dong Z et al (2023) A survey of large language models. 
arXiv preprint arXiv:2303.18223"},{"key":"5890_CR10","doi-asserted-by":"publisher","first-page":"224","DOI":"10.1007\/s42514-021-00072-x","volume":"3","author":"S Liu","year":"2021","unstructured":"Liu S, Gao J, Liu X, Huang Z, Zheng T (2021) Establishing high performance AI ecosystem on Sunway platform. CCF Trans High Perform Comput 3:224\u2013241","journal-title":"CCF Trans High Perform Comput"},{"issue":"11","key":"5890_CR11","doi-asserted-by":"publisher","first-page":"2846","DOI":"10.1109\/TPDS.2022.3145163","volume":"33","author":"M Li","year":"2022","unstructured":"Li M, Chen J, Xiao Q, Wang F, Jiang Q, Zhao X, Lin R, An H, Liang X, He L (2022) Bridging the gap between deep learning and frustrated quantum spin system for extreme-scale simulations on new generation of sunway supercomputer. IEEE Trans Parallel Distrib Syst 33(11):2846\u20132859","journal-title":"IEEE Trans Parallel Distrib Syst"},{"key":"5890_CR12","doi-asserted-by":"crossref","unstructured":"Ma Z, He J, Qiu J, Cao H, Wang Y, Sun Z, Zheng L, Wang H, Tang S, Zheng T et al (2022) BaGualu: targeting brain scale pretrained models with over 37 million cores. In: Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp 192\u2013204","DOI":"10.1145\/3503221.3508417"},{"key":"5890_CR13","doi-asserted-by":"crossref","unstructured":"Zhao Y, Zheng J, Fu H, Wu W, Gao J, Chen M, Zhang J, Zhang L, Dong R., Du Z et al (2023) SW-LCM: a scalable and weakly-supervised land cover mapping method on a new Sunway supercomputer. In: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, pp 657\u2013667","DOI":"10.1109\/IPDPS54959.2023.00071"},{"issue":"4","key":"5890_CR14","doi-asserted-by":"publisher","first-page":"65","DOI":"10.1145\/1498765.1498785","volume":"52","author":"S Williams","year":"2009","unstructured":"Williams S, Waterman A, Patterson D (2009) Roofline: an insightful visual performance model for multicore architectures. 
Commun ACM 52(4):65\u201376","journal-title":"Commun ACM"},{"issue":"2","key":"5890_CR15","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1109\/MM.2021.3061394","volume":"41","author":"J Choquette","year":"2021","unstructured":"Choquette J, Gandhi W, Giroux O, Stam N, Krashinsky R (2021) Nvidia a100 tensor core GPU: performance and innovation. IEEE Micro 41(2):29\u201335","journal-title":"IEEE Micro"},{"key":"5890_CR16","unstructured":"Chen T, Moreau T, Jiang Z, Zheng L, Yan E, Shen H, Cowan M, Wang L, Hu Y, Ceze L et al (2018) TVM: an automated End-to-End optimizing compiler for deep learning. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp 578\u2013594"},{"key":"5890_CR17","doi-asserted-by":"crossref","unstructured":"Sivathanu M, Chugh T, Singapuram SS, Zhou L (2019) Astra: Exploiting predictability to optimize deep learning. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp 909\u2013923","DOI":"10.1145\/3297858.3304072"},{"issue":"3","key":"5890_CR18","doi-asserted-by":"publisher","first-page":"708","DOI":"10.1109\/TPDS.2020.3030548","volume":"32","author":"M Li","year":"2020","unstructured":"Li M, Liu Y, Liu X, Sun Q, You X, Yang H, Luan Z, Gan L, Yang G, Qian D (2020) The deep learning compiler: a comprehensive survey. IEEE Trans Parallel Distrib Syst 32(3):708\u2013727","journal-title":"IEEE Trans Parallel Distrib Syst"},{"key":"5890_CR19","unstructured":"Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B (2019) Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053"},{"key":"5890_CR20","doi-asserted-by":"crossref","unstructured":"Rasley J, Rajbhandari S, Ruwase O, He Y (2020) DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. 
In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 3505\u20133506","DOI":"10.1145\/3394486.3406703"},{"issue":"4","key":"5890_CR21","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-020-3104-7","volume":"64","author":"J Gao","year":"2021","unstructured":"Gao J, Zheng F, Qi F, Ding Y, Li H, Lu H, He W, Wei H, Jin L, Liu X et al (2021) Sunway supercomputer architecture towards exascale computing: analysis and practice. SCIENCE CHINA Inf Sci 64(4):141101","journal-title":"SCIENCE CHINA Inf Sci"},{"issue":"5","key":"5890_CR22","doi-asserted-by":"publisher","first-page":"146","DOI":"10.3390\/fi14050146","volume":"14","author":"A Asad","year":"2022","unstructured":"Asad A, Kaur R, Mohammadi F (2022) A survey on memory subsystems for deep neural network accelerators. Future Internet 14(5):146","journal-title":"Future Internet"},{"key":"5890_CR23","unstructured":"Shazeer N, Cheng Y, Parmar N, Tran D, Vaswani A, Koanantakool P, Hawkins P, Lee H, Hong M, Young C et al (2018) Mesh-tensorflow: Deep learning for supercomputers. In: Advances in neural information processing systems, vol 31"},{"key":"5890_CR24","doi-asserted-by":"crossref","unstructured":"Narayanan D, Shoeybi M, Casper J, LeGresley P, Patwary M, Korthikanti V, Vainbrand D, Kashinkunti P, Bernauer J, Catanzaro B et al (2021) Efficient large-scale language model training on GPU clusters using megatron-LM. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1\u201315","DOI":"10.1145\/3458817.3476209"},{"key":"5890_CR25","unstructured":"Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al (2019) Pytorch: an imperative style, high-performance deep learning library. 
In: Advances in neural information processing systems, vol 32"},{"key":"5890_CR26","unstructured":"Dao T (2023) Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691"},{"key":"5890_CR27","unstructured":"Kitaev N, Kaiser \u0141, Levskaya A (2020) Reformer: the efficient transformer. arXiv preprint arXiv:2001.04451"},{"key":"5890_CR28","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1162\/tacl_a_00353","volume":"9","author":"A Roy","year":"2021","unstructured":"Roy A, Saffar M, Vaswani A, Grangier D (2021) Efficient content-based sparse attention with routing transformers. Trans Assoc Comput Linguist 9:53\u201368","journal-title":"Trans Assoc Comput Linguist"},{"key":"5890_CR29","first-page":"17413","volume":"34","author":"B Chen","year":"2021","unstructured":"Chen B, Dao T, Winsor E, Song Z, Rudra A, R\u00e9 C (2021) Scatterbrain: unifying sparse and low-rank attention. Adv Neural Inf Process Syst 34:17413\u201317426","journal-title":"Adv Neural Inf Process Syst"}],"container-title":["The Journal of 
Supercomputing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-024-05890-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11227-024-05890-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11227-024-05890-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,10]],"date-time":"2024-06-10T11:13:56Z","timestamp":1718018036000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11227-024-05890-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,11]]},"references-count":29,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2024,7]]}},"alternative-id":["5890"],"URL":"https:\/\/doi.org\/10.1007\/s11227-024-05890-8","relation":{},"ISSN":["0920-8542","1573-0484"],"issn-type":[{"value":"0920-8542","type":"print"},{"value":"1573-0484","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,3,11]]},"assertion":[{"value":"4 January 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 March 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"All authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}