{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T06:16:14Z","timestamp":1778134574370,"version":"3.51.4"},"reference-count":42,"publisher":"SAGE Publications","issue":"3","license":[{"start":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T00:00:00Z","timestamp":1763510400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"funder":[{"DOI":"10.13039\/100016818","name":"UT-Battelle","doi-asserted-by":"publisher","award":["DE-AC05-00OR22725"],"award-info":[{"award-number":["DE-AC05-00OR22725"]}],"id":[{"id":"10.13039\/100016818","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2026,5]]},"abstract":"<jats:p>Vision Transformers (ViTs) are pivotal for foundational models in scientific imagery, including Earth science applications, due to their capability to process large sequence lengths. While transformers for text have inspired scaling sequence lengths in ViTs, adapting these for ViTs introduces unique challenges. We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens. Our approach, leveraging DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training, achieving a 94% batch scaling efficiency on 2,048 AMD-MI250X GPUs. Evaluating sequence parallelism in ViTs, particularly in models up to 10B parameters, highlighted substantial bottlenecks. We countered these with hybrid sequence, pipeline, and flash attention strategies, to scale beyond single GPU memory limits. Our method significantly enhances climate modeling accuracy by 20% in temperature predictions, marking the first training of a vision transformer model to convergence with a sequence length of 188K tokens, using full self-attention.<\/jats:p>","DOI":"10.1177\/10943420251394758","type":"journal-article","created":{"date-parts":[[2025,11,19]],"date-time":"2025-11-19T19:51:27Z","timestamp":1763581887000},"page":"273-290","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":1,"title":["Sequence length scaling in vision transformers for scientific images on frontier"],"prefix":"10.1177","volume":"40","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7734-3349","authenticated-orcid":false,"given":"Aristeidis","family":"Tsaris","sequence":"first","affiliation":[{"name":"National Center for Computational Sciences, Oak Ridge National Laboratory"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chengming","family":"Zhang","sequence":"additional","affiliation":[{"name":"Indiana University Bloomington"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xiao","family":"Wang","sequence":"additional","affiliation":[{"name":"Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Junqi","family":"Yin","sequence":"additional","affiliation":[{"name":"National Center for Computational Sciences, Oak Ridge National Laboratory"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Siyan","family":"Liu","sequence":"additional","affiliation":[{"name":"Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Moetasim","family":"Ashfaq","sequence":"additional","affiliation":[{"name":"Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ming","family":"Fan","sequence":"additional","affiliation":[{"name":"Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jong Youl","family":"Choi","sequence":"additional","affiliation":[{"name":"Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7165-2095","authenticated-orcid":false,"given":"Mohamed","family":"Wahib","sequence":"additional","affiliation":[{"name":"High Performance Artificial Intelligence Systems, RIKEN Center for Computational Science"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Dan","family":"Lu","sequence":"additional","affiliation":[{"name":"Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Prasanna","family":"Balaprakash","sequence":"additional","affiliation":[{"name":"Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0099-1559","authenticated-orcid":false,"given":"Feiyi","family":"Wang","sequence":"additional","affiliation":[{"name":"National Center for Computational Sciences, Oak Ridge National Laboratory"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","published-online":{"date-parts":[[2025,11,19]]},"reference":[{"key":"e_1_3_4_2_1","volume-title":"Flash Attention on ROCm","author":"AMD","year":"2023","unstructured":"AMD (2023) Flash Attention on ROCm. AMD. Available at: https:\/\/github.com\/ROCmSoftwarePlatform\/flash-attentio."},{"key":"e_1_3_4_3_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-023-06185-3"},{"key":"e_1_3_4_4_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-023-02378-7"},{"key":"e_1_3_4_5_1","doi-asserted-by":"crossref","unstructured":"Cha K Seo J Lee T (2023) A billion-scale foundation model for remote sensing images.","DOI":"10.1109\/JSTARS.2024.3401772"},{"key":"e_1_3_4_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01567"},{"key":"e_1_3_4_7_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41612-023-00512-1"},{"key":"e_1_3_4_8_1","unstructured":"Dao T (2023) Flashattention-2: faster attention with better parallelism and work partitioning."},{"key":"e_1_3_4_9_1","doi-asserted-by":"crossref","unstructured":"Dao T Fu DY Ermon S et al. (2022) Flashattention: fast and memory-efficient exact attention with io-awareness. ArXiv abs\/2205.14135. https:\/\/api.semanticscholar.org\/CorpusID:249151871","DOI":"10.52202\/068431-1189"},{"key":"e_1_3_4_10_1","unstructured":"Dee (2022) DeepSpeed. https:\/\/github.com\/microsoft\/DeepSpeed"},{"key":"e_1_3_4_11_1","unstructured":"Dehghani M Djolonga J Mustafa B et al. (2023) Scaling vision transformers to 22 billion parameters."},{"key":"e_1_3_4_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_4_13_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41597-023-02125-y"},{"key":"e_1_3_4_14_1","unstructured":"Dosovitskiy A Beyer L Kolesnikov A et al. (2021) An image is worth 16x16 words: transformers for image recognition at scale."},{"issue":"1","key":"e_1_3_4_15_1","first-page":"1532","article-title":"Switch transformers: scaling to trillion parameter models with simple and efficient sparsity","volume":"23","author":"Fedus W","year":"2022","unstructured":"Fedus W, Zoph B, Shazeer N (2022) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23(1): 1532\u20134435.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_4_16_1","unstructured":"Fro (2022) The frontier supercomputer https:\/\/www.olcf.ornl.gov\/frontier\/"},{"key":"e_1_3_4_17_1","unstructured":"Gupta R Li S Zhu T et al. (2024) xt: nested tokenization for larger context in large images."},{"key":"e_1_3_4_18_1","unstructured":"HRR (2020) The high-resolution rapid refresh. https:\/\/rapidrefresh.noaa.gov\/hrrr\/"},{"key":"e_1_3_4_19_1","unstructured":"Huang Y Cheng Y Bapna A et al. (2019) Gpipe: efficient training of giant neural networks using pipeline parallelism."},{"key":"e_1_3_4_20_1","doi-asserted-by":"crossref","unstructured":"Jacobs SA Tanaka M Zhang C et al. (2023) Deepspeed ulysses: system optimizations for enabling training of extreme long sequence transformer models.","DOI":"10.1109\/IPDPSW63119.2024.00208"},{"key":"e_1_3_4_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3505244"},{"key":"e_1_3_4_22_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.media.2020.101854"},{"key":"e_1_3_4_23_1","unstructured":"Korthikanti V Casper J Lym S et al. (2022) Reducing activation recomputation in large transformer models."},{"key":"e_1_3_4_24_1","doi-asserted-by":"crossref","unstructured":"Li S Xue F Baranwal C et al. (2022) Sequence parallelism: long sequence training from system perspective.","DOI":"10.18653\/v1\/2023.acl-long.134"},{"key":"e_1_3_4_25_1","doi-asserted-by":"crossref","unstructured":"Li C Gan Z Yang Z et al. (2023) Multimodal foundation models: from specialists to general-purpose assistants.","DOI":"10.1561\/9781638283379"},{"key":"e_1_3_4_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_4_27_1","unstructured":"Liu H Zaharia M Abbeel P (2023) Ring attention with blockwise transformers for near-infinite context."},{"key":"e_1_3_4_28_1","unstructured":"Meg (2022) Megatron-DeepSpeed. https:\/\/github.com\/microsoft\/Megatron-DeepSpeed"},{"key":"e_1_3_4_29_1","unstructured":"Nguyen T Brandstetter J Kapoor A et al. (2023) Climax: a foundation model for weather and climate. arXiv preprint arXiv:2301.10343."},{"key":"e_1_3_4_30_1","unstructured":"Pathak J Subramanian S Harrington P et al. (2022) Fourcastnet: a global data-driven high-resolution weather model using adaptive fourier neural operators. arXiv preprint arXiv:2202.11214."},{"key":"e_1_3_4_31_1","doi-asserted-by":"crossref","unstructured":"Rajbhandari S Ruwase O Rasley J et al. (2021) Zero-infinity: breaking the gpu memory wall for extreme scale deep learning.","DOI":"10.1145\/3458817.3476205"},{"key":"e_1_3_4_32_1","unstructured":"Ren J Rajbhandari S Aminabadi RY et al. (2021) Zero-offload: democratizing billion-scale model training."},{"key":"e_1_3_4_33_1","unstructured":"Shoeybi M Patwary M Puri R et al. (2020) Megatron-lm: training multi-billion parameter language models using model parallelism."},{"key":"e_1_3_4_34_1","unstructured":"Simmons A Soci C Nicolas J et al. (2020) Global stratospheric temperature bias and other stratospheric aspects of era5 and era5. https:\/\/www.ecmwf.int\/node\/19362"},{"key":"e_1_3_4_35_1","unstructured":"Vaswani A Shazeer N Parmar N et al. (2023) Attention is all you need."},{"key":"e_1_3_4_36_1","unstructured":"Wang S Li BZ Khabsa M et al. (2020) Linformer: self-Attention with linear complexity."},{"key":"e_1_3_4_37_1","unstructured":"Wang X Lyngaas I Tsaris A et al. (2023) Ultra-long sequence distributed transformer."},{"key":"e_1_3_4_38_1","unstructured":"xes (2020) xesmf: universal regridder for geospatial data. https:\/\/doi.org\/10.5281\/zenodo.4294774"},{"key":"e_1_3_4_39_1","doi-asserted-by":"crossref","unstructured":"Xiong Z Wang Y Zhang F et al. (2024) One for all: toward unified foundation models for Earth vision.","DOI":"10.1109\/IGARSS53475.2024.10641637"},{"key":"e_1_3_4_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/3581784.3613215"},{"key":"e_1_3_4_41_1","doi-asserted-by":"crossref","unstructured":"Yin J Bose A Cong G et al. (2024) Comparative study of large language model architectures on frontier.","DOI":"10.1109\/IPDPS57955.2024.00056"},{"key":"e_1_3_4_42_1","doi-asserted-by":"crossref","unstructured":"Zhao Y Gu A Varma R et al. (2023) Pytorch fsdp: experiences on scaling fully sharded data parallel.","DOI":"10.14778\/3611540.3611569"},{"key":"e_1_3_4_43_1","unstructured":"Zhao X Cheng S Zheng Z et al. (2024) Dsp: dynamic sequence parallelism for multi-dimensional transformers."}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420251394758","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/10943420251394758","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420251394758","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T05:46:18Z","timestamp":1778132778000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/10943420251394758"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,19]]},"references-count":42,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2026,5]]}},"alternative-id":["10.1177\/10943420251394758"],"URL":"https:\/\/doi.org\/10.1177\/10943420251394758","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,19]]}}}