{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T16:07:09Z","timestamp":1772813229668,"version":"3.50.1"},"reference-count":0,"publisher":"IOS Press","isbn-type":[{"value":"9781643686547","type":"electronic"}],"license":[{"start":{"date-parts":[[2026,3,4]],"date-time":"2026-03-04T00:00:00Z","timestamp":1772582400000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026,3,4]]},"abstract":"<jats:p>Rapidly developing large language models (LLMs) stimulate the search for efficient training strategies coping with the large computational expenses caused by distributed training. Gradient synchronization is an important component of the overall distributed training process, especially in Sharded Parallelism mode. In this paper, we present Transformer-Aware Gradient Compression-2 (TAGC-2), a further advancement of the gradient compression algorithm that specializes in transformer-based models. TAGC-2 further develops the TAGC method with the following improvements: the communication\/computation overlap was optimized; the method was adapted to the network conditions currently used employing bfloat16 quantization and rewriting sparsification as kernels. TAGC-2 enables long training runs as it implements checkpointing and restarts upon network or memory failures. The experiments demonstrate that TAGC-2 improves the wall clock training time by 3.1% under low network bandwidth conditions and shortens the iteration time by 4.6-10% with a minimal loss degradation under common network bandwidth conditions, compared to the Fully Sharded Data Parallel (FSDP) baseline. \nThe implementation is publicly available as an open source code at https:\/\/github.com\/ipolyakov\/TAGC.<\/jats:p>","DOI":"10.3233\/faia260015","type":"book-chapter","created":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T10:20:46Z","timestamp":1772792446000},"source":"Crossref","is-referenced-by-count":0,"title":["TAGC-2: More Efficient Transformer Training in Distributed Environments"],"prefix":"10.3233","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-0229-5380","authenticated-orcid":false,"given":"Igor","family":"Polyakov","sequence":"first","affiliation":[{"name":"ITMO University, Russia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1011-9932","authenticated-orcid":false,"given":"Alexey","family":"Dukhanov","sequence":"additional","affiliation":[{"name":"ITMO University, Russia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"7437","container-title":["Frontiers in Artificial Intelligence and Applications","Machine Learning and Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/ebooks.iospress.nl\/pdf\/doi\/10.3233\/FAIA260015","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,6]],"date-time":"2026-03-06T10:20:46Z","timestamp":1772792446000},"score":1,"resource":{"primary":{"URL":"https:\/\/ebooks.iospress.nl\/doi\/10.3233\/FAIA260015"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,3,4]]},"ISBN":["9781643686547"],"references-count":0,"URL":"https:\/\/doi.org\/10.3233\/faia260015","relation":{},"ISSN":["0922-6389","1879-8314"],"issn-type":[{"value":"0922-6389","type":"print"},{"value":"1879-8314","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,3,4]]}}}