{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T09:40:59Z","timestamp":1775122859509,"version":"3.50.1"},"reference-count":47,"publisher":"IOP Publishing","issue":"4","license":[{"start":{"date-parts":[[2024,12,4]],"date-time":"2024-12-04T00:00:00Z","timestamp":1733270400000},"content-version":"vor","delay-in-days":3,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"},{"start":{"date-parts":[[2024,12,4]],"date-time":"2024-12-04T00:00:00Z","timestamp":1733270400000},"content-version":"tdm","delay-in-days":3,"URL":"https:\/\/iopscience.iop.org\/info\/page\/text-and-data-mining"}],"funder":[{"DOI":"10.13039\/501100010031","name":"Postdoctoral Research Foundation of China","doi-asserted-by":"crossref","award":["2022M712540"],"award-info":[{"award-number":["2022M712540"]}],"id":[{"id":"10.13039\/501100010031","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["iopscience.iop.org"],"crossmark-restriction":false},"short-container-title":["Mach. Learn.: Sci. Technol."],"published-print":{"date-parts":[[2024,12,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>In large language models (LLMs), full-parameter fine-tuning is crucial for task-specific adaptation. Traditionally, this relies on deep learning training frameworks utilizing the back-propagation scheme. However, this scheme presents inherent issues, e.g. activation memory bottlenecks and backward locking, which limit the efficient computational resource usage. In this work, we propose the design and analysis of ZeROf-Offload, an innovative fine-tuning framework that adapts the forward-gradient scheme. This framework adopts a unique forward-gradient-oriented CPU offload strategy, enabling fine-tuning of billion-scale LLMs solely in the forward phase and enhancing computational efficiency. Empirical evaluations reveal the advantage of eliminating the backward phase in fine-tuning. ZeROf-Offload achieves134 TFlops\/GPU for models with over 130 billion parameters on a single DGX-A100 node, outperforming DeepSpeed\u2019s ZeRO-Offload, which achieves 102 TFlops\/GPU for models with up to 53.7 billion parameters, the largest size manageable within GPU memory limitations. Furthermore, we have expanded ZeROf-Offload for multi-DGX-A100 environments with integrated 3D parallelism, achieving near-linear speedup across up to 128 GPUs and the token throughput by 1.4x and 1.5x, respectively. 
The experimental results demonstrate that the proposed ZeROf-Offload has achieved the highest throughput performance compared to all examined state-of-the-art frameworks.<\/jats:p>","DOI":"10.1088\/2632-2153\/ad9667","type":"journal-article","created":{"date-parts":[[2024,11,22]],"date-time":"2024-11-22T23:00:51Z","timestamp":1732316451000},"page":"045054","update-policy":"https:\/\/doi.org\/10.1088\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["ZeROf-Offload: forward-gradient scheme for efficient full parameter fine-tuning of billion-scale language models"],"prefix":"10.1088","volume":"5","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4198-9009","authenticated-orcid":true,"given":"Jian","family":"Zhu","sequence":"first","affiliation":[]},{"given":"Peicheng","family":"Feng","sequence":"additional","affiliation":[]},{"given":"Jiawei","family":"Lu","sequence":"additional","affiliation":[]},{"given":"Bowei","family":"Fang","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0009-0009-0523-8337","authenticated-orcid":true,"given":"Hesong","family":"Yang","sequence":"additional","affiliation":[]}],"member":"266","published-online":{"date-parts":[[2024,12,4]]},"reference":[{"key":"mlstad9667bib1","article-title":"Chatlaw: open-source legal large language model with integrated external knowledge bases","author":"Cui","year":"2023"},{"key":"mlstad9667bib2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3605943","article-title":"Recent advances in natural language processing via large pre-trained language models: a survey","volume":"56","author":"Min","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"mlstad9667bib3","article-title":"Towards expert-level medical question answering with large language models","author":"Singhal","year":"2023"},{"key":"mlstad9667bib4","article-title":"Deepspeed4science initiative: enabling large-scale scientific discovery through sophisticated AI system technologies","author":"Song","year":"2023"},{"key":"mlstad9667bib5","article-title":"ZeroQuant-HERO: hardware-enhanced robust optimized post-training quantization framework for w8a8 trans-formers","author":"Yao","year":"2023"},{"key":"mlstad9667bib6","article-title":"Geogpt: understanding and processing geospatial tasks through an autonomous GPT","author":"Zhang","year":"2023"},{"key":"mlstad9667bib7","first-page":"1686","article-title":"Black-box prompt tuning for vision-language model as service Proc.","author":"Yu","year":"2023"},{"key":"mlstad9667bib8","first-page":"9733","article-title":"Evaluating commonsense in pre-trained language models","volume":"vol 34","author":"Zhou","year":"2020"},{"key":"mlstad9667bib9","doi-asserted-by":"publisher","first-page":"763","DOI":"10.1017\/S1351324921000322","article-title":"Emerging trends: a gentle introduction to fine-tuning","volume":"27","author":"Church","year":"2021","journal-title":"Nat. Lang. Eng."},{"key":"mlstad9667bib10","doi-asserted-by":"publisher","first-page":"220","DOI":"10.1038\/s42256-023-00626-4","article-title":"Parameter-efficient fine tuning of large-scale pre-trained language models","volume":"5","author":"Ding","year":"2023","journal-title":"Nat. Mach. Intell."},{"key":"mlstad9667bib11","doi-asserted-by":"publisher","first-page":"1","DOI":"10.5555\/3648699.3648939","article-title":"Palm: scaling language modeling with pathways","volume":"24","author":"Chowdhery","year":"2023","journal-title":"J. Mach. Learn. 
Res."},{"key":"mlstad9667bib12","article-title":"Mixed precision training","author":"Micikevicius","year":"2018"},{"key":"mlstad9667bib13","doi-asserted-by":"publisher","first-page":"406","DOI":"10.3389\/fnins.2020.00406","article-title":"Mixed-precision deep learning based on computational memory","volume":"14","author":"Nandakumar","year":"2020","journal-title":"Front. Neurosci."},{"key":"mlstad9667bib14","first-page":"551","article-title":"ZeRO-Offload: democratizing billion-scale model training","author":"Ren","year":"2021"},{"key":"mlstad9667bib15","author":"DeepSpeed Team and Rangan Majumder","year":"2020"},{"key":"mlstad9667bib16","first-page":"21","article-title":"A theoretical framework for back-propagation","volume":"vol 1","author":"LeCun","year":"1988"},{"key":"mlstad9667bib17","first-page":"41","article-title":"Superneurons: dynamic GPU memory management for training deep neural networks","author":"Wang","year":"2018"},{"key":"mlstad9667bib18","article-title":"Training deep nets with sublinear memory cost","author":"Chen","year":"2016"},{"key":"mlstad9667bib19","first-page":"497","article-title":"Checkmate: breaking the memory wall with optimal tensor rematerialization","volume":"vol 2","author":"Jain","year":"2020"},{"key":"mlstad9667bib20","first-page":"1","article-title":"Beyond data and model parallelism for deep neural networks","volume":"vol 1","author":"Jia","year":"2019"},{"key":"mlstad9667bib21","article-title":"Gradients without back-propagation","author":"Baydin","year":"2022"},{"key":"mlstad9667bib22","doi-asserted-by":"publisher","first-page":"243","DOI":"10.1016\/0893-6080(89)90035-X","article-title":"Neural network models for pattern recognition and associative memory","volume":"2","author":"Carpenter","year":"1989","journal-title":"Neural Netw."},{"key":"mlstad9667bib23","first-page":"776","article-title":"Gist: efficient data encoding for deep neural network training","author":"Jain","year":"2018"},{"key":"mlstad9667bib24","article-title":"Gradient following without back-propagation in layered networks","volume":"2","author":"Barto","year":"1987"},{"key":"mlstad9667bib25","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3234150","article-title":"A survey on deep learning: algorithms, techniques, and applications","volume":"51","author":"Pouyanfar","year":"2018","journal-title":"ACM Comput. 
Surv."},{"key":"mlstad9667bib26","doi-asserted-by":"publisher","first-page":"292","DOI":"10.3390\/electronics8030292","article-title":"A state-of-the-art survey on deep learning theory and architectures","volume":"8","author":"Alom","year":"2019","journal-title":"Electronics"},{"key":"mlstad9667bib27","first-page":"5085","article-title":"The HSIC bottleneck: deep learning without back-propagation","volume":"vol 34","author":"Ma","year":"2020"},{"key":"mlstad9667bib28","doi-asserted-by":"publisher","first-page":"53040","DOI":"10.1109\/ACCESS.2019.2912200","article-title":"Review of deep learning algorithms and architectures","volume":"7","author":"Shrestha","year":"2019","journal-title":"IEEE Access"},{"key":"mlstad9667bib29","article-title":"Deepzero: scaling up zeroth-order optimization for deep model training","author":"Chen","year":"2023"},{"key":"mlstad9667bib30","article-title":"Can forward gradient match backpropagation?","author":"Fournier","year":"2023"},{"key":"mlstad9667bib31","article-title":"Federated fine-tuning of billion-sized language models across mobile devices","author":"Xu","year":"2023"},{"key":"mlstad9667bib32","doi-asserted-by":"crossref","DOI":"10.14778\/3415478.3415530","article-title":"Pytorch distributed: experiences on accelerating data parallel training","author":"Li","year":"2020"},{"key":"mlstad9667bib33","article-title":"Mesh-tensorflow: deep learning for super-computers","author":"Shazeer","year":"2018"},{"key":"mlstad9667bib34","article-title":"Gpipe: efficient training of giant neural networks using pipeline parallelism","author":"Huang","year":"2019"},{"key":"mlstad9667bib35","doi-asserted-by":"crossref","DOI":"10.1109\/SC41405.2020.00024","article-title":"Zero: memory optimizations toward training trillion parameter models","author":"Rajbhandari","year":"2020"},{"key":"mlstad9667bib36","article-title":"Adam: a method for stochastic optimization","author":"Kingma","year":"2014"},{"key":"mlstad9667bib37","first-page":"1341","article-title":"Swapadvisor: pushing deep learning beyond the GPU memory limit via smart swapping","author":"Huang","year":"2020"},{"key":"mlstad9667bib38","article-title":"Training large neural networks with constant memory using a new execution algorithm","author":"Pudipeddi","year":"2020"},{"key":"mlstad9667bib39","first-page":"598","article-title":"Sentinel: efficient tensor migration and allocation on heterogeneous memory systems for deep learning","author":"Ren","year":"2021"},{"key":"mlstad9667bib40","first-page":"556","article-title":"Mpress: democratizing billion-scale model training on multi-GPU servers via memory-saving inter-operator parallelism","author":"Zhou","year":"2023"},{"key":"mlstad9667bib41","doi-asserted-by":"crossref","DOI":"10.1145\/3458817.3476205","article-title":"Zero-infinity: breaking the GPU memory wall for extreme scale deep learning","author":"Rajbhandari","year":"2021"},{"key":"mlstad9667bib42","article-title":"Unifying data, model and hybrid parallelism in deep learning via tensor tiling","author":"Wang","year":"2018"},{"key":"mlstad9667bib43","article-title":"Llama: open and efficient foundation language models","author":"Touvron","year":"2023"},{"key":"mlstad9667bib44","doi-asserted-by":"publisher","DOI":"10.1080\/10538712.2011.575445","article-title":"Choice of plausible alternatives: an evaluation of commonsense causal reasoning","author":"Roemmele","year":"2011"},{"key":"mlstad9667bib45","doi-asserted-by":"crossref","DOI":"10.1145\/3458817.3476209","article-title":"Efficient large-scale language 
model training on GPU clusters using Megatron-LM","author":"Narayanan","year":"2021"},{"key":"mlstad9667bib46","article-title":"The forward-forward algorithm: some preliminary investigations","author":"Hinton","year":"2022"},{"key":"mlstad9667bib47","doi-asserted-by":"publisher","first-page":"3294","DOI":"10.1109\/TPDS.2023.3323282","article-title":"Communication optimization algorithms for distributed deep learning systems: a survey","volume":"vol 34","author":"Yu","year":"2023"}],"container-title":["Machine Learning: Science and Technology"],"original-title":[],"link":[{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ad9667","content-type":"text\/html","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ad9667\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ad9667","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ad9667\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ad9667\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ad9667\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ad9667\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"similarity-checking"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ad9667\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,4]],"date-time":"2024-12-04T12:51:58Z","timestamp":1733316718000},"score":1,"resource":{"primary":{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ad9667"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,1]]},"references-count":47,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,12,4]]},"published-print":{"date-parts":[[2024,12,1]]}},"URL":"https:\/\/doi.org\/10.1088\/2632-2153\/ad9667","relation":{},"ISSN":["2632-2153"],"issn-type":[{"value":"2632-2153","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,12,1]]},"assertion":[{"value":"ZeROf-Offload: forward-gradient scheme for efficient full parameter fine-tuning of billion-scale language models","name":"article_title","label":"Article Title"},{"value":"Machine Learning: Science and Technology","name":"journal_title","label":"Journal Title"},{"value":"paper","name":"article_type","label":"Article Type"},{"value":"\u00a9 2024 The Author(s). Published by IOP Publishing Ltd","name":"copyright_information","label":"Copyright Information"},{"value":"2024-06-20","name":"date_received","label":"Date Received","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2024-11-22","name":"date_accepted","label":"Date Accepted","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2024-12-04","name":"date_epub","label":"Online publication date","group":{"name":"publication_dates","label":"Publication dates"}}]}}
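Note on the record above: the central mechanism named in the abstract is the forward-gradient estimator of Baydin et al (reference mlstad9667bib21, "Gradients without back-propagation"). The following is a minimal sketch of that estimator, not the paper's actual ZeROf-Offload implementation; it assumes PyTorch 2.x with torch.func, and loss_fn, the tangent distribution, and the step size are illustrative stand-ins.

```python
import torch
from torch.func import jvp

def loss_fn(w: torch.Tensor) -> torch.Tensor:
    # Toy quadratic objective, standing in for an LLM fine-tuning loss.
    return 0.5 * (w ** 2).sum()

w = torch.randn(8)  # stand-in for the model parameters
for step in range(100):
    v = torch.randn_like(w)  # random tangent; for v ~ N(0, I), E[v v^T] = I
    # A single forward-mode pass returns the loss and the directional
    # derivative (grad f(w) . v): no backward pass, no stored activations.
    loss, dir_deriv = jvp(loss_fn, (w,), (v,))
    g_hat = dir_deriv * v  # unbiased estimate of grad f(w)
    w = w - 0.05 * g_hat   # plain SGD-style update; step size is illustrative
```

Because the estimate needs only forward evaluation, parameters and optimizer state can in principle reside on CPU and be streamed to the GPU layer by layer, which is the offload direction the abstract describes; the actual scheduling and 3D-parallel integration are specific to the paper.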