{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,6]],"date-time":"2026-02-06T04:30:54Z","timestamp":1770352254690,"version":"3.49.0"},"reference-count":26,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,2,15]],"date-time":"2024-02-15T00:00:00Z","timestamp":1707955200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Des. Autom. Electron. Syst."],"published-print":{"date-parts":[[2024,3,31]]},"abstract":"<jats:p>Over the past decade, machine learning model complexity has grown at an extraordinary rate, as has the scale of the systems training such large models. However, hardware utilization in large-scale AI systems is alarmingly low (5\u201320%). This low system utilization is the cumulative effect of minor losses across different layers of the stack, exacerbated by the disconnect between the engineers designing different layers, who often work in different industries. To address this challenge, in this work we design a cross-stack performance modeling and design space exploration framework. First, we introduce CrossFlow, a novel framework that enables cross-layer analysis all the way from the technology layer to the algorithmic layer. Next, we introduce DeepFlow (built on top of CrossFlow using machine learning techniques) to automate the design space exploration and co-optimization across different layers of the stack. 
We have validated CrossFlow\u2019s accuracy with distributed training on real commercial hardware, and we showcase several DeepFlow case studies demonstrating the pitfalls of not optimizing across the technology-hardware-software stack for what is likely the most important workload driving large development investments across all aspects of the computing stack.<\/jats:p>","DOI":"10.1145\/3635867","type":"journal-article","created":{"date-parts":[[2023,12,21]],"date-time":"2023-12-21T11:51:46Z","timestamp":1703159506000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["DeepFlow: A Cross-Stack Pathfinding Framework for Distributed AI Systems"],"prefix":"10.1145","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9975-4819","authenticated-orcid":false,"given":"Newsha","family":"Ardalani","sequence":"first","affiliation":[{"name":"Meta, Inc., USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8777-8573","authenticated-orcid":false,"given":"Saptadeep","family":"Pal","sequence":"additional","affiliation":[{"name":"UCLA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6188-1134","authenticated-orcid":false,"given":"Puneet","family":"Gupta","sequence":"additional","affiliation":[{"name":"UCLA, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,2,15]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"OpenAI. AI and Compute. ([n. d.]). https:\/\/openai.com\/blog\/ai-and-compute\/"},{"key":"e_1_3_2_3_2","article-title":"Accelerating Software 2.0","author":"Olukotun Kunle","year":"2020","unstructured":"Kunle Olukotun. 2020. Accelerating Software 2.0. ScaledML (2020).","journal-title":"ScaledML"},{"key":"e_1_3_2_4_2","article-title":"Beyond data and model parallelism for deep neural networks","author":"Jia Zhihao","year":"2018","unstructured":"Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond data and model parallelism for deep neural networks. 
arXiv preprint arXiv:1807.05358 (2018).","journal-title":"arXiv preprint arXiv:1807.05358"},{"key":"e_1_3_2_5_2","unstructured":"Amazon AWS Inferentia. Achieve 12x Higher Throughput and Lowest Latency for PyTorch Natural Language Processing Applications out-of-the-Box on AWS Inferentia. https:\/\/tinyurl.com\/3mbuetmr (accessed Sep. 10 2021)."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00042"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2020.2985963"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446762"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.29"},{"key":"e_1_3_2_10_2","first-page":"337","volume-title":"2020 USENIX Annual Technical Conference (USENIX ATC 20)","author":"Zhu Hongyu","year":"2020","unstructured":"Hongyu Zhu, Amar Phanishayee, and Gennady Pekhimenko. 2020. Daydream: Accurately estimating the efficacy of optimizations for DNN training. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). 337\u2013352."},{"key":"e_1_3_2_11_2","first-page":"503","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Geoffrey X. Yu","year":"2021","unstructured":"Geoffrey X. Yu, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko. 2021. Habitat: A Runtime-Based computational performance predictor for deep neural network training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 503\u2013521."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS48437.2020.00018"},{"key":"e_1_3_2_13_2","unstructured":"William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. 2023. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale. (2023). 
arxiv:cs.DC\/2303.14006"},{"key":"e_1_3_2_14_2","volume-title":"System-Level Interconnect - Problems and Pathfinding Workshop (SLIP \u201920)","author":"Pal Saptadeep","year":"2020","unstructured":"Saptadeep Pal and Puneet Gupta. 2020. Pathfinding for 2.5D interconnect technologies. In System-Level Interconnect - Problems and Pathfinding Workshop (SLIP \u201920). ACM, New York, NY, USA, 8."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/1498765.1498785"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2016.2616357"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295710"},{"key":"e_1_3_2_18_2","unstructured":"Deep Learning\u2019s Diminishing Returns. ([n. d.]). https:\/\/spectrum.ieee.org\/deep-learning-computational-cost. Accessed: 2021-10-15."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.vlsi.2017.02.002"},{"key":"e_1_3_2_20_2","unstructured":"Wikichip: Technology Node. ([n. d.]). https:\/\/en.wikichip.org\/wiki\/technology_node. Accessed: 2021-10-15."},{"key":"e_1_3_2_21_2","unstructured":"HBM3: Big Impact on Chip Design. ([n. d.]). https:\/\/semiengineering.com\/hbm3s-impact-on-chip-design\/. Accessed: 2021-10-15."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3544216.3544265"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","unstructured":"Ryohei Urata, Hong Liu, Kevin Yasumura, Erji Mao, Jill Berger, Xiang Zhou, Cedric Lam, Roy Bannon, Darren Hutchinson, Daniel Nelson, Leon Poutievski, Arjun Singh, Joon Ong, and Amin Vahdat. 2022. Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale. (2022). DOI:10.48550\/ARXIV.2208.10041","DOI":"10.48550\/ARXIV.2208.10041"},{"key":"e_1_3_2_24_2","unstructured":"NVIDIA. 2022. NVLink and NVSwitch. 
https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/ (2022)."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080231"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00042"},{"key":"e_1_3_2_27_2","unstructured":"Tesla Dojo. ([n. d.]). https:\/\/www.nextplatform.com\/2022\/08\/23\/inside-teslas-innovative-and-homegrown-dojo-ai-supercomputer\/. Accessed: 2022-10-15."}],"container-title":["ACM Transactions on Design Automation of Electronic Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3635867","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3635867","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:57:00Z","timestamp":1750291020000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3635867"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,2,15]]},"references-count":26,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,3,31]]}},"alternative-id":["10.1145\/3635867"],"URL":"https:\/\/doi.org\/10.1145\/3635867","relation":{},"ISSN":["1084-4309","1557-7309"],"issn-type":[{"value":"1084-4309","type":"print"},{"value":"1557-7309","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,2,15]]},"assertion":[{"value":"2023-06-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-11-04","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-02-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}