{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,17]],"date-time":"2025-12-17T13:17:27Z","timestamp":1765977447830,"version":"3.48.0"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"name":"Theme-based Research Scheme","award":["T45-701\/22-R"],"award-info":[{"award-number":["T45-701\/22-R"]}]},{"name":"AVNET-HKU Emerging Microelectronics and Ubiquitous Systems (EMUS) Lab"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Des. Autom. Electron. Syst."],"published-print":{"date-parts":[[2026,3,31]]},"abstract":"<jats:p>\n                    The acceleration of inference process for deep learning models is closely tied with the parallelization capability of computational graph operators and the parallel scheduling strategies. Most existing deep learning compilers focus on optimizing intra-operator parallelism, while neglecting inter-operator parallelism. Furthermore, most industrial inference engines, such as PyTorch and TensorFlow, utilize a dataflow-based model to describe tasks and schedule operators. They are computationally expensive and operate in a topological order and are parallelized to run within a single CUDA stream. However, they fail to fully exploit the parallelism capabilities of multiple CUDA streams. In this article, we propose PPD, a portable, highly parallel dispatching system. It boosts the inference performance by dividing the computational graph into multiple taskflow-based subgraphs. Additionally, PPD entails a dispatching algorithm on a single GPU with multiple CUDA streams to enhance the parallelism and performance of model inference. PPD offers users a lightweight model definition and an inference C++ interface, allowing for seamless integration into any context. We also verify the feasibility of PPD on AMD and other graphics cards. We validate our approach on widely adopted neural network models with varying degrees of parallelism, and compare it with industrial inference engines. Experiments demonstrate that PPD outperforms SOTA methods by up to\n                    <jats:bold>2.28\u00d7<\/jats:bold>\n                    .\n                  <\/jats:p>","DOI":"10.1145\/3773039","type":"journal-article","created":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T10:59:45Z","timestamp":1761562785000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["PPD: A Portable and Highly Parallel Dispatching System for Deep Learning"],"prefix":"10.1145","volume":"31","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5857-7393","authenticated-orcid":false,"given":"Wendong","family":"Xu","sequence":"first","affiliation":[{"name":"Electrical and Electronic Engineering, The University of Hong Kong","place":["Hong Kong, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-9376-753X","authenticated-orcid":false,"given":"Yuhao","family":"Ji","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, The Chinese University of Hong Kong","place":["Hong Kong, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5337-1783","authenticated-orcid":false,"given":"Yang","family":"Bai","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, The Chinese University of Hong Kong","place":["Hong Kong, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8874-6269","authenticated-orcid":false,"given":"Yueting","family":"Li","sequence":"additional","affiliation":[{"name":"Beihang University","place":["Beijing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5995-4763","authenticated-orcid":false,"given":"Yuxuan","family":"Zhao","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, The Chinese University of Hong Kong","place":["Hong Kong, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7968-9469","authenticated-orcid":false,"given":"Zhengwu","family":"Liu","sequence":"additional","affiliation":[{"name":"Electrical and Electronic Engineering, The University of Hong Kong","place":["Hong Kong, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6406-4810","authenticated-orcid":false,"given":"Bei","family":"Yu","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, The Chinese University of Hong Kong","place":["Hong Kong, Hong Kong"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3026-0108","authenticated-orcid":false,"given":"Ngai","family":"Wong","sequence":"additional","affiliation":[{"name":"Electrical and Electronic Engineering, The University of Hong Kong","place":["Hong Kong, Hong Kong"]}]}],"member":"320","published-online":{"date-parts":[[2025,12,17]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"265","volume-title":"Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et\u00a0al. 2016. [TensorFlow]: A system for [Large-Scale] machine learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI). 265\u2013283."},{"key":"e_1_3_2_3_2","unstructured":"A. Paszke S. Gross F. Massa A. Lerer J. Bradbury G. Chanan and S. Chintala. 2019. Pytorch: An imperative style high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)."},{"key":"e_1_3_2_4_2","unstructured":"Tianqi Chen Mu Li Yutian Li Min Lin Naiyan Wang Minjie Wang Tianjun Xiao Bing Xu Chiyuan Zhang and Zheng Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274. Retrieved from https:\/\/arxiv.org\/abs\/1512.01274"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2020.3030548"},{"key":"e_1_3_2_6_2","unstructured":"T. Chen T. Moreau Z. Jiang H. Shen E. Q. Yan L. Wang and A. Krishnamurthy. 2018. TVM: end-to-end optimization stack for deep learning. 11 (2018) 20. arXiv preprint arXiv:1802.04799."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.2983686"},{"key":"e_1_3_2_9_2","unstructured":"Xu Qin and Zhilin Wang. 2019. Nasnet: A neuron attention stage-by-stage net for single image deraining. arXiv:1912.03151. Retrieved from https:\/\/arxiv.org\/abs\/1912.03151"},{"key":"e_1_3_2_10_2","first-page":"19","volume-title":"Proceedings of the European Conference on Computer Vision (ECCV)","author":"Liu Chenxi","year":"2018","unstructured":"Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. 2018. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV). 19\u201334."},{"key":"e_1_3_2_11_2","first-page":"881","volume-title":"Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Ma Lingxiao","year":"2020","unstructured":"Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling holistic deep learning compiler optimizations with [rTasks]. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI). 881\u2013897."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-015-1483-z"},{"key":"e_1_3_2_13_2","first-page":"1","volume-title":"Proceedings of the ACM\/IEEE International Symposium on Code Generation and Optimization (CGO)","author":"Wu Peng","year":"2023","unstructured":"Peng Wu. 2023. PyTorch 2.0: The journey to bringing compiler technologies to the core of PyTorch (keynote). In Proceedings of the ACM\/IEEE International Symposium on Code Generation and Optimization (CGO). 1\u20131."},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1145\/3315508.3329973","volume-title":"Proceedings of the ACM SIGPLAN International Workshop on Machine Learning and Programming Languages","author":"Tillet Philippe","year":"2019","unstructured":"Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 10\u201319."},{"key":"e_1_3_2_15_2","first-page":"370","volume-title":"Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)","author":"Ding Yaoyao","year":"2023","unstructured":"Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. 2023. Hidet: Task-mapping programming paradigm for deep learning tensor programs. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 370\u2013384."},{"key":"e_1_3_2_16_2","unstructured":"Team Hidet. 2023. Introducing hidet: A deep learning compiler for efficient model serving. Retrieved from https:\/\/pytorch.org\/blog\/introducing-hidet\/"},{"key":"e_1_3_2_17_2","unstructured":"Vijay Thakkar Pradeep Ramani Cris Cecka Aniket Shivam Honghao Lu Ethan Yan Jack Kosaian Mark Hoemmen Haicheng Wu Andrew Kerr Matt Nicely Duane Merrill Dustyn Blasig Fengqi Qiao Piotr Majcher Paul Springer Markus Hohnerbach Jin Wang and Manish Gupta. 2023. CUTLASS. (Jan.2023). Retrieved from https:\/\/github.com\/NVIDIA\/cutlass"},{"key":"e_1_3_2_18_2","first-page":"981","volume-title":"Proceedings of the IEEE International Conference on Parallel and Distributed Systems (ICPADS)","author":"Sourouri Mohammed","year":"2014","unstructured":"Mohammed Sourouri, Tor Gillberg, Scott B. Baden, and Xing Cai. 2014. Effective multi-GPU communication using multiple CUDA streams and threads. In Proceedings of the IEEE International Conference on Parallel and Distributed Systems (ICPADS). 981\u2013986."},{"issue":"3","key":"e_1_3_2_19_2","doi-asserted-by":"crossref","first-page":"510","DOI":"10.31577\/cai_2020_3_510","article-title":"Investigation of parallel data processing using hybrid high performance CPU+ GPU systems and CUDA streams","volume":"39","author":"Czarnul Pawe\u0142","year":"2020","unstructured":"Pawe\u0142 Czarnul. 2020. Investigation of parallel data processing using hybrid high performance CPU+ GPU systems and CUDA streams. Computing and Informatics 39, 3 (2020), 510\u2013536.","journal-title":"Computing and Informatics"},{"key":"e_1_3_2_20_2","unstructured":"Jan Novotn\u1ef3 Karel Ad\u00e1mek and Wes Armour. 2021. Implementing CUDA streams into astroaccelerate\u2014a case study. arXiv:2101.00941. Retrieved from https:\/\/arxiv.org\/abs\/2101.00941"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.swevo.2022.101153"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2023.102888"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41406.2024.00049"},{"key":"e_1_3_2_24_2","first-page":"8343","article-title":"Nimble: Lightweight and parallel GPU task scheduling for deep learning","volume":"33","author":"Kwon Woosuk","year":"2020","unstructured":"Woosuk Kwon, Gyeong-In Yu, Eunji Jeong, and Byung-Gon Chun. 2020. Nimble: Lightweight and parallel GPU task scheduling for deep learning. Annual Conference on Neural Information Processing Systems 33 (2020), 8343\u20138354.","journal-title":"Annual Conference on Neural Information Processing Systems"},{"key":"e_1_3_2_25_2","unstructured":"ONNX Runtime developers. 2018. ONNX Runtime. (Nov.2018). Retrieved from https:\/\/github.com\/microsoft\/onnxruntime"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-16-2233-5"},{"key":"e_1_3_2_27_2","first-page":"255","volume-title":"Proceedings of the IEEE Workshop on Multimedia Signal Processing","author":"Libal Vit","year":"2007","unstructured":"Vit Libal, Jonathan Connell, Gerasimos Potamianos, and Etienne Marcheret. 2007. An embedded system for in-vehicle visual speech activity detection. In Proceedings of the IEEE Workshop on Multimedia Signal Processing. IEEE, 255\u2013258."},{"issue":"3","key":"e_1_3_2_28_2","doi-asserted-by":"crossref","first-page":"127","DOI":"10.5626\/JCSE.2018.12.3.127","article-title":"Vision-based blind spot monitoring using rear-view camera and its real-time implementation in an embedded system","volume":"12","author":"Jung Kyeong Hoon","year":"2018","unstructured":"Kyeong Hoon Jung and Kang Yi. 2018. Vision-based blind spot monitoring using rear-view camera and its real-time implementation in an embedded system. Journal of Computing Science and Engineering 12, 3 (2018), 127\u2013138.","journal-title":"Journal of Computing Science and Engineering"},{"key":"e_1_3_2_29_2","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1007\/978-3-319-09387-1_3","volume-title":"Proceedings of the Advances in Embedded Computer Vision","author":"Nikoli\u0107 Zoran","year":"2014","unstructured":"Zoran Nikoli\u0107. 2014. Embedded vision in advanced driver assistance systems. In Proceedings of the Advances in Embedded Computer Vision. Springer, 45\u201369."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/MWC.001.2000374"},{"key":"e_1_3_2_31_2","doi-asserted-by":"crossref","unstructured":"A. Sinha A. Sharma L. A. P. Melek and D. Caviglia (Eds.). 2023. Smart embedded systems: Advances and applications.","DOI":"10.1201\/9781032628059"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3058014"},{"key":"e_1_3_2_33_2","unstructured":"Antonio Polino Razvan Pascanu and Dan Alistarh. 2018. Model compression via distillation and quantization. arXiv:1802.05668. Retrieved from https:\/\/arxiv.org\/abs\/1802.05668"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3623402"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1201\/9781003162810-13"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2021.3098483"},{"key":"e_1_3_2_37_2","unstructured":"Dayou Du Gu Gong and Xiaowen Chu. 2024. Model quantization and hardware acceleration for vision transformers: A comprehensive survey. arXiv:2405.00314. Retrieved from https:\/\/arxiv.org\/abs\/2405.00314"},{"key":"e_1_3_2_38_2","volume-title":"CUDA by Example: An Introduction to General-purpose GPU Programming","author":"Sanders Jason","year":"2010","unstructured":"Jason Sanders and Edward Kandrot. 2010. CUDA by Example: An Introduction to General-purpose GPU Programming. Addison-Wesley Professional."},{"key":"e_1_3_2_39_2","doi-asserted-by":"crossref","unstructured":"H. Yang S. Shen Z. Liu and Y. Zhao. 2023. cuXCMP: CUDA-accelerated private comparison based on homomorphic encryption. IEEE Transactions on Information Forensics and Security 19 (2023) 3581\u20133592.","DOI":"10.1109\/TIFS.2023.3267677"},{"key":"e_1_3_2_40_2","first-page":"875","volume-title":"Proceedings of the ACM\/IEEE International Conference for High Performance Computing Networking, Storage, and Analysis (SC)","author":"Tang Shanjiang","year":"2016","unstructured":"Shanjiang Tang, BingSheng He, Shuhao Zhang, and Zhaojie Niu. 2016. Elastic multi-resource fairness: balancing fairness and efficiency in coupled CPU-GPU architectures. In Proceedings of the ACM\/IEEE International Conference for High Performance Computing Networking, Storage, and Analysis (SC). 875\u2013886."},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3115630"},{"key":"e_1_3_2_42_2","first-page":"435","volume-title":"Proceedings of the European Conference on Parallel and Distributed Computing (Euro-Par)","author":"Lin Dian-Lun","year":"2021","unstructured":"Dian-Lun Lin and Tsung-Wei Huang. 2021. Efficient GPU computation using task graph parallelism. In Proceedings of the European Conference on Parallel and Distributed Computing (Euro-Par). 435\u2013450."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3104255"},{"key":"e_1_3_2_44_2","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1145\/3502181.3533714","volume-title":"Proceedings of the ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC)","author":"Chiu Cheng-Hsiang","year":"2022","unstructured":"Cheng-Hsiang Chiu and Tsung-Wei Huang. 2022. Composing pipeline parallelism using control taskflow graph. In Proceedings of the ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC). 283\u2013284."},{"key":"e_1_3_2_45_2","first-page":"64","volume-title":"Proceedings of the IEEE International Conference on Parallel and Distributed Systems (ICPADS)","author":"Lin Chun-Xun","year":"2020","unstructured":"Chun-Xun Lin, Tsung-Wei Huang, and Martin DF Wong. 2020. An efficient work-stealing scheduler for task dependency graph. In Proceedings of the IEEE International Conference on Parallel and Distributed Systems (ICPADS). 64\u201371."},{"key":"e_1_3_2_46_2","unstructured":"Chris Leary and Todd Wang.2017. XLA: TensorFlow compiled. (2017)."},{"key":"e_1_3_2_47_2","first-page":"167","article-title":"IOS: Inter-operator scheduler for cnn acceleration","volume":"3","author":"Ding Yaoyao","year":"2021","unstructured":"Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, and Song Han. 2021. IOS: Inter-operator scheduler for cnn acceleration. Machine Learning and Systems 3 (2021), 167\u2013180.","journal-title":"Machine Learning and Systems"},{"key":"e_1_3_2_48_2","volume-title":"Intel VTune Profiler","author":"Corporation Intel","year":"2018","unstructured":"Intel Corporation. 2018. Intel VTune Profiler. Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/developer\/tools\/oneapi\/vtune-profiler.html\/"},{"key":"e_1_3_2_49_2","volume-title":"AMD HIP Documentation","author":"Inc. AMD","year":"2025","unstructured":"AMD Inc.2025. AMD HIP Documentation. Retrieved from https:\/\/rocm.docs.amd.com\/projects\/HIP\/en\/latest\/"},{"key":"e_1_3_2_50_2","unstructured":"([n. d.]). Retrieved from https:\/\/denglinai.com"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_52_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_2_53_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et\u00a0al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_2_54_2","first-page":"234","volume-title":"Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)","author":"Ronneberger Olaf","year":"2015","unstructured":"Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 234\u2013241."},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01042"},{"key":"e_1_3_2_56_2","unstructured":"Black Forest Labs Stephen Batifol Andreas Blattmann Frederic Boesel Saksham Consul Cyril Diagne Tim Dockhorn Jack English Zion English Patrick Esser Sumith Kulal Kyle Lacey Yam Levi Cheng Li Dominik Lorenz Jonas M\u00fcller Dustin Podell Robin Rombach Harry Saini Axel Sauer and Luke Smith. 2025. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. arXiv:2506.15742. Retrieved from https:\/\/arxiv.org\/abs\/2506.15742"}],"container-title":["ACM Transactions on Design Automation of Electronic Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3773039","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,17]],"date-time":"2025-12-17T13:15:19Z","timestamp":1765977319000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3773039"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,17]]},"references-count":55,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,3,31]]}},"alternative-id":["10.1145\/3773039"],"URL":"https:\/\/doi.org\/10.1145\/3773039","relation":{},"ISSN":["1084-4309","1557-7309"],"issn-type":[{"type":"print","value":"1084-4309"},{"type":"electronic","value":"1557-7309"}],"subject":[],"published":{"date-parts":[[2025,12,17]]},"assertion":[{"value":"2025-03-03","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-15","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-12-17","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}