{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,16]],"date-time":"2026-05-16T15:50:42Z","timestamp":1778946642235,"version":"3.51.4"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2022,9,30]],"date-time":"2022-09-30T00:00:00Z","timestamp":1664496000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Samsung Electronics Co., Ltd.","award":["IO201210-07941-01"],"award-info":[{"award-number":["IO201210-07941-01"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2022,9,30]]},"abstract":"<jats:p>As deep learning inference applications are increasing in embedded devices, an embedded device tends to equip neural processing units (NPUs) in addition to a multi-core CPU and a GPU. NVIDIA Jetson AGX Xavier is an example. For fast and efficient development of deep learning applications, TensorRT is provided as the SDK for high-performance inference, including an optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. Like most deep learning frameworks, TensorRT assumes that the inference is executed on a single processing element, GPU or NPU, not both. In this article, we present a TensorRT-based framework supporting various optimization parameters to accelerate a deep learning application targeted on an NVIDIA Jetson embedded platform with heterogeneous processors, including multi-threading, pipelining, buffer assignment, and network duplication. Since the design space of allocating layers to diverse processing elements and optimizing other parameters is huge, we devise a parameter optimization methodology that consists of a heuristic for balancing pipeline stages among heterogeneous processors and fine-tuning the process for optimizing parameters. With nine real-life benchmarks, we could achieve 101%~680% performance improvement and up to 55% energy reduction over the baseline inference using a GPU only.<\/jats:p>","DOI":"10.1145\/3508391","type":"journal-article","created":{"date-parts":[[2022,1,26]],"date-time":"2022-01-26T18:20:38Z","timestamp":1643221238000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":93,"title":["TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9585-3369","authenticated-orcid":false,"given":"Eunjin","family":"Jeong","sequence":"first","affiliation":[{"name":"Seoul National University, Gwanak-ro, Seoul, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jangryul","family":"Kim","sequence":"additional","affiliation":[{"name":"Seoul National University, Gwanak-ro, Seoul, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Soonhoi","family":"Ha","sequence":"additional","affiliation":[{"name":"Seoul National University, Gwanak-ro, Seoul, Republic of Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,8]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Mart\u00edn Abadi Paul Barham Jianmin Chen Zhifeng Chen Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Geoffrey Irving Michael Isard Manjunath Kudlur Josh Levenberg Rajat Monga Sherry Moore Derek G. Murray Benoit Steiner Paul Tucker Vijay Vasudevan Pete Warden Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916) . 265\u2013283."},{"key":"e_1_3_2_3_2","unstructured":"Alexey Bochkovskiy Chien-Yao Wang and Hong-Yuan Mark Liao. 2020. YOLOv4: Optimal Speed and Accuracy of Object Detection. (2020). arXiv:cs.CV\/2004.10934. Retrieved from https:\/\/arxiv.org\/abs\/2004.10934."},{"key":"e_1_3_2_4_2","first-page":"578","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et\u00a0al. 2018. {TVM}: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918). 578\u2013594."},{"key":"e_1_3_2_5_2","unstructured":"CodaLab. 2021. Retrieved July 1 2021 from https:\/\/competitions.codalab.org\/."},{"key":"e_1_3_2_6_2","unstructured":"Densenet201+Yolo. 2020. Retrieved July 1 2021 from https:\/\/github.com\/AlexeyAB\/darknet\/."},{"key":"e_1_3_2_7_2","volume-title":"Patterns of Enterprise Application Architecture","author":"Fowler Martin","year":"2012","unstructured":"Martin Fowler. 2012. Patterns of Enterprise Application Architecture. Addison-Wesley."},{"key":"e_1_3_2_8_2","unstructured":"Google TensorFlow Lite. 2021. Retrieved July 1 2021 from https:\/\/www.tensorflow.org\/lite\/."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/2968455.2968511"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3081333.3081360"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/LES.2021.3087707"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE.2018.8342102"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3240765.3240786"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2977496"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPSN.2016.7460664"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2973801"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3081333.3081359"},{"key":"e_1_3_2_19_2","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1007\/978-3-030-60939-9_2","volume-title":"Embedded Computer Systems: Architectures, Modeling, and Simulation","author":"Minakova Svetlana","year":"2020","unstructured":"Svetlana Minakova, Erqian Tang, and Todor Stefanov. 2020. Combining task- and data-level parallelism for high-throughput CNN inference on embedded CPUs-GPUs MPSoCs. In Embedded Computer Systems: Architectures, Modeling, and Simulation, Alex Orailoglu, Matthias Jung, and Marc Reichenbach (Eds.). Springer International Publishing, Cham, 18\u201335."},{"key":"e_1_3_2_20_2","volume-title":"International Conference on Learning Representations","author":"Mirhoseini Azalia","year":"2018","unstructured":"Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner, Quoc V. Le, and Jeff Dean. 2018. A hierarchical model for device placement. In International Conference on Learning Representations."},{"key":"e_1_3_2_21_2","unstructured":"NVIDIA Jetson Platforms. 2021. Retrieved July 1 2021 from https:\/\/www.nvidia.com\/en-us\/autonomous-machines\/embedded-systems\/."},{"key":"e_1_3_2_22_2","unstructured":"NVIDIA TensorRT. 2021. Retrieved July 1 2021 from https:\/\/developer.nvidia.com\/tensorrt\/."},{"key":"e_1_3_2_23_2","unstructured":"Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury Gregory Chanan Trevor Killeen Zeming Lin Natalia Gimelshein Luca Antiga Alban Desmaison Andreas Kopf Edward Yang Zachary DeVito Martin Raison Alykhan Tejani Sasank Chilamkurthy Benoit Steiner Lu Fang Junjie Bai and Soumith Chintala. 2019. PyTorch: An imperative style high-performance deep learning library. Advances in Neural Information Processing Systems 32."},{"key":"e_1_3_2_24_2","volume-title":"31st Euromicro Conference on Real-Time Systems (ECRTS\u201919)","volume":"23","author":"Pujol Roger","year":"2019","unstructured":"Roger Pujol, Hamid Tabani, Leonidas Kosmidis, Enrico Mezzetti, Jaume Abella Ferrer, and Francisco J. Cazorla. 2019. Generating and exploiting deep learning variants to increase heterogeneous resource utilization in the NVIDIA Xavier. In 31st Euromicro Conference on Real-Time Systems (ECRTS\u201919), Vol. 23."},{"key":"e_1_3_2_25_2","unstructured":"S. Rallapalli H. Qiu A. J. Bency S. Karthikeyan R. Govindan B. S. Manjunath and R. Urgaonkar. 2016. Are Very Deep Neural Networks Feasible on Mobile Devices? Technical Report. University of Southern California."},{"key":"e_1_3_2_26_2","unstructured":"Joseph Redmon. 2013\u20132016. Darknet: Open Source Neural Networks in C. Retrieved February 14 2021 from http:\/\/pjreddie.com\/darknet\/."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.690"},{"key":"e_1_3_2_28_2","article-title":"Yolov3: An incremental improvement","author":"Redmon Joseph","year":"2018","unstructured":"Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv:1804.02767","journal-title":"arXiv:1804.02767"},{"key":"e_1_3_2_29_2","article-title":"Scheduling computation graphs of deep learning models on manycore CPUs","author":"Tang Linpeng","year":"2018","unstructured":"Linpeng Tang, Yida Wang, Theodore L. Willke, and Kai Li. 2018. Scheduling computation graphs of deep learning models on manycore CPUs. arXiv:1807.09667","journal-title":"arXiv:1807.09667"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/ETFA46521.2020.9212130"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01283"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00203"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2019.2944584"},{"key":"e_1_3_2_34_2","first-page":"392","volume-title":"IEEE Real-Time Systems Symposium (RTSS\u201919)","author":"Xiang Yecheng","year":"2019","unstructured":"Yecheng Xiang and Hyoseung Kim. 2019. Pipelined data-parallel CPU\/GPU scheduling for multi-DNN real-time inference. In IEEE Real-Time Systems Symposium (RTSS\u201919). IEEE, 392\u2013405."},{"key":"e_1_3_2_35_2","article-title":"DAC-SDC low power object detection challenge for UAV applications","author":"Xu Xiaowei","year":"2021","unstructured":"Xiaowei Xu, Xinyi Zhang, Bei Yu, X Sharon Hu, Christopher Rowen, Jingtong Hu, and Yiyu Shi. 2021. DAC-SDC low power object detection challenge for UAV applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 2 (2021), 392\u2013403.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_36_2","first-page":"305","volume-title":"IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS\u201919)","author":"Yang Ming","year":"2019","unstructured":"Ming Yang, Shige Wang, Joshua Bakita, Thanh Vu, F. Donelson Smith, James H. Anderson, and Jan-Michael Frahm. 2019. Re-thinking CNN frameworks for time-sensitive autonomous-driving applications: Addressing an industrial challenge. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS\u201919). IEEE, 305\u2013317."},{"key":"e_1_3_2_37_2","first-page":"190","volume-title":"IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS\u201918)","author":"Zhou Husheng","year":"2018","unstructured":"Husheng Zhou, Soroush Bateni, and Cong Liu. 2018. S^3DNN: Supervised streaming and scheduling for GPU-accelerated real-time DNN workloads. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS\u201918). IEEE, 190\u2013201."}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3508391","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3508391","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:36Z","timestamp":1750182576000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3508391"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,9,30]]},"references-count":36,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2022,9,30]]}},"alternative-id":["10.1145\/3508391"],"URL":"https:\/\/doi.org\/10.1145\/3508391","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"value":"1539-9087","type":"print"},{"value":"1558-3465","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,9,30]]},"assertion":[{"value":"2021-07-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-12-26","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-10-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}