{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:05:11Z","timestamp":1750309511328,"version":"3.41.0"},"reference-count":34,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,12,10]],"date-time":"2024-12-10T00:00:00Z","timestamp":1733788800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency","award":["FA8650-18-2-7860"],"award-info":[{"award-number":["FA8650-18-2-7860"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2025,1,31]]},"abstract":"<jats:p>In this study, we introduce a methodology for automatically transforming user applications written in C\/C++ to a parallel representation consisting of coarse-grained tasks based on dynamic profiling. Such a parallel representation is suitable for mapping applications onto heterogeneous SoCs. We present our approach for instrumenting the user application binary during the compilation process with parallel primitives that enable the runtime system to schedule and execute independent computation-intensive coarse-grained tasks concurrently. We use the proposed compilation and code transformation methodology to retarget each application for execution on a heterogeneous SoC composed of processor cores and accelerators. We demonstrate the capabilities of our integrated compile time and runtime flow through task-level parallelization and functionally correct execution of real-world applications in the communication systems and radar processing domains. We demonstrate the functionality of our integrated system by executing six distinct applications with different degrees of parallelism on four different platforms: an eight-core general-purpose processor, a heterogeneous SoC simulator, and two heterogeneous SoCs utilizing the Xilinx Zynq UltraScale+ FPGA and the Nvidia Jetson AGX board. Our integrated approach offers a path forward for application developers to take full advantage of the target SoC without requiring users to become hardware or parallel programming experts.<\/jats:p>","DOI":"10.1145\/3704635","type":"journal-article","created":{"date-parts":[[2024,11,15]],"date-time":"2024-11-15T10:18:07Z","timestamp":1731665887000},"page":"1-32","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Coarse-Grained Task Parallelization by Dynamic Profiling for Heterogeneous SoC-Based Embedded System"],"prefix":"10.1145","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-2067-0306","authenticated-orcid":false,"given":"Liangliang","family":"Chang","sequence":"first","affiliation":[{"name":"School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8163-1191","authenticated-orcid":false,"given":"Serhan","family":"Gener","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering Department, The University of Arizona, Tucson, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1066-5578","authenticated-orcid":false,"given":"Joshua","family":"Mack","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering Department, The University of Arizona, Tucson, United States"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-5398-5708","authenticated-orcid":false,"given":"Hasan Umut","family":"Suluhan","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering Department, The University of Arizona, Tucson, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7982-8991","authenticated-orcid":false,"given":"Ali","family":"Akoglu","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering Department, The University of Arizona, Tucson, United States"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9859-7778","authenticated-orcid":false,"given":"Chaitali","family":"Chakrabarti","sequence":"additional","affiliation":[{"name":"School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, United States"}]}],"member":"320","published-online":{"date-parts":[[2024,12,10]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Mart\u00edn Abadi Paul Barham Jianmin Chen Zhifeng Chen Andy Davis Jeffrey Dean Matthieu Devin Sanjay Ghemawat Geoffrey Irving Michael Isard Manjunath Kudlur Josh Levenberg Rajat Monga Sherry Moore Derek G. Murray Benoit Steiner Paul Tucker Vijay Vasudevan Pete Warden Martin Wicke Yuan Yu and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (Savannah GA USA) (OSDI\u201916). USENIX Association USA 265\u2013283."},{"issue":"8","key":"e_1_3_1_3_2","first-page":"1248","article-title":"DS3: A system-level domain-specific system-on-chip simulation framework","volume":"69","author":"Arda Samet E.","year":"2020","unstructured":"Samet E. Arda, Anish Krishnakumar, A. Alper Goksoy, Nirmal Kumbhare, Joshua Mack, Anderson L. Sartor, Ali Akoglu, Radu Marculescu, and Umit Y. Ogras. 2020. DS3: A system-level domain-specific system-on-chip simulation framework. IEEE Trans. Comput. 69, 8 (2020), 1248\u20131262.","journal-title":"IEEE Trans. Comput."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.1631"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/183432.183527"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","unstructured":"D. W. Bliss T. Ajayi A. Akoglu I. Aliyev T. Basaklar L. Belayneh D. Blaauw J. Brunhaver C. Chakrabarti L. Chang K.-Y. Chen M.-H. Chen X. Chen A. R. Chiriyath A. Daftardar R. Dreslinski A. Dutta A. J. Farcas Y. Fu A. Goksoy X. He Md. S. Hassan A. Herschfelt J. Holtom H.-S. Kim A. N. Krishnakumar Y. Li O. Ma J. Mack S. Mallik S. K. Mandal R. Marculescu B. McCall T. Mudge U. Y. Ogras V. Pandey S. Siddiqui Y.-H. Sun A. Venkataramani X. Wei B. R. Willis H. Yu and Y. Yue. 2022. Enabling software-defined RF convergence with a novel coarse-scale heterogeneous processor. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS). 443\u2013447. 10.1109\/ISCAS48785.2022.9937602","DOI":"10.1109\/ISCAS48785.2022.9937602"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.5555\/263953"},{"key":"e_1_3_1_8_2","first-page":"913","volume-title":"2022 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA\/BDCloud\/SocialCom\/SustainCom\u201922)","author":"Chang Liangliang","year":"2022","unstructured":"Liangliang Chang, Joshua Mack, Benjamin Willis, Xing Chen, John Brunhaver, Ali Akoglu, and Chaitali Chakrabarti. 2022. Profile-guided parallel task extraction and execution for domain specific heterogeneous SoC. In 2022 IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA\/BDCloud\/SocialCom\/SustainCom\u201922). IEEE, 913\u2013920."},{"key":"e_1_3_1_9_2","first-page":"204","volume-title":"2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM\u201921)","author":"Chi Yuze","year":"2021","unstructured":"Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, and Jason Cong. 2021. Extending high-level synthesis for task-parallel programs. In 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM\u201921). IEEE, 204\u2013213."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/99.660313"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2019.01.006"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358312"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/2968456.2968459"},{"issue":"07","key":"e_1_3_1_14_2","doi-asserted-by":"crossref","first-page":"478","DOI":"10.1109\/TC.1981.1675827","article-title":"Trace scheduling: A technique for global microcode compaction","volume":"30","author":"Fisher Joseph A.","year":"1981","unstructured":"Joseph A. Fisher. 1981. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers 30, 07 (1981), 478\u2013490.","journal-title":"IEEE Transactions on Computers"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/1993316.1993553"},{"key":"e_1_3_1_16_2","doi-asserted-by":"crossref","first-page":"437","DOI":"10.1109\/MICRO.2012.47","volume-title":"2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Ketterlin Alain","year":"2012","unstructured":"Alain Ketterlin and Philippe Clauss. 2012. Profiling data-dependence to assist parallelization: Framework, scope, and optimization. In 2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE, 437\u2013448."},{"key":"e_1_3_1_17_2","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1109\/MICRO56248.2022.00017","volume-title":"2022 55th IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201922)","author":"Khan Tanvir Ahmed","year":"2022","unstructured":"Tanvir Ahmed Khan, Muhammed Ugur, Krishnendra Nathella, Dam Sunwoo, Heiner Litz, Daniel A Jim\u00e9nez, and Baris Kasikci. 2022. Whisper: Profile-guided branch misprediction elimination for data center applications. In 2022 55th IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201922). IEEE, 19\u201334."},{"key":"e_1_3_1_18_2","first-page":"734","volume-title":"2021 ACM\/IEEE 48th Annual International Symposium on Computer Architecture (ISCA\u201921)","author":"Khan Tanvir Ahmed","year":"2021","unstructured":"Tanvir Ahmed Khan, Dexin Zhang, Akshitha Sriraman, Joseph Devietti, Gilles Pokam, Heiner Litz, and Baris Kasikci. 2021. Ripple: Profile-guided instruction cache replacement for data center applications. In 2021 ACM\/IEEE 48th Annual International Symposium on Computer Architecture (ISCA\u201921). IEEE, 734\u2013747."},{"key":"e_1_3_1_19_2","first-page":"535","volume-title":"2010 43rd IEEE\/ACM International Symposium on Microarchitecture","author":"Kim Minjang","year":"2010","unstructured":"Minjang Kim, Hyesoon Kim, and Chi-Keung Luk. 2010. SD3: A scalable approach to dynamic data-dependence profiling. In 2010 43rd IEEE\/ACM International Symposium on Microarchitecture. IEEE, 535\u2013546."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3178487.3178493"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2004.1281665"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3529257"},{"issue":"5","key":"e_1_3_1_23_2","first-page":"48:1\u201348:26","article-title":"SEAMS: Self-optimizing runtime manager for approximate memory hierarchies","volume":"20","author":"Maity Biswadip","year":"2021","unstructured":"Biswadip Maity, Bryan Donyanavard, Anmol Surhonne, Amir Rahmani, Andreas Herkersdorf, and Nikil Dutt. 2021. SEAMS: Self-optimizing runtime manager for approximate memory hierarchies. ACM Transactions on Embedded Computing Systems 20, 5 (July2021), 48:1\u201348:26. DOI:https:\/\/doi.org\/10\/gm3hnz","journal-title":"ACM Transactions on Embedded Computing Systems"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3358203"},{"key":"e_1_3_1_25_2","first-page":"2","volume-title":"2019 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201919)","author":"Panchenko Maksim","year":"2019","unstructured":"Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni. 2019. BOLT: A practical binary optimizer for data centers and beyond. In 2019 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201919). IEEE, 2\u201314."},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/2499370.2462176"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3391899"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3571133"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2907493"},{"key":"e_1_3_1_30_2","article-title":"Automated parallel kernel extraction from dynamic application traces","author":"Uhrie Richard","year":"2020","unstructured":"Richard Uhrie, Chaitali Chakrabarti, and John Brunhaver. 2020. Automated parallel kernel extraction from dynamic application traces. arXiv preprint arXiv:2001.09995 (2020).","journal-title":"arXiv preprint arXiv:2001.09995"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3612937"},{"issue":"1","key":"e_1_3_1_32_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2579561","article-title":"Integrating profile-driven parallelism detection and machine-learning-based mapping","volume":"11","author":"Wang Zheng","year":"2014","unstructured":"Zheng Wang, Georgios Tournavitis, Bj\u00f6rn Franke, and Michael F. P. O\u2019boyle. 2014. Integrating profile-driven parallelism detection and machine-learning-based mapping. ACM Transactions on Architecture and Code Optimization (TACO) 11, 1 (2014), 1\u201326.","journal-title":"ACM Transactions on Architecture and Code Optimization (TACO)"},{"volume-title":"ZCU102 Evaluation Board","key":"e_1_3_1_33_2","unstructured":"Xilinx ZCU102 [n. d.]. ZCU102 Evaluation Board. Retrieved May 3, 2024 from https:\/\/docs.amd.com\/v\/u\/en-US\/ug1182-zcu102-eval-bd"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3580394"},{"key":"e_1_3_1_35_2","doi-asserted-by":"crossref","first-page":"999","DOI":"10.1145\/3620665.3640396","volume-title":"29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2","author":"Zhang Yuxuan","year":"2024","unstructured":"Yuxuan Zhang, Nathan Sobotka, Soyoon Park, Saba Jamilan, Tanvir Ahmed Khan, Baris Kasikci, Gilles A. Pokam, Heiner Litz, and Joseph Devietti. 2024. RPG2: Robust profile-guided runtime prefetch generation. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 999\u20131013."}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3704635","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3704635","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:17:58Z","timestamp":1750295878000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3704635"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,12,10]]},"references-count":34,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1,31]]}},"alternative-id":["10.1145\/3704635"],"URL":"https:\/\/doi.org\/10.1145\/3704635","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2024,12,10]]},"assertion":[{"value":"2024-05-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-06","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-12-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}