{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T22:54:17Z","timestamp":1777676057781,"version":"3.51.4"},"reference-count":56,"publisher":"SAGE Publications","issue":"5","license":[{"start":{"date-parts":[[2023,8,11]],"date-time":"2023-08-11T00:00:00Z","timestamp":1691712000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"funder":[{"DOI":"10.13039\/501100006280","name":"Spanish Ministry of Science and Technology","doi-asserted-by":"crossref","award":["PID2019-107255GB"],"award-info":[{"award-number":["PID2019-107255GB"]}],"id":[{"id":"10.13039\/501100006280","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2023,9]]},"abstract":"<jats:p>Hybrid computer systems combine compute units (CUs) of different nature like CPUs, GPUs and FPGAs. Simultaneously exploiting the computing power of these CUs requires a careful decomposition of the applications into balanced parallel tasks according to both the performance of each CU type and the communication costs among them. This paper describes the design and implementation of runtime support for OpenMP hybrid GPU-CPU applications, when mixed with GPU-oriented programming models (e.g. CUDA\/HIP). The paper describes the case for a hybrid multi-level parallelization of the NPB-MZ benchmark suite. The implementation exploits both coarse-grain and fine-grain parallelism, mapped to compute units of different nature (GPUs and CPUs). The paper describes the implementation of runtime support to bridge OpenMP and HIP, introducing the abstractions of Computing Unit and Data Placement. We compare hybrid and non-hybrid executions under state-of-the-art schedulers for OpenMP: static and dynamic task schedulings. Then, we improve the set of schedulers with two additional variants: a memorizing-dynamic task scheduling and a profile-based static task scheduling. On a computing node composed of one AMD EPYC 7742 @ 2.250\u00a0GHz (64 cores and 2 threads\/core, totalling 128 threads per node) and 2 \u00d7 GPU AMD Radeon Instinct MI50 with 32\u00a0GB, hybrid executions present speedups from 1.10\u00d7 up to 3.5\u00d7 with respect to a non-hybrid GPU implementation, depending on the number of activated CUs.<\/jats:p>","DOI":"10.1177\/10943420231188079","type":"journal-article","created":{"date-parts":[[2023,8,11]],"date-time":"2023-08-11T03:20:00Z","timestamp":1691724000000},"page":"626-646","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":8,"title":["Heterogeneous programming using OpenMP and CUDA\/HIP for hybrid CPU-GPU scientific applications"],"prefix":"10.1177","volume":"37","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3780-1106","authenticated-orcid":false,"given":"Marc","family":"Gonzalez Tallada","sequence":"first","affiliation":[{"name":"Computer Architecture Department, Universitat Polit\u00e8cnica de Catalunya-BarcelonaTECH, Barcelona, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2403-8145","authenticated-orcid":false,"given":"Enric","family":"Morancho","sequence":"additional","affiliation":[{"name":"Computer Architecture Department, Universitat Polit\u00e8cnica de Catalunya-BarcelonaTECH, Barcelona, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","published-online":{"date-parts":[[2023,8,11]]},"reference":[{"key":"bibr1-10943420231188079","unstructured":"Abadi M, Agarwal A, Barham P, et al. (2015) TensorFlow: large-scale machine learning on heterogeneous distributed systems. http:\/\/download.tensorflow.org\/paper\/whitepaper2015.pdf."},{"key":"bibr2-10943420231188079","unstructured":"Augonnet C, Thibault S, Namyst R (2010) StarPU: a runtime system for scheduling tasks over accelerator-based multicore machines. Research Report RR-7240, INRIA. https:\/\/hal.inria.fr\/inria-00467677."},{"key":"bibr3-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.1631"},{"key":"bibr4-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2019.03.005"},{"key":"bibr5-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1177\/109434209100500306"},{"key":"bibr6-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1145\/2400682.2400716"},{"key":"bibr7-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2012.58"},{"key":"bibr8-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1007\/BFb0057877"},{"key":"bibr9-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2012.16"},{"issue":"2","key":"bibr10-10943420231188079","volume":"65","author":"Choi HJ","year":"2013","journal-title":"Systems"},{"key":"bibr11-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/PDP.2016.60"},{"key":"bibr12-10943420231188079","unstructured":"der Wijngaart RFV, Jin H (2003) NAS parallel benchmarks, multi-zone versions. Technical Report NAS-03-010. Moffett Field, CA: NASA Ames Research Center."},{"key":"bibr13-10943420231188079","first-page":"733","volume":"25","author":"D\u00fcmmler J","year":"2013","journal-title":"Advances in Parallel Computing"},{"key":"bibr14-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1145\/1088149.1088166"},{"key":"bibr15-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626411000151"},{"key":"bibr16-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-37658-0_7"},{"key":"bibr17-10943420231188079","volume-title":"Hybrid CPU\/GPU FE2 Multi-Scale Implementation Coupling Alya and Micropp","author":"Giuntoli G","year":"2019"},{"key":"bibr18-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2020.3015148"},{"key":"bibr19-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2021.08.001"},{"key":"bibr20-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2020.11.004"},{"key":"bibr21-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/TAP.2013.2258882"},{"key":"bibr22-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.1994.179"},{"key":"bibr23-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15291-7_23"},{"key":"bibr24-10943420231188079","doi-asserted-by":"crossref","volume-title":"An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters","author":"Jacobsen D","DOI":"10.2514\/6.2010-522"},{"key":"bibr25-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2012.10.002"},{"key":"bibr26-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/ICIINFS.2009.5429842"},{"key":"bibr27-10943420231188079","unstructured":"Kraus J (2013) An introduction to cuda-aware mpi. https:\/\/developer.nvidia.com\/blog\/introduction-cuda-aware-mpi\/.71."},{"key":"bibr28-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-21487-5_13"},{"key":"bibr29-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1145\/143095.143134"},{"key":"bibr30-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-9-S2-S10"},{"key":"bibr31-10943420231188079","volume-title":"Load Balancing vs. Locality Management in Shared-Memory Multiprocessors","author":"Markatos EP","year":"1991"},{"key":"bibr32-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/71.273046"},{"key":"bibr33-10943420231188079","volume-title":"MPI: A Message-Passing Interface Standard","author":"Message Passing Interface Forum","year":"1994"},{"key":"bibr34-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1145\/278839"},{"key":"bibr35-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2012.02.006"},{"key":"bibr36-10943420231188079","unstructured":"NVIDIA (2020) GPU-accelerated caffe. https:\/\/www.nvidia.com\/en-gb\/data-center\/gpu-accelerated-applications\/caffe\/."},{"key":"bibr37-10943420231188079","unstructured":"NVIDIA (2023) CUDA toolkit documentation 12.1. https:\/\/docs.nvidia.com\/cuda\/."},{"key":"bibr38-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2008.4536163"},{"key":"bibr39-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1177\/1094342011434065"},{"key":"bibr40-10943420231188079","first-page":"8862123","volume":"2020","author":"Pe\u00f1a AJ","year":"2020","journal-title":"Scientific Programming"},{"key":"bibr41-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1145\/1964218.1964223"},{"key":"bibr42-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2005.1493930"},{"key":"bibr43-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1987.5009495"},{"key":"bibr44-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2012.23"},{"key":"bibr45-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-07518-1_11"},{"key":"bibr46-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/SUPERC.1994.344281"},{"key":"bibr47-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2011.10.011"},{"key":"bibr48-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/71.205655"},{"key":"bibr49-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-3-0_5"},{"key":"bibr50-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/71.569656"},{"key":"bibr51-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2010.12"},{"key":"bibr52-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1016\/j.cpc.2010.06.035"},{"issue":"8","key":"bibr53-10943420231188079","volume":"48","author":"Yang C","year":"2013","journal-title":"Algorithm for Global Atmospheric Simulations"},{"key":"bibr54-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1016\/j.cageo.2021.104760"},{"key":"bibr55-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3066407"},{"key":"bibr56-10943420231188079","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2012.34"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420231188079","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/10943420231188079","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/10943420231188079","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T08:17:32Z","timestamp":1777450652000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/10943420231188079"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,11]]},"references-count":56,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2023,9]]}},"alternative-id":["10.1177\/10943420231188079"],"URL":"https:\/\/doi.org\/10.1177\/10943420231188079","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,8,11]]}}}