{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:22:06Z","timestamp":1750220526548,"version":"3.41.0"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2021,9,29]],"date-time":"2021-09-29T00:00:00Z","timestamp":1632873600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"EPSRC Centre for Doctoral Training in Pervasive Parallelism","award":["EP\/L01503X\/1"],"award-info":[{"award-number":["EP\/L01503X\/1"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,12,31]]},"abstract":"<jats:p>Existing OS techniques for homogeneous many-core systems make it simple for single and multithreaded applications to migrate between cores. Heterogeneous systems do not benefit so fully from this flexibility, and applications that cannot migrate in mid-execution may lose potential performance. The situation is particularly challenging when a switch of language runtime would be desirable in conjunction with a migration. We present a case study in making heterogeneous CPU + GPU systems more flexible in this respect. Our technique for fine-grained application migration allows switches between OpenMP, OpenCL, and CUDA execution, in conjunction with migrations from GPU to CPU, and CPU to GPU. To achieve this, we subdivide iteration spaces into slices, and consider migration on a slice-by-slice basis. We show that slice sizes can be learned offline by machine learning models. To further improve performance, memory transfers are made migration-aware. The complexity of the migration capability is hidden from programmers behind a high-level programming model. 
We present a detailed evaluation of our mid-kernel migration mechanism with the First Come, First Served scheduling policy. We compare our technique in a focused evaluation scenario against idealized kernel-by-kernel scheduling, which is typical for current systems, and makes perfect kernel-to-device scheduling decisions, but cannot migrate kernels mid-execution. Models show that up to 1.33\u00d7 speedup can be achieved over these systems by adding fine-grained migration. Our experimental results with all nine applicable SHOC and Rodinia benchmarks achieve speedups of up to 1.30\u00d7 (1.08\u00d7 on average) over an implementation of a perfect but kernel-migration-incapable scheduler when migrated to a faster device. Our mechanism and slice size choices introduce an average slowdown of only 2.44% if kernels never migrate. Lastly, our programming model reduces the code size by at least 88% compared with manual implementations of migratable kernels.<\/jats:p>","DOI":"10.1145\/3471909","type":"journal-article","created":{"date-parts":[[2021,9,29]],"date-time":"2021-09-29T10:22:55Z","timestamp":1632910975000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Device Hopping"],"prefix":"10.1145","volume":"18","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1591-6778","authenticated-orcid":false,"given":"Paul","family":"Metzger","sequence":"first","affiliation":[{"name":"School of Informatics, University of Edinburgh, Edinburgh, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Volker","family":"Seeker","sequence":"additional","affiliation":[{"name":"School of Informatics, University of Edinburgh, Edinburgh, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Christian","family":"Fensch","sequence":"additional","affiliation":[{"name":"School of Informatics, University of Edinburgh, Edinburgh, United 
Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Murray","family":"Cole","sequence":"additional","affiliation":[{"name":"School of Informatics, University of Edinburgh, Edinburgh, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,9,29]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-03869-3_80"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ECRTS.2012.15"},{"volume-title":"International Workshop on Performance, Portability and Productivity in HPC. IEEE, 71\u201381","author":"Beckingsale David Alexander","key":"e_1_2_1_3_1","unstructured":"David Alexander Beckingsale , Jason Burmark , Rich Hornung , Holger Jones , William Killian , Adam J. Kunen , Olga Pearce , Peter Robinson , Brian S. Ryujin , and Thomas R. W. Scogland . 2019. RAJA: Portable performance for large-scale scientific applications . In International Workshop on Performance, Portability and Productivity in HPC. IEEE, 71\u201381 . David Alexander Beckingsale, Jason Burmark, Rich Hornung, Holger Jones, William Killian, Adam J. Kunen, Olga Pearce, Peter Robinson, Brian S. Ryujin, and Thomas R. W. Scogland. 2019. RAJA: Portable performance for large-scale scientific applications. In International Workshop on Performance, Portability and Productivity in HPC. IEEE, 71\u201381."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/RTSS46320.2019.00048"},{"key":"e_1_2_1_5_1","unstructured":"OpenMP Architecture Review Board. 2020. OpenMP Application Programming Interface. Version 5.1.  OpenMP Architecture Review Board. 2020. OpenMP Application Programming Interface. 
Version 5.1."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1010933404324"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018748"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037700"},{"volume-title":"USENIX Annual Technical Conference USENIX, 413\u2013425","author":"Chen Yanhao","key":"e_1_2_1_10_1","unstructured":"Yanhao Chen , Ari B. Hayes , Chi Zhang , Timothy Salmon , and Eddy Z. Zhang . 2018. Locality-aware software throttling for sparse matrix operation on GPUs . In USENIX Annual Technical Conference USENIX, 413\u2013425 . Yanhao Chen, Ari B. Hayes, Chi Zhang, Timothy Salmon, and Eddy Z. Zhang. 2018. Locality-aware software throttling for sparse matrix operation on GPUs. In USENIX Annual Technical Conference USENIX, 413\u2013425."},{"volume-title":"International Conference on Parallel Architectures and Compil. Techniques. ACM, 1\u201313","author":"Cho Younghyun","key":"e_1_2_1_11_1","unstructured":"Younghyun Cho , Florian Negele , Seohong Park , Bernhard Egger , and Thomas R. Gross . 2018. On-the-fly workload partitioning for integrated CPU\/GPU architectures . In International Conference on Parallel Architectures and Compil. Techniques. ACM, 1\u201313 . Younghyun Cho, Florian Negele, Seohong Park, Bernhard Egger, and Thomas R. Gross. 2018. On-the-fly workload partitioning for integrated CPU\/GPU architectures. In International Conference on Parallel Architectures and Compil. Techniques. ACM, 1\u201313."},{"key":"e_1_2_1_12_1","unstructured":"Hongsuk Chung Munsik Kang and Hyun-Duk Cho. 2012. Heterogeneous Multi-processing solution of exynos 5 octa with ARM\u00ae big.LITTLE\u2122 Technology. Samsung White Paper.  Hongsuk Chung Munsik Kang and Hyun-Duk Cho. 2012. Heterogeneous Multi-processing solution of exynos 5 octa with ARM\u00ae big.LITTLE\u2122 Technology. 
Samsung White Paper."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155641"},{"key":"e_1_2_1_14_1","volume-title":"Algorithmic Skeletons: Structured Management of Parallel Computation","author":"Cole Murray I.","year":"1989","unstructured":"Murray I. Cole . 1989 . Algorithmic Skeletons: Structured Management of Parallel Computation . MIT Press . Murray I. Cole. 1989. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2768405.2768407"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540737"},{"volume-title":"Workshop on General-Purpose Computation on Graphics Processing Units. ACM, 63\u201374","author":"Danalis Anthony","key":"e_1_2_1_17_1","unstructured":"Anthony Danalis , Gabriel Marin , Collin McCurdy , Jeremy S. Meredith , Philip C. Roth , Kyle Spafford , Vinod Tipparaju , and Jeffrey S. Vetter . 2010. The scalable heterogeneous computing (SHOC) benchmark suite . In Workshop on General-Purpose Computation on Graphics Processing Units. ACM, 63\u201374 . Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. 2010. The scalable heterogeneous computing (SHOC) benchmark suite. In Workshop on General-Purpose Computation on Graphics Processing Units. 
ACM, 63\u201374."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254064.2254120"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/RTSS.2009.46"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451125"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1383422.1383447"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2014.07.003"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCS.2014.47"},{"key":"e_1_2_1_25_1","volume-title":"O\u2019Boyle","author":"Grewe Dominik","year":"2011","unstructured":"Dominik Grewe and Michael F. P . O\u2019Boyle . 2011 . A static task partitioning approach for heterogeneous systems using OpenCL. In International Conference on Compiler Construction. Springer , 286\u2013305. Dominik Grewe and Michael F. P. O\u2019Boyle. 2011. A static task partitioning approach for heterogeneous systems using OpenCL. In International Conference on Compiler Construction. Springer, 286\u2013305."},{"volume-title":"International Symposium on Code Generation and Optimization. IEEE, 1\u201310","author":"Grewe Dominik","key":"e_1_2_1_26_1","unstructured":"Dominik Grewe , Zheng Wang , and Michael F. P . O\u2019Boyle. 2013. Portable mapping of data parallel programs to OpenCL for heterogeneous systems . In International Symposium on Code Generation and Optimization. IEEE, 1\u201310 . Dominik Grewe, Zheng Wang, and Michael F. P. O\u2019Boyle. 2013. Portable mapping of data parallel programs to OpenCL for heterogeneous systems. In International Symposium on Code Generation and Optimization. IEEE, 1\u201310."},{"key":"e_1_2_1_27_1","unstructured":"Khronos SYCL\u2122Working Group. 2020. SYCL\u2122 Specification SYCL\u2122 Integrates OpenCL\u2122Devices with Modern C++. Version 1.2.1.  Khronos SYCL\u2122Working Group. 2020. 
SYCL\u2122 Specification SYCL\u2122 Integrates OpenCL\u2122Devices with Modern C++. Version 1.2.1."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357223.3362714"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2018.10.005"},{"key":"e_1_2_1_30_1","volume-title":"Patterson","author":"Hennessy John L.","year":"2017","unstructured":"John L. Hennessy and David A . Patterson . 2017 . Computer Architecture : A Quantitative Approach (6th ed.). Morgan Kaufmann Publishers . The subsection on Amdahl\u2019s Law in Section 1.9, 49\u201350. John L. Hennessy and David A. Patterson. 2017. Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann Publishers. The subsection on Amdahl\u2019s Law in Section 1.9, 49\u201350."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628088"},{"volume-title":"International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1\u201312","author":"Kambadur Melanie","key":"e_1_2_1_32_1","unstructured":"Melanie Kambadur , Tipp Moseley , Rick Hank , and Martha A. Kim . 2012. Measuring interference between live datacenter applications . In International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1\u201312 . Melanie Kambadur, Tipp Moseley, Rick Hank, and Martha A. Kim. 2012. Measuring interference between live datacenter applications. In International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1\u201312."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/RTSS.2011.13"},{"volume-title":"International Conference on Parallel Architectures and Compilation Techniques. IEEE, 157\u2013166","author":"Kay\u0131ran Onur","key":"e_1_2_1_34_1","unstructured":"Onur Kay\u0131ran , Adwait Jog , Mahmut T. Kandemir , and Chita R. Das . 2013. Neither more nor less: Optimizing thread-level parallelism for GPGPUs . 
In International Conference on Parallel Architectures and Compilation Techniques. IEEE, 157\u2013166 . Onur Kay\u0131ran, Adwait Jog, Mahmut T. Kandemir, and Chita R. Das. 2013. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In International Conference on Parallel Architectures and Compilation Techniques. IEEE, 157\u2013166."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3337821.3337886"},{"volume-title":"Symposium on Principles and Practices of Parallel Programming. ACM, 2\u201311","author":"Kim Seon Wook","key":"e_1_2_1_36_1","unstructured":"Seon Wook Kim , Chong-Liang Ooi , Rudolf Eigenmann , Babak Falsafi , and T. N. Vijaykumar . 2001. Reference idempotency analysis: A framework for optimizing speculative execution . In Symposium on Principles and Practices of Parallel Programming. ACM, 2\u201311 . Seon Wook Kim, Chong-Liang Ooi, Rudolf Eigenmann, Babak Falsafi, and T. N. Vijaykumar. 2001. Reference idempotency analysis: A framework for optimizing speculative execution. In Symposium on Principles and Practices of Parallel Programming. ACM, 2\u201311."},{"key":"e_1_2_1_37_1","volume-title":"Static scheduling algorithms for allocating directed task graphs to multiprocessors. Comput. Surveys 31, 4","author":"Kwok Yu-Kwong","year":"1999","unstructured":"Yu-Kwong Kwok and Ishfaq Ahmad . 1999. Static scheduling algorithms for allocating directed task graphs to multiprocessors. Comput. Surveys 31, 4 ( 1999 ). ACM , 406\u2013471. Yu-Kwong Kwok and Ishfaq Ahmad. 1999. Static scheduling algorithms for allocating directed task graphs to multiprocessors. Comput. Surveys 31, 4 (1999). 
ACM, 406\u2013471."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2749475"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW50202.2020.00012"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669121"},{"volume-title":"International Symposium on Code Generation and Optimization. ACM, 273\u2013283","author":"Pandit Prasanna","key":"e_1_2_1_41_1","unstructured":"Prasanna Pandit and R. Govindarajan . 2014. Fluidic kernels: Cooperative execution of OpenCL programs on multiple heterogeneous devices . In International Symposium on Code Generation and Optimization. ACM, 273\u2013283 . Prasanna Pandit and R. Govindarajan. 2014. Fluidic kernels: Cooperative execution of OpenCL programs on multiple heterogeneous devices. In International Symposium on Code Generation and Optimization. ACM, 273\u2013283."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2967938.2967964"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080256"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254064.2254082"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3319423"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2597652.2597675"},{"volume-title":"Conference on Programming Language Design and Implementation. ACM, 169\u2013180","author":"Sridharan Srinath","key":"e_1_2_1_47_1","unstructured":"Srinath Sridharan , Gagan Gupta , and Gurindar S. Sohi . 2014. Adaptive, efficient, parallel execution of parallel programs . In Conference on Programming Language Design and Implementation. ACM, 169\u2013180 . Srinath Sridharan, Gagan Gupta, and Gurindar S. Sohi. 2014. Adaptive, efficient, parallel execution of parallel programs. In Conference on Programming Language Design and Implementation. 
ACM, 169\u2013180."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.85"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3078633.3081040"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2741948.2741964"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/RTSS.2007.47"},{"volume-title":"Symposium on Principles and Practice of Parallel Programming. ACM, 75\u201384","author":"Wang Zheng","key":"e_1_2_1_52_1","unstructured":"Zheng Wang and Michael F. P . O\u2019Boyle. 2009. Mapping parallelism to multi-cores: A machine learning based approach . In Symposium on Principles and Practice of Parallel Programming. ACM, 75\u201384 . Zheng Wang and Michael F. P. O\u2019Boyle. 2009. Mapping parallelism to multi-cores: A machine learning based approach. In Symposium on Principles and Practice of Parallel Programming. ACM, 75\u201384."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/RTAS.2015.7108420"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3471909","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3471909","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:49Z","timestamp":1750195489000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3471909"}},"subtitle":["Transparent Mid-Kernel Runtime Switching for Heterogeneous 
Systems"],"short-title":[],"issued":{"date-parts":[[2021,9,29]]},"references-count":53,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2021,12,31]]}},"alternative-id":["10.1145\/3471909"],"URL":"https:\/\/doi.org\/10.1145\/3471909","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2021,9,29]]},"assertion":[{"value":"2020-12-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-06-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-09-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}