{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,21]],"date-time":"2026-04-21T23:27:32Z","timestamp":1776814052006,"version":"3.51.2"},"reference-count":33,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2020,3,4]],"date-time":"2020-03-04T00:00:00Z","timestamp":1583280000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"The Research Grants Council of Hong Kong","award":["Grant 106160098"],"award-info":[{"award-number":["Grant 106160098"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2020,3,31]]},"abstract":"<jats:p>As a critical computing resource in multiuser systems such as supercomputers, data centers, and cloud services, a GPU contains multiple compute units (CUs). GPU multitasking is an intuitive solution to underutilization in GPGPU computing. Recently proposed solutions for multitasking GPUs fall into two categories: (1) spatially partitioned sharing (SPS), which coexecutes different kernels on disjoint sets of CUs, and (2) simultaneous multikernel (SMK), which runs multiple kernels simultaneously within a CU. Compared to SPS, SMK can improve resource utilization even further by interleaving instructions from kernels with low dynamic resource contention.<\/jats:p>\n          <jats:p>\n            However, SMK is hard to implement on current GPU architectures, because (1) techniques for applying SMK on top of the GPU hardware scheduling policy are scarce and (2) finding an efficient SMK scheme is difficult due to the complex interference among concurrently executed kernels. In this article, we propose a lightweight and effective performance model to evaluate the complex interference of SMK. 
Built on the probability of independent events, our performance model approaches the problem from a new angle and requires only a few parameters. We then propose a metric,\n            <jats:italic>symbiotic factor<\/jats:italic>\n            , which evaluates an SMK scheme so that kernels with complementary resource utilization can corun within a CU. We also analyze the advantages and disadvantages of the kernel slicing and kernel stretching techniques and integrate them to apply SMK on real GPUs rather than simulators. We validate our model on 18 benchmarks. Compared to optimized hardware-based concurrent kernel execution, whose kernel launch order yields fast execution times, corunning kernel pairs show 11%, 18%, and 12% speedup on AMD R9 290X, RX 480, and Vega 64, respectively, on average. Compared to the Warped-Slicer, the results show 29%, 18%, and 51% speedup on AMD R9 290X, RX 480, and Vega 64, respectively, on average.\n          <\/jats:p>","DOI":"10.1145\/3377138","type":"journal-article","created":{"date-parts":[[2020,3,4]],"date-time":"2020-03-04T12:50:12Z","timestamp":1583326212000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["A Model-Based Software Solution for Simultaneous Multiple Kernels on GPUs"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0372-5504","authenticated-orcid":false,"given":"Hao","family":"Wu","sequence":"first","affiliation":[{"name":"The University of Hong Kong, Hong Kong"}]},{"given":"Weizhi","family":"Liu","sequence":"additional","affiliation":[{"name":"The University of Hong Kong, Hong Kong"}]},{"given":"Huanxin","family":"Lin","sequence":"additional","affiliation":[{"name":"The University of Hong Kong, Hong Kong"}]},{"given":"Cho-Li","family":"Wang","sequence":"additional","affiliation":[{"name":"The University of Hong Kong, Hong 
Kong"}]}],"member":"320","published-online":{"date-parts":[[2020,3,4]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.2014.6974717"},{"key":"e_1_2_2_2_1","unstructured":"AMD. [n.d.]. CodeXL. http:\/\/gpuopen.com\/compute-product\/codexl\/."},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jcp.2008.01.047"},{"key":"e_1_2_2_4_1","volume-title":"Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 105--114","author":"Baghsorkhi Sara S."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00027"},{"key":"e_1_2_2_7_1","doi-asserted-by":"crossref","unstructured":"S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar\u201912). 1--10. DOI:https:\/\/doi.org\/10.1109\/InPar.2012.6339595","DOI":"10.1109\/InPar.2012.6339595"},{"key":"e_1_2_2_8_1","first-page":"8","article-title":"The OpenCL specification","volume":"1","author":"Khronos OpenCL Working Group et al.","year":"2008","journal-title":"Version"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555815.1555775"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2016.14"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1735688.1735696"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2597917.2597925"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2014.2313342"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3326124"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3177964"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2012.7476485"},{"key":"e_1_2_2_17_1","volume-title":"Proceedings of the 2016 International Symposium on Code Generation and Optimization. ACM, 82--93","author":"Margiolas Christos"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2549523"},{"key":"e_1_2_2_19_1","unstructured":"Nvidia. [n.d.]. CUDA C Programming Guide. http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html."},{"key":"e_1_2_2_20_1","unstructured":"Nvidia. [n.d.]. Tuning CUDA Applications for Kepler. https:\/\/docs.nvidia.com\/cuda\/kepler-tuning-guide\/index.html."},{"key":"e_1_2_2_21_1","volume-title":"Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems. 
ACM, 407--418","author":"Pai Sreepathi"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037707"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056031"},{"key":"e_1_2_2_24_1","first-page":"1","article-title":"GPUfs: Integrating a file system with GPUs","volume":"41","author":"Silberstein Mark","year":"2013","journal-title":"SIGARCH Comput. Arch. News"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145819"},{"key":"e_1_2_2_26_1","volume-title":"Proceedings of the 2011 IEEE International Parallel and Distributed Processing Symposium. IEEE Computer Society, 1068--1079","author":"Jeff","year":"2011"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.5555\/2537857.2537861"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00030"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2016.7446078"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.19"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.29"},{"key":"e_1_2_2_32_1","volume-title":"Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture. 
IEEE Computer Society, 382--393","author":"Zhang Yao","year":"2014"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2013.257"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3377138","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3377138","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:23:38Z","timestamp":1750202618000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3377138"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,4]]},"references-count":33,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,3,31]]}},"alternative-id":["10.1145\/3377138"],"URL":"https:\/\/doi.org\/10.1145\/3377138","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,3,4]]},"assertion":[{"value":"2019-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-03-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}