{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T08:04:40Z","timestamp":1759133080745,"version":"3.41.0"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2017,5,26]],"date-time":"2017-05-26T00:00:00Z","timestamp":1495756800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation China","award":["61672048"],"award-info":[{"award-number":["61672048"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2017,11,30]]},"abstract":"<jats:p>Graphics Processing Units (GPUs) have been widely adopted as accelerators for compute-intensive applications due to its tremendous computational power and high memory bandwidth. As the complexity of applications continues to grow, each new generation of GPUs has been equipped with advanced architectural features and more resources to sustain its performance acceleration capability. Recent GPUs have been featured with concurrent kernel execution, which is designed to improve the resource utilization by executing multiple kernels simultaneously. However, it is still a challenge to find a way to manage the resources on GPUs for concurrent kernel execution. Prior works only achieve limited performance improvement as they do not optimize the thread-level parallelism (TLP) and model the resource contention for the concurrently executing kernels.<\/jats:p>\n          <jats:p>In this article, we design an efficient kernel management framework that optimizes the performance for concurrent kernel execution on GPUs. Our kernel management framework contains two key components: TLP modulation and cache bypassing. The TLP modulation is employed to adjust the TLP for the concurrently executing kernels. It consists of three parts: kernel categorization, static TLP modulation, and dynamic TLP modulation. The cache bypassing is proposed to mitigate the cache contention by only allowing a subset of a kernel\u2019s blocks to access the L1 data cache. Experiments indicate that our framework can improve the performance by 1.51 \u00d7 on average (energy-efficiency by 1.39 \u00d7 on average), compared with the default concurrent kernel execution framework.<\/jats:p>","DOI":"10.1145\/3070710","type":"journal-article","created":{"date-parts":[[2017,5,31]],"date-time":"2017-05-31T19:32:40Z","timestamp":1496259160000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Efficient Kernel Management on GPUs"],"prefix":"10.1145","volume":"16","author":[{"given":"Yun","family":"Liang","sequence":"first","affiliation":[{"name":"Peking University, China"}]},{"given":"Xiuhong","family":"Li","sequence":"additional","affiliation":[{"name":"Peking University, China"}]}],"member":"320","published-online":{"date-parts":[[2017,5,26]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168946"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919648"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2012.6402918"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018748"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.11"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2012.18"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2011.5749714"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.30"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.18"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/2342788.2342798"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2597652.2597685"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628101"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304582"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451158"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485951"},{"volume-title":"Proceedings of the 22Nd International Conference on Parallel Architectures and Compilation Techniques (PACT\u201913)","year":"2013","author":"Kayiran Onur","key":"e_1_2_1_17_1"},{"volume-title":"Proceedings of the 2010 43rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO-43)","author":"Lee Jaekyu","key":"e_1_2_1_18_1"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835937"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628107"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485964"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751237"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2015.7054184"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.3850\/9783981537079_0647"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2014.2313342"},{"key":"e_1_2_1_26_1","first-page":"10","article-title":"An efficient framework for cache bypassing on GPUs","volume":"32","author":"Liang Yun","year":"2015","journal-title":"IEEE Trans. Comput.-Aid. Des."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.76"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155656"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2499368.2451160"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750410"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.16"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540718"},{"volume-title":"Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Technical Report","year":"2012","author":"Stratton John A.","key":"e_1_2_1_33_1"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2014.6853208"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751213"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830813"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD.2013.6691165"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056023"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897937.2897989"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3070710","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3070710","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:30:27Z","timestamp":1750217427000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3070710"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,5,26]]},"references-count":39,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2017,11,30]]}},"alternative-id":["10.1145\/3070710"],"URL":"https:\/\/doi.org\/10.1145\/3070710","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2017,5,26]]},"assertion":[{"value":"2016-09-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-05-26","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}