{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,17]],"date-time":"2026-01-17T11:42:35Z","timestamp":1768650155115,"version":"3.49.0"},"reference-count":59,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2019,6,17]],"date-time":"2019-06-17T00:00:00Z","timestamp":1560729600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["1618509"],"award-info":[{"award-number":["1618509"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2019,9,30]]},"abstract":"<jats:p>Contemporary GPUs support multiple kernels to run concurrently on the same streaming multiprocessors (SMs). Recent studies have demonstrated that such concurrent kernel execution (CKE) improves both resource utilization and computational throughput. Most of the prior works focus on partitioning the GPU resources at the cooperative thread array (CTA) level or the warp scheduler level to improve CKE. However, significant performance slowdown and unfairness are observed when latency-sensitive kernels co-run with bandwidth-intensive ones. The reason is that bandwidth over-subscription from bandwidth-intensive kernels leads to much aggravated memory access latency, which is highly detrimental to latency-sensitive kernels. 
Even among bandwidth-intensive kernels, more intensive kernels may unfairly consume much higher bandwidth than less-intensive ones.<\/jats:p>\n          <jats:p>In this article, we first make a case that such problems cannot be sufficiently solved by managing CTA combinations alone and reveal the fundamental reasons. Then, we propose a coordinated approach for CTA combination and bandwidth partitioning. Our approach dynamically detects co-running kernels as latency sensitive or bandwidth intensive. As both the DRAM bandwidth and L2-to-L1 Network-on-Chip (NoC) bandwidth can be the critical resource, our approach partitions both bandwidth resources coordinately along with selecting proper CTA combinations. The key objective is to allocate more CTA resources for latency-sensitive kernels and more NoC\/DRAM bandwidth resources to NoC-\/DRAM-intensive kernels. We achieve it using a variation of dominant resource fairness (DRF). Compared with two state-of-the-art CKE optimization schemes, SMK [52] and WS [55], our approach improves the average harmonic speedup by 78% and 39%, respectively. 
Even compared to the best possible CTA combinations, which are obtained from an exhaustive search among all possible CTA combinations, our approach improves the harmonic speedup by up to 51% and 11% on average.<\/jats:p>","DOI":"10.1145\/3326124","type":"journal-article","created":{"date-parts":[[2019,6,18]],"date-time":"2019-06-18T12:14:26Z","timestamp":1560860066000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution"],"prefix":"10.1145","volume":"16","author":[{"given":"Zhen","family":"Lin","sequence":"first","affiliation":[{"name":"North Carolina State University, Raleigh, NC, USA"}]},{"given":"Hongwen","family":"Dai","sequence":"additional","affiliation":[{"name":"North Carolina State University, Raleigh, NC, USA"}]},{"given":"Michael","family":"Mantor","sequence":"additional","affiliation":[{"name":"Advanced Micro Devices, Orlando, FL, USA"}]},{"given":"Huiyang","family":"Zhou","sequence":"additional","affiliation":[{"name":"North Carolina State University, Raleigh, NC, USA"}]}],"member":"320","published-online":{"date-parts":[[2019,6,17]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the IEEE International Symposium on High-Performance Comp Architecture. 1--12","author":"Adriaens J. T.","unstructured":"J. T. Adriaens, K. Compton, N. S. Kim, and M. J. Schulte. 2012. The case for GPGPU spatial multitasking. In Proceedings of the IEEE International Symposium on High-Performance Comp Architecture. 
1--12."},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the 2014 IEEE 32nd International Conference on Computer Design (ICCD\u201914)","author":"Aguilera P.","unstructured":"P. Aguilera, K. Morrow, and N. S. Kim. 2014. Fair share: Allocation of GPU resources for both performance and fairness. In Proceedings of the 2014 IEEE 32nd International Conference on Computer Design (ICCD\u201914). 440--447."},{"key":"e_1_2_1_3_1","volume-title":"AMD Graphics Cores Next (GCN) Architecture White Paper","author":"AMD.","year":"2012","unstructured":"AMD. 2012. AMD Graphics Cores Next (GCN) Architecture White Paper. AMD Corporation (2012)."},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the 2012 39th Annual International Symposium on Computer Architecture (ISCA\u201912)","author":"Ausavarungnirun R.","unstructured":"R. Ausavarungnirun, K. K. W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu. 2012. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. In Proceedings of the 2012 39th Annual International Symposium on Computer Architecture (ISCA\u201912). 416--427."},
{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT\u201915)","author":"Ausavarungnirun R.","unstructured":"R. Ausavarungnirun, S. Ghose, O. Kayiran, G. H. Loh, C. R. Das, M. T. Kandemir, and O. Mutlu. 2015. Exploiting inter-warp heterogeneity to improve GPGPU performance. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT\u201915). 25--38."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.50"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2003.07.004"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037700"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2872362.2872368"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA\u201918)","author":"Dai H.","unstructured":"H. Dai, Z. Lin, C. Li, C. Zhao, F. Wang, N. Zheng, and H. Zhou. 2018. Accelerate GPU concurrent kernel execution by mitigating memory pipeline stalls. In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA\u201918)."},
{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the 2009 42nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909)","author":"Das R.","unstructured":"R. Das, O. Mutlu, T. Moscibroda, and C. R. Das. 2009. Application-aware prioritization mechanisms for on-chip networks. In Proceedings of the 2009 42nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909). 280--291."},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA\u201910)","author":"Das Reetuparna","unstructured":"Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das. 2010. A\u00e9rgia: Exploiting packet latency slack in on-chip networks. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA\u201910). ACM, New York, NY, 106--116."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1736020.1736058"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2008.44"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI\u201911)","author":"Ghodsi Ali","year":"2011","unstructured":"Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. 2011. Dominant resource fairness: Fair allocation of multiple resource types. 
In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI\u201911). USENIX Association, Berkeley, CA, 323--336."},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the Innovative Parallel Computing (InPar\u201912)","author":"Grauer-Gray S.","unstructured":"S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In Proceedings of the Innovative Parallel Computing (InPar\u201912). 1--10."},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the 2011 38th Annual International Symposium on Computer Architecture (ISCA\u201911)","author":"Grot B.","unstructured":"B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu. 2011. Kilo-NOC: A heterogeneous network-on-chip architecture for scalability and service guarantees. In Proceedings of the 2011 38th Annual International Symposium on Computer Architecture (ISCA\u201911)."},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 2009 42nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909)","author":"Grot B.","unstructured":"B. Grot, S. W. Keckler, and O. Mutlu. 2009. 
Preemptive virtual clock: A flexible, efficient, and cost-effective QOS scheme for networks-on-chip. In Proceedings of the 2009 42nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909)."},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the 2015 52nd ACM\/EDAC\/IEEE Design Automation Conference (DAC\u201915)","author":"Jang H.","unstructured":"H. Jang, J. Kim, P. Gratz, Ki Hwan Yum, and E. J. Kim. 2015. Bandwidth-efficient on-chip interconnect designs for GPGPUs. In Proceedings of the 2015 52nd ACM\/EDAC\/IEEE Design Automation Conference (DAC\u201915). 1--6."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.5555\/2738600.2738602"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588768.2576780"},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS\u201915)","author":"Jog Adwait","unstructured":"Adwait Jog, Onur Kayiran, Tuba Kesten, Ashutosh Pattnaik, Evgeny Bolotin, Niladrish Chatterjee, Stephen W. Keckler, Mahmut T. Kandemir, and Chita R. Das. 2015. Anatomy of GPU memory system for multi-application execution. In Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS\u201915). 
ACM, New York, NY, 12."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2896377.2901468"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCOM.1987.1096719"},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the 47th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO-47)","author":"Kayiran Onur","unstructured":"Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, and Chita R. Das. 2014. Managing GPU concurrency in heterogeneous architectures. In Proceedings of the 47th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Los Alamitos, CA, 114--126."},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the 16th International Symposium on High-Performance Computer Architecture.","author":"Kim Y.","unstructured":"Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. 2010. ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers. 
In Proceedings of the 16th International Symposium on High-Performance Computer Architecture."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.51"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/L-CA.2011.32"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168947"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2008.31"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2355585.2355590"},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the 2016 Design, Automation Test in Europe Conference Exhibition (DATE\u201916)","author":"Li X.","unstructured":"X. Li and Y. Liang. 2016. Efficient kernel management on GPUs. In Proceedings of the 2016 Design, Automation Test in Europe Conference Exhibition (DATE\u201916). 85--90."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3070710"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3177964"},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the 2001 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201901)","author":"Luo Kun","unstructured":"Kun Luo, J. Gummaraju, and M. Franklin. 2001. Balancing thoughput and fairness in SMT processors. In Proceedings of the 2001 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201901). 
164--171."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2549523"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.40"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155656"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2006.24"},{"key":"e_1_2_1_41_1","volume-title":"Whitepaper: NVIDIA\u2019s next generation CUDA compute architecture: Kepler GK110.","author":"NVIDIA.","year":"2014","unstructured":"NVIDIA. 2014. Whitepaper: NVIDIA\u2019s next generation CUDA compute architecture: Kepler GK110. (2014)."},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201913)","author":"Pai Sreepathi","unstructured":"Sreepathi Pai, Matthew J. Thazhuthaveetil, and R. Govindarajan. 2013. Improving GPGPU concurrency with elastic kernels. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201913). ACM, New York, NY, 407--418."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037707"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2006.49"},
{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT\u201907)","author":"Rafique N.","unstructured":"N. Rafique, W. T. Lim, and M. Thottethodi. 2007. Effective management of DRAM bandwidth in multicore processors. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques (PACT\u201907). 245--258."},{"key":"e_1_2_1_47_1","volume-title":"Proceedings of the 2015 48th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915)","author":"Subramanian L.","unstructured":"L. Subramanian, V. Seshadri, A. Ghosh, S. Khan, and O. Mutlu. 2015. The application slowdown model: Quantifying and controlling the impact of inter-application interference at shared caches and main memory. In Proceedings of the 2015 48th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915). 62--75."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522356"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/SBAC-PAD.2014.43"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2847255"},{"key":"e_1_2_1_51_1","volume-title":"Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA\u201918)","author":"Wang H.","unstructured":"H. Wang, F. 
Luo, M. Ibrahim, O. Kayiran, and A. Jog. 2018. Efficient and fair multi-programming in GPUs via effective bandwidth management. In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA\u201918)."},{"key":"e_1_2_1_52_1","volume-title":"Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA\u201916)","author":"Wang Z.","unstructured":"Z. Wang, J. Yang, R. Melhem, B. Childers, Y. Zhang, and M. Guo. 2016. Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. In Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA\u201916). 358--369."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080203"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/2751205.2751213"},{"key":"e_1_2_1_55_1","volume-title":"Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA\u201916)","author":"Xu Q.","unstructured":"Q. Xu, H. Jeon, K. Kim, W. W. Ro, and M. Annavaram. 2016. Warped-slicer: Efficient intra-SM slicing through dynamic resource partitioning for GPU multiprogramming. In Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA\u201916). 
230--242."},{"key":"e_1_2_1_56_1","volume-title":"Proceedings of the 2009 42nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909)","author":"Yuan G. L.","unstructured":"G. L. Yuan, A. Bakhoda, and T. M. Aamodt. 2009. Complexity effective memory access scheduling for many-core accelerator architectures. In Proceedings of the 2009 42nd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201909)."},{"key":"e_1_2_1_57_1","volume-title":"Proceedings of the 49th International Symposium on Microarchitecture (MICRO\u201916)","author":"Zhan J.","unstructured":"J. Zhan, O. Kay\u0131ran, G. H. Loh, C. R. Das, and Y. Xie. 2016. OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures. 
In Proceedings of the 49th International Symposium on Microarchitecture (MICRO\u201916)."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2013.257"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.53"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/2786572.2786596"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3326124","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3326124","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:53:08Z","timestamp":1750204388000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3326124"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,6,17]]},"references-count":59,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2019,9,30]]}},"alternative-id":["10.1145\/3326124"],"URL":"https:\/\/doi.org\/10.1145\/3326124","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,6,17]]},"assertion":[{"value":"2018-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-06-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}