{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:28:04Z","timestamp":1750220884836,"version":"3.41.0"},"reference-count":42,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2019,11,19]],"date-time":"2019-11-19T00:00:00Z","timestamp":1574121600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2019,12,31]]},"abstract":"<jats:p>Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize their hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU.<\/jats:p>\n          <jats:p>GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain fewer than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism.<\/jats:p>\n          <jats:p>This article presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. 
Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 5.52X over PThreads running on a 20-core CPU, 1.76X over CUDA-HyperQ, and 1.44X over GeMTC, the state-of-the-art runtime GPU task scheduling system.<\/jats:p>","DOI":"10.1145\/3365657","type":"journal-article","created":{"date-parts":[[2019,11,19]],"date-time":"2019-11-19T13:46:47Z","timestamp":1574171207000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["Pagoda"],"prefix":"10.1145","volume":"6","author":[{"given":"Tsung Tai","family":"Yeh","sequence":"first","affiliation":[{"name":"Purdue University, West Lafayette, IN, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Amit","family":"Sabne","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Putt","family":"Sakdhnagool","sequence":"additional","affiliation":[{"name":"National Electronics and Computer Technology Center"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rudolf","family":"Eigenmann","sequence":"additional","affiliation":[{"name":"University of Delaware"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Timothy G.","family":"Rogers","sequence":"additional","affiliation":[{"name":"Purdue University"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,11,19]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Available: http:\/\/standards.ieee.org\/findstds\/standard\/802.3-2012.html (accessed","author":"IEEE Standards Association","year":"2018","unstructured":"IEEE Standards Association . 2012. 802.3-2012- IEEE Standard for Ethernet . [Online]. Available: http:\/\/standards.ieee.org\/findstds\/standard\/802.3-2012.html (accessed March . 5, 2018 ). IEEE Standards Association. 2012. 802.3-2012-IEEE Standard for Ethernet. [Online]. 
Available: http:\/\/standards.ieee.org\/findstds\/standard\/802.3-2012.html (accessed March. 5, 2018)."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.1631"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2287076.2287090"},{"key":"e_1_2_1_4_1","volume-title":"Proceedings of the 2012 IEEE 26th International Parallel Distributed Processing Symposium (IPDPS). 557--568","author":"Bueno J.","year":"2012","unstructured":"J. Bueno , J. Planas , A. Duran , R. M. Badia , X. Martorell , E. Ayguad\u00e9 , and J. Labarta . 2012. Productive programming of GPU clusters with Ompss . In Proceedings of the 2012 IEEE 26th International Parallel Distributed Processing Symposium (IPDPS). 557--568 . DOI:https:\/\/doi.org\/10.1109\/IPDPS. 2012 .58 10.1109\/IPDPS.2012.58 J. Bueno, J. Planas, A. Duran, R. M. Badia, X. Martorell, E. Ayguad\u00e9, and J. Labarta. 2012. Productive programming of GPU clusters with Ompss. In Proceedings of the 2012 IEEE 26th International Parallel Distributed Processing Symposium (IPDPS). 557--568. DOI:https:\/\/doi.org\/10.1109\/IPDPS.2012.58"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2009.64"},{"key":"e_1_2_1_6_1","first-page":"1","article-title":"46-3: Data encryption standard (DES)","volume":"25","author":"PUB","year":"1999","unstructured":"PUB FIPS. 1999 . 46-3: Data encryption standard (DES) . National Institute of Standards and Technology 25 , 10 (1999), 1 -- 22 . PUB FIPS. 1999. 46-3: Data encryption standard (DES). National Institute of Standards and Technology 25, 10 (1999), 1--22.","journal-title":"National Institute of Standards and Technology"},{"volume-title":"Available: http:\/\/fraqtive.mimec.org\/ (accessed","year":"2019","key":"e_1_2_1_7_1","unstructured":"Fraqtive. 2016. [Online]. Available: http:\/\/fraqtive.mimec.org\/ (accessed January 5, 2019 ). Fraqtive. 2016. [Online]. 
Available: http:\/\/fraqtive.mimec.org\/ (accessed January 5, 2019)."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/1413370.1413373"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/2342788.2342798"},{"volume-title":"Proceedings of the Symposium on Innovative Parallel Computing (InPar\u201912)","author":"Gupta Kunal","key":"e_1_2_1_10_1","unstructured":"Kunal Gupta , Jeff A. Stuart , and John D. Owens . 2012. A study of persistent threads style GPU programming for GPGPU workloads . In Proceedings of the Symposium on Innovative Parallel Computing (InPar\u201912) . IEEE, 1--14. Kunal Gupta, Jeff A. Stuart, and John D. Owens. 2012. A study of persistent threads style GPU programming for GPGPU workloads. In Proceedings of the Symposium on Innovative Parallel Computing (InPar\u201912). IEEE, 1--14."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/GlobalSIP.2014.7032135"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference (USENIXATC\u201911)","author":"Kato Shinpei","year":"2011","unstructured":"Shinpei Kato , Karthik Lakshmanan , Ragunathan Rajkumar , and Yutaka Ishikawa . 2011 . TimeGraph: GPU scheduling for real-time multi-tasking environments . In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference (USENIXATC\u201911) . USENIX Association, Berkeley, CA, 2--2. http:\/\/dl.acm.org\/citation.cfm?id&equals; 2002181.2002183 Shinpei Kato, Karthik Lakshmanan, Ragunathan Rajkumar, and Yutaka Ishikawa. 2011. TimeGraph: GPU scheduling for real-time multi-tasking environments. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference (USENIXATC\u201911). USENIX Association, Berkeley, CA, 2--2. 
http:\/\/dl.acm.org\/citation.cfm?id&equals;2002181.2002183"},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201914)","author":"Kim Sangman","year":"2014","unstructured":"Sangman Kim , Seonggu Huh , Yige Hu , Xinya Zhang , Emmett Witchel , Amir Wated , and Mark Silberstein . 2014 . GPUnet: Networking abstractions for GPU programs . In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201914) . USENIX Association, Berkeley, CA, 201--216. http:\/\/dl.acm.org\/citation.cfm?id&equals;2685048.2685065 Sangman Kim, Seonggu Huh, Yige Hu, Xinya Zhang, Emmett Witchel, Amir Wated, and Mark Silberstein. 2014. GPUnet: Networking abstractions for GPU programs. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI\u201914). USENIX Association, Berkeley, CA, 201--216. http:\/\/dl.acm.org\/citation.cfm?id&equals;2685048.2685065"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/365628.365655"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2600212.2600228"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1137\/1034004"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2011.66"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD.2001.968595"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2517327.2442527"},{"key":"e_1_2_1_20_1","volume-title":"Texture-based Separable Convolution. [Online]. Available: http:\/\/developer.download.nvidia.com\/compute\/DevZone\/C\/html_x64\/Image_Processing.html. (accessed","author":"NVIDIA.","year":"2019","unstructured":"NVIDIA. 2007. Texture-based Separable Convolution. [Online]. Available: http:\/\/developer.download.nvidia.com\/compute\/DevZone\/C\/html_x64\/Image_Processing.html. (accessed January 5, 2019 ). NVIDIA. 2007. 
Texture-based Separable Convolution. [Online]. Available: http:\/\/developer.download.nvidia.com\/compute\/DevZone\/C\/html_x64\/Image_Processing.html. (accessed January 5, 2019)."},{"key":"e_1_2_1_21_1","volume-title":"Available: http:\/\/developer.download.nvidia.com\/compute\/DevZone\/C\/html_x64\/6_Advanced\/simpleHyperQ\/doc\/HyperQ.pdf (accessed","author":"Example NVIDIA.","year":"2019","unstructured":"NVIDIA. 2012. Hyper-Q Example . [Online]. Available: http:\/\/developer.download.nvidia.com\/compute\/DevZone\/C\/html_x64\/6_Advanced\/simpleHyperQ\/doc\/HyperQ.pdf (accessed January 5, 2019 ). NVIDIA. 2012. Hyper-Q Example. [Online]. Available: http:\/\/developer.download.nvidia.com\/compute\/DevZone\/C\/html_x64\/6_Advanced\/simpleHyperQ\/doc\/HyperQ.pdf (accessed January 5, 2019)."},{"key":"e_1_2_1_22_1","volume-title":"The White Paper of Discrete Cosine Transform for 8x8 Blocks with CUDA. [Online]. Available: http:\/\/www.math.uaa.alaska.edu\/&tilde;ssiewert\/a385_doc\/dct8x8.pdf (accessed","author":"NVIDIA.","year":"2019","unstructured":"NVIDIA. 2012. The White Paper of Discrete Cosine Transform for 8x8 Blocks with CUDA. [Online]. Available: http:\/\/www.math.uaa.alaska.edu\/&tilde;ssiewert\/a385_doc\/dct8x8.pdf (accessed January 5, 2019 ). NVIDIA. 2012. The White Paper of Discrete Cosine Transform for 8x8 Blocks with CUDA. [Online]. Available: http:\/\/www.math.uaa.alaska.edu\/&tilde;ssiewert\/a385_doc\/dct8x8.pdf (accessed January 5, 2019)."},{"key":"e_1_2_1_23_1","volume-title":"Available: http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/ (accessed","author":"NVIDIA.","year":"2019","unstructured":"NVIDIA. 2015. CUDA. [Online]. Available: http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/ (accessed January 5, 2019 ). NVIDIA. 2015. CUDA. [Online]. 
Available: http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/ (accessed January 5, 2019)."},{"key":"e_1_2_1_24_1","volume-title":"Available: http:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/ (accessed","author":"NVIDIA.","year":"2019","unstructured":"NVIDIA. 2016. PTX. [Online]. Available: http:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/ (accessed January 5, 2019 ). NVIDIA. 2016. PTX. [Online]. Available: http:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/ (accessed January 5, 2019)."},{"volume-title":"Presented as part of the 14th Workshop on Hot Topics in Operating Systems","author":"Ousterhout Kay","key":"e_1_2_1_25_1","unstructured":"Kay Ousterhout , Aurojit Panda , Joshua Rosen , Shivaram Venkataraman , Reynold Xin , Sylvia Ratnasamy , Scott Shenker , and Ion Stoica . 2013. The case for tiny tasks in compute clusters . In Presented as part of the 14th Workshop on Hot Topics in Operating Systems . USENIX , Berkeley, CA . https:\/\/www.usenix.org\/conference\/hotos13\/case-tiny-tasks-compute-clusters. Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, and Ion Stoica. 2013. The case for tiny tasks in compute clusters. In Presented as part of the 14th Workshop on Hot Topics in Operating Systems. USENIX, Berkeley, CA. https:\/\/www.usenix.org\/conference\/hotos13\/case-tiny-tasks-compute-clusters."},{"volume-title":"Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201913)","author":"Pai Sreepathi","key":"e_1_2_1_26_1","unstructured":"Sreepathi Pai , Matthew J. Thazhuthaveetil , and R. Govindarajan . 2013. Improving GPGPU concurrency with elastic kernels . In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201913) . ACM, New York, 407--418. 
DOI:https:\/\/doi.org\/10.1145\/2451116.2451160 10.1145\/2451116.2451160 Sreepathi Pai, Matthew J. Thazhuthaveetil, and R. Govindarajan. 2013. Improving GPGPU concurrency with elastic kernels. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201913). ACM, New York, 407--418. DOI:https:\/\/doi.org\/10.1145\/2451116.2451160"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201915)","author":"Kyu Park Jason Jong","year":"2015","unstructured":"Jason Jong Kyu Park , Yongjun Park , and Scott Mahlke . 2015 . Chimera: Collaborative preemption for multitasking on a shared GPU . In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201915) . ACM, New York, 593--606. DOI:https:\/\/doi.org\/10.1145\/2694344.2694346 10.1145\/2694344.2694346 Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2015. Chimera: Collaborative preemption for multitasking on a shared GPU. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201915). ACM, New York, 593--606. DOI:https:\/\/doi.org\/10.1145\/2694344.2694346"},{"volume-title":"Parallel Programming in C with MPI and OpenMP","author":"Quinn Michael J.","key":"e_1_2_1_28_1","unstructured":"Michael J. Quinn . 2003. Parallel Programming in C with MPI and OpenMP . McGraw-Hill Education Group . Michael J. Quinn. 2003. Parallel Programming in C with MPI and OpenMP. 
McGraw-Hill Education Group."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/1996130.1996160"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2464996.2465023"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2465829.2465830"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451169"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/155332.155334"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1006\/jpdc.1999.1596"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/2665671.2665702"},{"volume-title":"Compiler Construction","author":"Thies William","key":"e_1_2_1_36_1","unstructured":"William Thies , Michal Karczmarek , and Saman Amarasinghe . 2002. StreamIt: A language for streaming applications . In Compiler Construction . Springer , 179--196. William Thies, Michal Karczmarek, and Saman Amarasinghe. 2002. StreamIt: A language for streaming applications. In Compiler Construction. Springer, 179--196."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2008.5214359"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/GreenCom-CPSCom.2010.102"},{"key":"e_1_2_1_39_1","first-page":"1","article-title":"Simultaneous multikernel: Fine-grained sharing of GPGPUs","volume":"99","author":"Wang Z.","year":"2015","unstructured":"Z. Wang , J. Yang , R. Melhem , B. Childers , Y. Zhang , and M. Guo . 2015 . Simultaneous multikernel: Fine-grained sharing of GPGPUs . IEEE Computer Architecture Letters PP , 99 (2015), 1 -- 1 . DOI:https:\/\/doi.org\/10.1109\/LCA.2015.2477405 10.1109\/LCA.2015.2477405 Z. Wang, J. Yang, R. Melhem, B. Childers, Y. Zhang, and M. Guo. 2015. Simultaneous multikernel: Fine-grained sharing of GPGPUs. IEEE Computer Architecture Letters PP, 99 (2015), 1--1. 
DOI:https:\/\/doi.org\/10.1109\/LCA.2015.2477405","journal-title":"IEEE Computer Architecture Letters PP"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-60368-9_19"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370858"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2013.257"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3365657","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3365657","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:44:21Z","timestamp":1750203861000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3365657"}},"subtitle":["A GPU Runtime System for Narrow Tasks"],"short-title":[],"issued":{"date-parts":[[2019,11,19]]},"references-count":42,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2019,12,31]]}},"alternative-id":["10.1145\/3365657"],"URL":"https:\/\/doi.org\/10.1145\/3365657","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"type":"print","value":"2329-4949"},{"type":"electronic","value":"2329-4957"}],"subject":[],"published":{"date-parts":[[2019,11,19]]},"assertion":[{"value":"2018-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-11-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}