{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,22]],"date-time":"2026-03-22T22:42:17Z","timestamp":1774219337435,"version":"3.50.1"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2025,3,21]],"date-time":"2025-03-21T00:00:00Z","timestamp":1742515200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["U23A6007, 62122053"],"award-info":[{"award-number":["U23A6007, 62122053"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,3,31]]},"abstract":"<jats:p>Random walk is a powerful tool for large-scale graph learning, but its high computational demand presents a challenge. While GPUs can accelerate random walk tasks, current frameworks fail to fully utilize GPU parallelism due to memory-to-compute bandwidth imbalance. In this article, CoWalker, an efficient GPU framework, is proposed to facilitate concurrent execution of random walks for high overall throughput. CoWalker features three novel designs. First, it incorporates a multi-level execution model that effectively orchestrates diverse walk tasks and reduces GPU stalls based on multiple graph characteristics. Second, it collaboratively manages graph data and streaming multiprocessors to minimize memory access interference and maximize core utilization under concurrent tasks. Finally, a multi-dimensional scheduler selects compatible random walk task combinations based on memory footprints to achieve maximum throughput. 
CoWalker significantly improves throughput over state-of-the-art baselines by mitigating concurrency overheads and effectively harnessing GPU parallelism. Our extensive evaluations on real-world workloads demonstrate that CoWalker achieves 2.75\u00d7 higher overall system throughput compared with commercial tools and 1.56\u00d7 over the SOTA academic system.<\/jats:p>","DOI":"10.1145\/3711820","type":"journal-article","created":{"date-parts":[[2025,1,10]],"date-time":"2025-01-10T11:21:26Z","timestamp":1736508086000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Enhancing High-Throughput GPU Random Walks Through Multi-Task Concurrency Orchestration"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4062-3558","authenticated-orcid":false,"given":"Cheng","family":"Xu","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6218-4659","authenticated-orcid":false,"given":"Chao","family":"Li","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4372-7851","authenticated-orcid":false,"given":"Xiaofeng","family":"Hou","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-1956-1242","authenticated-orcid":false,"given":"Junyi","family":"Mei","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0547-6780","authenticated-orcid":false,"given":"Jing","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3704-1530","authenticated-orcid":false,"given":"Pengyu","family":"Wang","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4060-9438","authenticated-orcid":false,"given":"Shixuan","family":"Sun","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0034-2302","authenticated-orcid":false,"given":"Minyi","family":"Guo","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-4936-1534","authenticated-orcid":false,"given":"Baoping","family":"Hao","sequence":"additional","affiliation":[{"name":"Alibaba Inc, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,3,21]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"2023. Ali Cloud. Retrieved March 03 2023 from https:\/\/www.alibabacloud.com\/zh"},{"key":"e_1_3_2_3_2","unstructured":"2023. Bytedance. Retrieved March 03 2023 from https:\/\/www.bytedance.com\/en\/"},{"key":"e_1_3_2_4_2","unstructured":"2023. Facebook. 
Retrieved March 03 2023 from http:\/\/www.facebook.com\/"},{"key":"e_1_3_2_5_2","unstructured":"2023. Google. Retrieved March 03 2023 from http:\/\/www.google.com\/"},{"key":"e_1_3_2_6_2","volume-title":"Proceedings of the 9th IEEE Conference on Visual Analytics Science and Technology, IEEE VAST 2014","author":"Amor-Amoros Albert","year":"2014","unstructured":"Albert Amor-Amoros, Paolo Federico, and Silvia Miksch. 2014. TimeGraph: A data management framework for visual analytics of large multivariate time-oriented networks. In Proceedings of the 9th IEEE Conference on Visual Analytics Science and Technology, IEEE VAST 2014."},{"key":"e_1_3_2_7_2","volume-title":"Proceedings of the 13th International World Wide Web Conference, WWW 2004","author":"Boldi Paolo","year":"2004","unstructured":"Paolo Boldi and Sebastiano Vigna. 2004. The WebGraph framework I: Compression techniques. In Proceedings of the 13th International World Wide Web Conference, WWW 2004."},{"key":"e_1_3_2_8_2","volume-title":"Proceedings of the 32nd IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2020","author":"Boughzala Dorra","year":"2020","unstructured":"Dorra Boughzala, Laurent Lef\u00e8vre, and Anne-C\u00e9cile Orgerie. 2020. Predicting the energy consumption of CUDA kernels using SimGrid. In Proceedings of the 32nd IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2020."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476159"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.13"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.14778\/3587136.3587140"},{"key":"e_1_3_2_12_2","volume-title":"Proceedings of the WAW","author":"Fogaras D.","year":"2004","unstructured":"D. Fogaras and B. R\u00e1cz. 2004. Towards scaling fully personalized PageRank. 
In Proceedings of the WAW."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.14778\/3384345.3384358"},{"key":"e_1_3_2_14_2","unstructured":"Duane Merrill and Andrew Grimshaw. 2009. Parallel scan for stream architectures. University of Virginia Department of Computer Science Charlottesville VA USA Technical Report CS2009-14 4 (2009)."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939754"},{"key":"e_1_3_2_16_2","volume-title":"Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017","author":"Han Wei","year":"2017","unstructured":"Wei Han, Daniel Mawhirter, Bo Wu, and Matthew Buland. 2017. Graphie: Large-scale asynchronous graph traversals on just a GPU. In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, PACT 2017."},{"key":"e_1_3_2_17_2","volume-title":"Proceedings of the 37th International Symposium on Computer Architecture, ISCA 2010","author":"Hong Sunpyo","year":"2010","unstructured":"Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th International Symposium on Computer Architecture, ISCA 2010."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.2018.00015"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00082"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/IWQoS61813.2024.10682859"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3652604"},{"key":"e_1_3_2_22_2","volume-title":"Proceedings of the 2019 ACM\/IEEE 46th Annual International Symposium on Computer Architecture (ISCA)","author":"Imani Mohsen","year":"2019","unstructured":"Mohsen Imani, Saransh Gupta, Yeseong Kim, and Tajana Rosing. 2019. FloatPIM: In-memory acceleration of deep neural network training with high precision. 
In Proceedings of the 2019 ACM\/IEEE 46th Annual International Symposium on Computer Architecture (ISCA)."},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Abhinav Jangda Sandeep Polisetty Arjun Guha and Marco Serafini. 2021. Accelerating graph sampling for graph machine learning using GPUs. In EuroSys\u201921: Sixteenth European Conference on Computer Systems Antonio Barbalace Pramod Bhatotia Lorenzo Alvisi and Cristian Cadar (Eds.). Online Event United Kingdom April 26-28 2021 ACM 311\u2013326.","DOI":"10.1145\/3447786.3456244"},{"key":"e_1_3_2_24_2","article-title":"ReHy: A ReRAM-based digital\/analog hybrid PIM architecture for accelerating CNN training","author":"Jin Hai","year":"2022","unstructured":"Hai Jin, Cong Liu, Haikun Liu, Ruikun Luo, Jiahong Xu, Fubing Mao, and Xiaofei Liao. 2022. ReHy: A ReRAM-based digital\/analog hybrid PIM architecture for accelerating CNN training. IEEE Transactions on Parallel and Distributed Systems 33, 11 (2022), 2872\u20132884.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"e_1_3_2_25_2","volume-title":"Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016","author":"Kim Min-Soo","year":"2016","unstructured":"Min-Soo Kim, Kyuhyeon An, Himchan Park, Hyunseok Seo, and Jinwook Kim. 2016. GTS: A fast and scalable graph processing method based on streaming topology to GPUs. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016."},{"key":"e_1_3_2_26_2","volume-title":"Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012","author":"Kyrola Aapo","year":"2012","unstructured":"Aapo Kyrola, Guy E. Blelloch, and Carlos Guestrin. 2012. GraphChi: Large-scale graph computation on just a PC. 
In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3589334.3645405"},{"key":"e_1_3_2_28_2","article-title":"PowerWalk: Scalable personalized PageRank via random walks with vertex-centric decomposition","author":"Liu Q.","year":"2016","unstructured":"Q. Liu, Zhenguo Li, J. Lui, and Jiefeng Cheng. 2016. PowerWalk: Scalable personalized PageRank via random walks with vertex-centric decomposition. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management.","journal-title":"In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management."},{"key":"e_1_3_2_29_2","volume-title":"Proceedings of the USENIX Annual Technical Conference, USENIX ATC 2017","author":"Ma Lingxiao","year":"2017","unstructured":"Lingxiao Ma, Zhi Yang, Han Chen, Jilong Xue, and Yafei Dai. 2017. Garaph: Efficient GPU-accelerated graph processing on a single machine with balanced replication. In Proceedings of the USENIX Annual Technical Conference, USENIX ATC 2017."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807184"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.14778\/3659437.3659438"},{"key":"e_1_3_2_32_2","article-title":"Nvidia. Multi-Instance GPU","unstructured":"Nvidia. 2022. Nvidia. Multi-Instance GPU. Retrieved April 1, 2022 from https:\/\/docs.nvidia.com\/cuda\/mig\/index.html","journal-title":"https:\/\/docs.nvidia.com\/cuda\/mig\/index.html"},{"key":"e_1_3_2_33_2","unstructured":"NVIDIA Corporation. 2019. Multi-process Service."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.2017.40"},{"key":"e_1_3_2_35_2","article-title":"CongraPlus: Towards efficient processing of concurrent graph queries on NUMA machines","author":"Pan Peitian","year":"2019","unstructured":"Peitian Pan, Chao Li, and Minyi Guo. 2019. 
CongraPlus: Towards efficient processing of concurrent graph queries on NUMA machines. IEEE Transactions on Parallel and Distributed Systems 30, 9 (2019), 1990\u20132002.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"e_1_3_2_36_2","doi-asserted-by":"crossref","unstructured":"Santosh Pandey Lingda Li Adolfy Hoisie Xiaoye S. Li and Hang Liu. 2020. C-SAW: A framework for graph sampling and random walk on GPUs. In SC20: International Conference for High Performance Computing Networking Storage and Analysis IEEE 1\u201315.","DOI":"10.1109\/SC41405.2020.00060"},{"key":"e_1_3_2_37_2","article-title":"Dynamic resource management for efficient utilization of multitasking GPUs","author":"Park Jason Jong Kyu","year":"2017","unstructured":"Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2017. Dynamic resource management for efficient utilization of multitasking GPUs. ACM SIGPLAN Notices (2017), 527\u2013540.","journal-title":"ACM SIGPLAN Notices"},{"key":"e_1_3_2_38_2","volume-title":"Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","author":"Shafiee Ali","unstructured":"Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the 2016 ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)."},{"key":"e_1_3_2_39_2","article-title":"ThunderRW: An in-memory graph random walk engine","author":"Sun Shixuan","year":"2021","unstructured":"Shixuan Sun, Yuhang Chen, Shengliang Lu, Bingsheng He, and Yuchen Li. 2021. ThunderRW: An in-memory graph random walk engine. 
Proceedings of the VLDB Endowment 14, 11 (2021), 1992\u20132005.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_3_2_40_2","article-title":"Billion-scale commodity embedding for e-commerce recommendation in Alibaba","author":"Wang Jizhe","year":"2018","unstructured":"Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, B. Zhao, and D. Lee. 2018. Billion-scale commodity embedding for e-commerce recommendation in Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.","journal-title":"In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining."},{"key":"e_1_3_2_41_2","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201924)","author":"Wang Jing","year":"2024","unstructured":"Jing Wang, Hanzhang Yang, Chao Li, Yiming Zhuansun, Wang Yuan, Cheng Xu, Xiaofeng Hou, Minyi Guo, Yang Hu, and Yaqian Zhao. 2024. Boosting data center performance via intelligently managed multi-backend disaggregated memory. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201924). IEEE, Article 37, 18 pages."},{"key":"e_1_3_2_42_2","unstructured":"Minjie Wang Lingfan Yu Da Zheng Quan Gan Yu Gai Zihao Ye Mufei Li Jinjing Zhou Qi Huang Chao Ma et\u00a0al. 2019. Deep graph library: Towards efficient and scalable deep learning on graphs. In Proceedings of the ICLR Workshop on Representation Learning on Graphs and Manifolds."},{"key":"e_1_3_2_43_2","volume-title":"Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques, PACT 2021","author":"Wang Pengyu","year":"2021","unstructured":"Pengyu Wang, Chao Li, Jing Wang, Taolei Wang, Lu Zhang, Jingwen Leng, Quan Chen, and Minyi Guo. 2021. Skywalker: Efficient alias-method-based graph sampling and random walk on GPUs. 
In Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques, PACT 2021."},{"issue":"2","key":"e_1_3_2_44_2","article-title":"Grus: Toward unified-memory-efficient high-performance graph processing on GPU","volume":"18","author":"Wang Pengyu","year":"2021","unstructured":"Pengyu Wang, Jing Wang, Chao Li, Jianzong Wang, Haojin Zhu, and Minyi Guo. 2021. Grus: Toward unified-memory-efficient high-performance graph processing on GPU. ACM Transactions on Architecture and Code Optimization 18, 2 (2021).","journal-title":"ACM Transactions on Architecture and Code Optimization"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2023.3251860"},{"key":"e_1_3_2_46_2","volume-title":"Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019","author":"Wang Pengyu","year":"2019","unstructured":"Pengyu Wang, Lu Zhang, Chao Li, and Minyi Guo. 2019. Excavating the potential of GPU for accelerating graph traversal. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019."},{"key":"e_1_3_2_47_2","volume-title":"Proceedings of the USENIX Annual Technical Conference 2020","author":"Wang Rui","year":"2020","unstructured":"Rui Wang, Y. Li, H. Xie, Yinlong Xu, and J. Lui. 2020. GraphWalker: An I\/O-efficient and resource-friendly graph analytic system for fast and scalable random walks. In Proceedings of the USENIX Annual Technical Conference 2020."},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3582016.3582025"},{"key":"e_1_3_2_49_2","volume-title":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015","author":"Wang Yangzihao","year":"2015","unstructured":"Yangzihao Wang, Andrew A. Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. 2015. Gunrock: A high-performance graph processing library on the GPU. 
In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015."},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2020.2978386"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3572848.3577482"},{"key":"e_1_3_2_52_2","article-title":"Processing concurrent graph analytics with decoupled computation model","author":"Xue Jilong","year":"2017","unstructured":"Jilong Xue, Zhi Yang, Shian Hou, and Yafei Dai. 2017. Processing concurrent graph analytics with decoupled computation model. IEEE Transactions on Computers (2017).","journal-title":"IEEE Transactions on Computers"},{"key":"e_1_3_2_53_2","volume-title":"Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, HPDC\u201914","author":"Xue Jilong","year":"2014","unstructured":"Jilong Xue, Zhi Yang, Zhi Qu, Shian Hou, and Yafei Dai. 2014. Seraph: An efficient, low-cost system for concurrent graph processing. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, HPDC\u201914."},{"key":"e_1_3_2_54_2","volume-title":"Proceedings of the 12th IEEE International Conference on Data Mining, ICDM 2012","author":"Yang Jaewon","year":"2012","unstructured":"Jaewon Yang and Jure Leskovec. 2012. Defining and evaluating network communities based on ground-truth. In Proceedings of the 12th IEEE International Conference on Data Mining, ICDM 2012."},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3492321.3519557"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477132.3483575"},{"key":"e_1_3_2_57_2","volume-title":"Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019","author":"Yang Ke","year":"2019","unstructured":"Ke Yang, Mingxing Zhang, Kang Chen, Xiaosong Ma, Yang Bai, and Yong Jiang. 2019. KnightKing: A fast distributed graph random walk engine. 
In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019."},{"key":"e_1_3_2_58_2","volume-title":"Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2017","author":"Yeh Tsung Tai","year":"2017","unstructured":"Tsung Tai Yeh, Amit Sabne, Putt Sakdhnagool, Rudolf Eigenmann, and Timothy G. Rogers. 2017. Pagoda: Fine-grained GPU resource virtualization for narrow tasks. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2017."},{"key":"e_1_3_2_59_2","article-title":"Toward QoS-awareness and improved utilization of spatial multitasking GPUs","author":"Zhang Wei","year":"2022","unstructured":"Wei Zhang, Quan Chen, Ningxin Zheng, Weihao Cui, Kaihua Fu, and Minyi Guo. 2022. Toward QoS-awareness and improved utilization of spatial multitasking GPUs. IEEE Transactions on Computers 71, 4 (2022), 866\u2013879.","journal-title":"IEEE Transactions on Computers"},{"key":"e_1_3_2_60_2","volume-title":"Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018","author":"Zhang Yu","year":"2018","unstructured":"Yu Zhang, Xiaofei Liao, Hai Jin, Lin Gu, Ligang He, Bingsheng He, and Haikun Liu. 2018. CGraph: A correlations-aware approach for efficient concurrent iterative graph processing. In Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018."},{"key":"e_1_3_2_61_2","volume-title":"Proceedings of the SC\u201921: The International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Zhao Jin","year":"2021","unstructured":"Jin Zhao, Yu Zhang, Xiaofei Liao, Ligang He, Bingsheng He, Hai Jin, and Haikun Liu. 2021. LCCG: A locality-centric hardware accelerator for high throughput of concurrent graph processing. 
In Proceedings of the SC\u201921: The International Conference for High Performance Computing, Networking, Storage and Analysis."},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356143"},{"key":"e_1_3_2_63_2","volume-title":"Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems","author":"Zhao Xia","year":"2020","unstructured":"Xia Zhao, Magnus Jahre, and Lieven Eeckhout. 2020. HSM: A hybrid slowdown model for multitasking GPUs. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems."},{"key":"e_1_3_2_64_2","article-title":"Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling","author":"Zhong Jianlong","year":"2014","unstructured":"Jianlong Zhong and Bingsheng He. 2014. Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling. IEEE Transactions on Parallel and Distributed Systems 25, 6 (2014), 1522\u20131532.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711820","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3711820","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:19:15Z","timestamp":1750295955000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711820"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,21]]},"references-count":63,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,3,31]]}},"alternative-id":["10.1145\/3711820"],"URL":"https:\/\/doi.org\/10.1145\/3711820","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,21]]},"assertion":[{"value":"2024-09-09","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-12-07","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-21","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}