{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,16]],"date-time":"2026-05-16T04:00:20Z","timestamp":1778904020733,"version":"3.51.4"},"reference-count":70,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2021,2,9]],"date-time":"2021-02-09T00:00:00Z","timestamp":1612828800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key Research & Development Program of China","award":["2018YFB1003503"],"award-info":[{"award-number":["2018YFB1003503"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61972247"],"award-info":[{"award-number":["61972247"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,6,30]]},"abstract":"<jats:p>Today\u2019s GPU graph processing frameworks face scalability and efficiency issues as the graph size exceeds GPU-dedicated memory limit. Although recent GPUs can over-subscribe memory with Unified Memory (UM), they incur significant overhead when handling graph-structured data. In addition, many popular processing frameworks suffer sub-optimal efficiency due to heavy atomic operations when tracking the active vertices. This article presents Grus, a novel system framework that allows GPU graph processing to stay competitive with the ever-growing graph complexity. Grus improves space efficiency through a UM trimming scheme tailored to the data access behaviors of graph workloads. It also uses a lightweight frontier structure to further reduce atomic operations. 
With easy-to-use interface that abstracts the above details, Grus shows up to 6.4\u00d7 average speedup over the state-of-the-art in-memory GPU graph processing framework. It allows one to process large graphs of 5.5 billion edges in seconds with a single GPU.<\/jats:p>","DOI":"10.1145\/3444844","type":"journal-article","created":{"date-parts":[[2021,2,10]],"date-time":"2021-02-10T14:29:54Z","timestamp":1612967394000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":42,"title":["Grus"],"prefix":"10.1145","volume":"18","author":[{"given":"Pengyu","family":"Wang","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"given":"Jing","family":"Wang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"given":"Chao","family":"Li","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"given":"Jianzong","family":"Wang","sequence":"additional","affiliation":[{"name":"Ping An Technology, Guangdong, China"}]},{"given":"Haojin","family":"Zhu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"given":"Minyi","family":"Guo","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2021,2,9]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Andy Adinets. 2014. CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics | NVIDIA Developer Blog. Retrieved from https:\/\/developer.nvidia.com\/blog\/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics\/.  Andy Adinets. 2014. CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics | NVIDIA Developer Blog. 
Retrieved from https:\/\/developer.nvidia.com\/blog\/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics\/."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/3273982.3273986"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173169"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018756"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2017.93"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/988672.988752"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201919)","author":"Brokhman Tanya","year":"2019","unstructured":"Tanya Brokhman , Pavel Lifshits , and Mark Silberstein . 2019 . GAIA: An OS page cache for heterogeneous systems . In Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201919) . 661--674. Tanya Brokhman, Pavel Lifshits, and Mark Silberstein. 2019. GAIA: An OS page cache for heterogeneous systems. In Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201919). 661--674."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-016-0448-z"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_10_1","volume-title":"Keckler","author":"Choukse Esha","year":"2019","unstructured":"Esha Choukse , Michael B. Sullivan , Mike O\u2019Connor , Mattan Erez , Jeff Pool , David W. Nellans , and Stephen W . Keckler . 2019 . Buddy compression: Enabling larger memory for deep learning and HPC workloads on GPUs. Retrieved from https:\/\/arxiv:1903.02596. Esha Choukse, Michael B. Sullivan, Mike O\u2019Connor, Mattan Erez, Jeff Pool, David W. Nellans, and Stephen W. Keckler. 2019. Buddy compression: Enabling larger memory for deep learning and HPC workloads on GPUs. 
Retrieved from https:\/\/arxiv:1903.02596."},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU\u201910)","author":"Danalis Anthony","unstructured":"Anthony Danalis , Gabriel Marin , Collin McCurdy , Jeremy S. Meredith , Philip C. Roth , Kyle Spafford , Vinod Tipparaju , and Jeffrey S. Vetter . 2010. The scalable heterogeneous computing (SHOC) benchmark suite . In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU\u201910) . Association for Computing Machinery, New York, NY, 63--74. DOI:https:\/\/doi.org\/10.1145\/1735688.1735702 10.1145\/1735688.1735702 Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. 2010. The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU\u201910). Association for Computing Machinery, New York, NY, 63--74. DOI:https:\/\/doi.org\/10.1145\/1735688.1735702"},{"key":"e_1_2_1_12_1","unstructured":"Advanced Micro Devices. 2017. Radeon\u2019s next-generation Vega architecture. Retrieved from https:\/\/www.techpowerup.com\/gpu-specs\/docs\/amd-vega-architecture.pdf.  Advanced Micro Devices. 2017. Radeon\u2019s next-generation Vega architecture. Retrieved from https:\/\/www.techpowerup.com\/gpu-specs\/docs\/amd-vega-architecture.pdf."},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the International Symposium on Memory Management (ISMM\u201914)","author":"Egielski Ian J.","unstructured":"Ian J. Egielski , Jesse Huang , and Eddy Z. Zhang . 2014. Massive atomics for massive parallelism on GPUs . In Proceedings of the International Symposium on Memory Management (ISMM\u201914) . 93--103. Ian J. Egielski, Jesse Huang, and Eddy Z. Zhang. 2014. Massive atomics for massive parallelism on GPUs. 
In Proceedings of the International Symposium on Memory Management (ISMM\u201914). 93--103."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2011.34"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the 7th IEEE\/ACM International Symposium on Networks-on-Chip (NoCS\u201913)","author":"Franey Sean","unstructured":"Sean Franey and Mikko H. Lipasti . 2013. Accelerating atomic operations on GPGPUs . In Proceedings of the 7th IEEE\/ACM International Symposium on Networks-on-Chip (NoCS\u201913) . 1--8. Sean Franey and Mikko H. Lipasti. 2013. Accelerating atomic operations on GPGPUs. In Proceedings of the 7th IEEE\/ACM International Symposium on Networks-on-Chip (NoCS\u201913). 1--8."},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the International Semantic Web Conference.","author":"Fu Zhisong","unstructured":"Zhisong Fu , Harish Kumar Dasari , Bradley R. Bebee , Martin Berzins , and Bryan B. Thompson . 2014. MapGraph\u2014Graphprocessing at 30 billion edges per second on NVIDIA GPUs . In Proceedings of the International Semantic Web Conference. Zhisong Fu, Harish Kumar Dasari, Bradley R. Bebee, Martin Berzins, and Bryan B. Thompson. 2014. MapGraph\u2014Graphprocessing at 30 billion edges per second on NVIDIA GPUs. In Proceedings of the International Semantic Web Conference."},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 46th International Symposium on Computer Architecture (ISCA\u201919)","author":"Ganguly Debashis","unstructured":"Debashis Ganguly , Ziyu Zhang , Jun Yang , and Rami G. Melhem . 2019. Interplay between hardware prefetcher and page eviction policy in CPU-GPU unified virtual memory . In Proceedings of the 46th International Symposium on Computer Architecture (ISCA\u201919) . 224--235. Debashis Ganguly, Ziyu Zhang, Jun Yang, and Rami G. Melhem. 2019. Interplay between hardware prefetcher and page eviction policy in CPU-GPU unified virtual memory. 
In Proceedings of the 46th International Symposium on Computer Architecture (ISCA\u201919). 224--235."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370866"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2012.319"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201917)","author":"G\u00f3mez-Luna Juan","unstructured":"Juan G\u00f3mez-Luna , Izzat El Hajj , Li-Wen Chang , Victor Garcia-Flores , Simon Garcia De Gonzalo , Thomas B. Jablin , Antonio J. Pe\u00f1a , and Wen-mei W. Hwu . 2017. Chai: Collaborative heterogeneous applications for integrated-architectures . In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201917) . 43--54. Juan G\u00f3mez-Luna, Izzat El Hajj, Li-Wen Chang, Victor Garcia-Flores, Simon Garcia De Gonzalo, Thomas B. Jablin, Antonio J. Pe\u00f1a, and Wen-mei W. Hwu. 2017. Chai: Collaborative heterogeneous applications for integrated-architectures. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201917). 43--54."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2017.41"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201917)","author":"Hong Changwan","unstructured":"Changwan Hong , Aravind Sukumaran-Rajam , Jinsung Kim , and P. Sadayappan . 2017. MultiGraph: Efficient graph processing on GPUs . In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201917) . 27--40. Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, and P. Sadayappan. 2017. MultiGraph: Efficient graph processing on GPUs. 
In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201917). 27--40."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1941553.1941590"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/3157794.3157799"},{"key":"e_1_2_1_25_1","volume-title":"Scarpazza","author":"Jia Zhe","year":"2018","unstructured":"Zhe Jia , Marco Maggioni , Benjamin Staiger , and Daniele P . Scarpazza . 2018 . Dissecting the NVIDIA volta GPU architecture via microbenchmarking. Retrieved from https:\/\/arXiv:1804.06826. Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the NVIDIA volta GPU architecture via microbenchmarking. Retrieved from https:\/\/arXiv:1804.06826."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358286"},{"key":"e_1_2_1_27_1","unstructured":"Farzad Khorasani. 2014. Multi-threaded Large-Scale RMAT Graph Generator. Retrieved from https:\/\/github.com\/farkhor\/PaRMAT.  Farzad Khorasani. 2014. Multi-threaded Large-Scale RMAT Graph Generator. Retrieved from https:\/\/github.com\/farkhor\/PaRMAT."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201914)","author":"Khorasani Farzad","unstructured":"Farzad Khorasani , Keval Vora , Rajiv Gupta , and Laxmi N. Bhuyan . 2014. CuSha: Vertex-centric graph processing on GPUs . In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201914) . 239--252. Farzad Khorasani, Keval Vora, Rajiv Gupta, and Laxmi N. Bhuyan. 2014. CuSha: Vertex-centric graph processing on GPUs. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201914). 239--252."},{"key":"e_1_2_1_29_1","unstructured":"Khronos OpenCL Working Group. 2020. The OpenCL C 3.0 Specification (Provisional). 
Retrieved from https:\/\/www.khronos.org\/registry\/OpenCL\/specs\/3.0-unified\/html\/OpenCL_C.html.  Khronos OpenCL Working Group. 2020. The OpenCL C 3.0 Specification (Provisional). Retrieved from https:\/\/www.khronos.org\/registry\/OpenCL\/specs\/3.0-unified\/html\/OpenCL_C.html."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378529"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915204"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201912)","author":"Kyrola Aapo","year":"2012","unstructured":"Aapo Kyrola , Guy E. Blelloch , and Carlos Guestrin . 2012 . GraphChi: Large-scale graph computation on just a PC . In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201912) . 31--46. Aapo Kyrola, Guy E. Blelloch, and Carlos Guestrin. 2012. GraphChi: Large-scale graph computation on just a PC. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201912). 31--46."},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC\u201914)","author":"Landaverde Raphael","unstructured":"Raphael Landaverde , Tiansheng Zhang , Ayse K. Coskun , and Martin C. Herbordt . 2014. An investigation of unified memory access performance in CUDA . In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC\u201914) . 1--6. Raphael Landaverde, Tiansheng Zhang, Ayse K. Coskun, and Martin C. Herbordt. 2014. An investigation of unified memory access performance in CUDA. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC\u201914). 
1--6."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304044"},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201919)","author":"Li Lingda","unstructured":"Lingda Li and Barbara M. Chapman . 2019. Compiler assisted hybrid implicit and explicit GPU memory management under unified address space . In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201919) . 51:1--51:16. Lingda Li and Barbara M. Chapman. 2019. Compiler assisted hybrid implicit and explicit GPU memory management under unified address space. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201919). 51:1--51:16."},{"key":"e_1_2_1_36_1","volume-title":"Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201919)","author":"Liu Hang","unstructured":"Hang Liu and H. Howie Huang . 2019. SIMD-X: Programming and processing of graph algorithms on GPUs . In Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201919) . 411--428. Hang Liu and H. Howie Huang. 2019. SIMD-X: Programming and processing of graph algorithms on GPUs. In Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201919). 411--428."},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201917)","author":"Ma Lingxiao","year":"2017","unstructured":"Lingxiao Ma , Zhi Yang , Han Chen , Jilong Xue , and Yafei Dai . 2017 . Garaph: Efficient GPU-accelerated graph processing on a single machine with balanced replication . In Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201917) . 195--207. Lingxiao Ma, Zhi Yang, Han Chen, Jilong Xue, and Yafei Dai. 2017. Garaph: Efficient GPU-accelerated graph processing on a single machine with balanced replication. 
In Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201917). 195--207."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC.2018.8547517"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807184"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00035"},{"key":"e_1_2_1_41_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914)","author":"McLaughlin Adam","unstructured":"Adam McLaughlin and David A. Bader . 2014. Scalable and high performance betweenness centrality on the GPU . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914) . 572--583. Adam McLaughlin and David A. Bader. 2014. Scalable and high performance betweenness centrality on the GPU. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914). 572--583."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295716"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370036.2145832"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2458523.2458533"},{"key":"e_1_2_1_45_1","volume-title":"NVIDIA Tesla V100 GPU architecture. White Paperv1.1","author":"NVIDIA.","year":"2017","unstructured":"NVIDIA. 2017. NVIDIA Tesla V100 GPU architecture. White Paperv1.1 ( 2017 ), 53. Retrieved from http:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf. NVIDIA. 2017. NVIDIA Tesla V100 GPU architecture. White Paperv1.1 (2017), 53. Retrieved from http:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf."},{"key":"e_1_2_1_46_1","unstructured":"NVIDIA. 2018. NVIDIA Pascal Architecture. https:\/\/www.nvidia.com\/en-us\/data-center\/pascal-gpu-architecture\/.  NVIDIA. 2018. 
NVIDIA Pascal Architecture. https:\/\/www.nvidia.com\/en-us\/data-center\/pascal-gpu-architecture\/."},{"key":"e_1_2_1_47_1","unstructured":"NVIDIA. 2020. Nsight Compute CLI :: Nsight Compute Documentation. Retrieved from https:\/\/docs.nvidia.com\/nsight-compute\/NsightComputeCli\/index.html.  NVIDIA. 2020. Nsight Compute CLI :: Nsight Compute Documentation. Retrieved from https:\/\/docs.nvidia.com\/nsight-compute\/NsightComputeCli\/index.html."},{"key":"e_1_2_1_48_1","unstructured":"NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. Retrieved from https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-ampere-architecture-whitepaper.pdf.  NVIDIA. 2020. NVIDIA A100 Tensor Core GPU Architecture. Retrieved from https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-ampere-architecture-whitepaper.pdf."},{"key":"e_1_2_1_49_1","volume-title":"Programming Guide: CUDA Toolkit Documentation.","author":"NVIDIA.","year":"2020","unstructured":"NVIDIA. 2020 . Programming Guide: CUDA Toolkit Documentation. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html. NVIDIA. 2020. Programming Guide: CUDA Toolkit Documentation. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html."},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC\u201919)","author":"Pearson Carl","unstructured":"Carl Pearson , Mohammad Almasri , Omer Anjum , Vikram S. Mailthody , Zaid Qureshi , Rakesh Nagi , Jinjun Xiong , and Wen-Mei W. Hwu . 2019. Update on triangle counting on GPU . In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC\u201919) . 1--7. Carl Pearson, Mohammad Almasri, Omer Anjum, Vikram S. Mailthody, Zaid Qureshi, Rakesh Nagi, Jinjun Xiong, and Wen-Mei W. Hwu. 2019. Update on triangle counting on GPU. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC\u201919). 
1--7."},{"key":"e_1_2_1_51_1","unstructured":"Usman Pirzada. 2019. Intel Hints Towards An Xe \u201cCoherent Multi-GPU\u201d Future With CXL Interconnect. Retrieved from https:\/\/wccftech.com\/intel-xe-coherent-multi-gpu-cxl\/.  Usman Pirzada. 2019. Intel Hints Towards An Xe \u201cCoherent Multi-GPU\u201d Future With CXL Interconnect. Retrieved from https:\/\/wccftech.com\/intel-xe-coherent-multi-gpu-cxl\/."},{"key":"e_1_2_1_52_1","volume-title":"Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Rhu Minsoo","unstructured":"Minsoo Rhu , Natalia Gimelshein , Jason Clemons , Arslan Zulfiqar , and Stephen W. Keckler . 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design . In Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916) . 18:1--18:13. Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). 18:1--18:13."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/2517349.2522740"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173180"},{"key":"e_1_2_1_55_1","volume-title":"Proceedings of the 15th EuroSys Conference (EUROSYS\u201920)","author":"Nodehi Sabet Amir Hossein","year":"2020","unstructured":"Amir Hossein Nodehi Sabet , Zhijia Zhao , and Rajiv Gupta . 2020 . Subway: Minimizing data transfer during out-of-GPU-memory graph processing . In Proceedings of the 15th EuroSys Conference (EUROSYS\u201920) . 12:1--12:16. Amir Hossein Nodehi Sabet, Zhijia Zhao, and Rajiv Gupta. 2020. Subway: Minimizing data transfer during out-of-GPU-memory graph processing. In Proceedings of the 15th EuroSys Conference (EUROSYS\u201920). 
12:1--12:16."},{"key":"e_1_2_1_56_1","volume-title":"Proceedings of the NVIDIA GPU Technology Conference (GTC\u201918)","author":"Sakharnykh Nikolay","year":"2018","unstructured":"Nikolay Sakharnykh . 2018 . Everything you need to know about unified memory . In Proceedings of the NVIDIA GPU Technology Conference (GTC\u201918) . Nikolay Sakharnykh. 2018. Everything you need to know about unified memory. In Proceedings of the NVIDIA GPU Technology Conference (GTC\u201918)."},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/2967938.2967944"},{"key":"e_1_2_1_58_1","volume-title":"Proceedings of the 46th International Symposium on Computer Architecture (ISCA\u201919)","author":"Sun Yifan","unstructured":"Yifan Sun , Trinayan Baruah , Saiful A. Mojumder , Shi Dong , Xiang Gong , Shane Treadway , Yuhui Bao , Spencer Hance , Carter McCardwell , Vincent Zhao , Harrison Barclay , Amir Kavyan Ziabari , Zhongliang Chen , Rafael Ubal , Jos\u00e9 L. Abell\u00e1n , John Kim , Ajay Joshi , and David R. Kaeli . 2019. MGPUSim: Enabling multi-GPU performance modeling and optimization . In Proceedings of the 46th International Symposium on Computer Architecture (ISCA\u201919) . 197--209. Yifan Sun, Trinayan Baruah, Saiful A. Mojumder, Shi Dong, Xiang Gong, Shane Treadway, Yuhui Bao, Spencer Hance, Carter McCardwell, Vincent Zhao, Harrison Barclay, Amir Kavyan Ziabari, Zhongliang Chen, Rafael Ubal, Jos\u00e9 L. Abell\u00e1n, John Kim, Ajay Joshi, and David R. Kaeli. 2019. MGPUSim: Enabling multi-GPU performance modeling and optimization. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA\u201919). 197--209."},{"key":"e_1_2_1_59_1","volume-title":"Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201916)","author":"Sun Yifan","unstructured":"Yifan Sun , Xiang Gong , Amir Kavyan Ziabari , Leiming Yu , Xiangyu Li , Saoni Mukherjee , Carter McCardwell , Alejandro Villegas , and David R. Kaeli . 2016. 
Hetero-mark, a benchmark suite for CPU-GPU collaborative computing . In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201916) . 13--22. Yifan Sun, Xiang Gong, Amir Kavyan Ziabari, Leiming Yu, Xiangyu Li, Saoni Mukherjee, Carter McCardwell, Alejandro Villegas, and David R. Kaeli. 2016. Hetero-mark, a benchmark suite for CPU-GPU collaborative computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC\u201916). 13--22."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295733"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC.2019.8916506"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2019.00032"},{"key":"e_1_2_1_63_1","volume-title":"Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP\u201915)","author":"Wang Yangzihao","unstructured":"Yangzihao Wang , Andrew A. Davidson , Yuechao Pan , Yuduo Wu , Andy Riffel , and John D. Owens . 2015. Gunrock: A high-performance graph processing library on the GPU . In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP\u201915) . 265--266. Yangzihao Wang, Andrew A. Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. 2015. Gunrock: A high-performance graph processing library on the GPU. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP\u201915). 265--266."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-662-48096-0_34"},{"key":"e_1_2_1_65_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPS\u201915)","author":"Yang Carl","unstructured":"Carl Yang , Yangzihao Wang , and John D. Owens . 2015. Fast sparse matrix and sparse vector multiplication algorithm on the GPU . 
In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPS\u201915) . 841--847. Carl Yang, Yangzihao Wang, and John D. Owens. 2015. Fast sparse matrix and sparse vector multiplication algorithm on the GPU. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPS\u201915). 841--847."},{"key":"e_1_2_1_66_1","volume-title":"Proceedings of the 12th IEEE International Conference on Data Mining (ICDM\u201912)","author":"Yang Jaewon","year":"2012","unstructured":"Jaewon Yang and Jure Leskovec . 2012 . Defining and evaluating network communities based on ground-truth . In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM\u201912) . 745--754. Jaewon Yang and Jure Leskovec. 2012. Defining and evaluating network communities based on ground-truth. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM\u201912). 745--754."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW.2016.185"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3416495"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304029"},{"key":"e_1_2_1_70_1","volume-title":"Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201920)","author":"Zheng Long","year":"2020","unstructured":"Long Zheng , Xianliang Li , Yaohui Zheng , Yu Huang , Xiaofei Liao , Hai Jin , Jingling Xue , Zhiyuan Shao , and Qiang-Sheng Hua . 2020 . Scaph: Scalable GPU-accelerated graph processing with value-driven differential scheduling . In Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201920) , Ada Gavrilovska and Erez Zadok (Eds.). USENIX Association, 573--588. Long Zheng, Xianliang Li, Yaohui Zheng, Yu Huang, Xiaofei Liao, Hai Jin, Jingling Xue, Zhiyuan Shao, and Qiang-Sheng Hua. 2020. Scaph: Scalable GPU-accelerated graph processing with value-driven differential scheduling. 
In Proceedings of the USENIX Annual Technical Conference (USENIXATC\u201920), Ada Gavrilovska and Erez Zadok (Eds.). USENIX Association, 573--588."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3444844","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3444844","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:28:13Z","timestamp":1750195693000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3444844"}},"subtitle":["Toward Unified-memory-efficient High-performance Graph Processing on GPU"],"short-title":[],"issued":{"date-parts":[[2021,2,9]]},"references-count":70,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,6,30]]}},"alternative-id":["10.1145\/3444844"],"URL":"https:\/\/doi.org\/10.1145\/3444844","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,2,9]]},"assertion":[{"value":"2020-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-02-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}