{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T10:59:55Z","timestamp":1777546795481,"version":"3.51.4"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2021,5,10]],"date-time":"2021-05-10T00:00:00Z","timestamp":1620604800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"NSF of China","doi-asserted-by":"crossref","award":["61872374 and 61672526"],"award-info":[{"award-number":["61872374 and 61672526"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,9,30]]},"abstract":"<jats:p>Due to massive thread-level parallelism, GPUs have become an attractive platform for accelerating large-scale data parallel computations, such as graph processing. However, achieving high performance for graph processing with GPUs is non-trivial. Processing graphs on GPUs introduces several problems, such as load imbalance, low utilization of hardware unit, and memory divergence. Although previous work has proposed several software strategies to optimize graph processing on GPUs, there are several issues beyond the capability of software techniques to address.<\/jats:p>\n          <jats:p>In this article, we present GraphPEG, a graph processing engine for efficient graph processing on GPUs. Inspired by the observation that many graph algorithms have a common pattern on graph traversal, GraphPEG improves the performance of graph processing by coupling automatic edge gathering with fine-grain work distribution. GraphPEG can also adapt to various input graph datasets and simplify the software design of graph processing with hardware-assisted graph traversal. Simulation results show that, in comparison with two representative highly efficient GPU graph processing software framework Gunrock and SEP-Graph, GraphPEG improves graph processing throughput by 2.8\u00d7 and 2.5\u00d7 on average, and up to 7.3\u00d7 and 7.0\u00d7 for six graph algorithm benchmarks on six graph datasets, with marginal hardware cost.<\/jats:p>","DOI":"10.1145\/3450440","type":"journal-article","created":{"date-parts":[[2021,5,10]],"date-time":"2021-05-10T22:15:17Z","timestamp":1620684917000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["GraphPEG"],"prefix":"10.1145","volume":"18","author":[{"given":"Yashuai","family":"L\u00fc","sequence":"first","affiliation":[{"name":"Space Engineering University, China"}]},{"given":"Hui","family":"Guo","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Libo","family":"Huang","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Qi","family":"Yu","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Li","family":"Shen","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Nong","family":"Xiao","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]},{"given":"Zhiying","family":"Wang","sequence":"additional","affiliation":[{"name":"National University of Defense Technology, China"}]}],"member":"320","published-online":{"date-parts":[[2021,5,10]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2872887.2750386"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1090\/conm\/588"},{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909)","author":"Bakhoda Ali","unstructured":"Ali Bakhoda , George L. Yuan , Wilson W. L. Fung , Henry Wong , and Tor M. Aamodt . 2009. Analyzing CUDA workloads using a detailed GPU simulator . In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909) . IEEE, 163--174. Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201909). IEEE, 163--174."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.59"},{"key":"e_1_2_1_5_1","unstructured":"S. Baxter. 2013. Modern GPU library.  S. Baxter. 2013. Modern GPU library."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018756"},{"key":"e_1_2_1_7_1","unstructured":"Xuhao Chen. 2019. GraphCage: Cache aware graph processing on GPUs. Retrieved from http:\/\/arxiv.org\/abs\/1904.02241.  Xuhao Chen. 2019. GraphCage: Cache aware graph processing on GPUs. Retrieved from http:\/\/arxiv.org\/abs\/1904.02241."},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of 28th International Parallel and Distributed Processing Symposium. IEEE, 349--359","author":"Davidson Andrew Alan","unstructured":"Andrew Alan Davidson , Sean Baxter , Michael Garland , and John D. Owens . 2014. Work-efficient parallel GPU methods for single-source shortest paths . In Proceedings of 28th International Parallel and Distributed Processing Symposium. IEEE, 349--359 . Andrew Alan Davidson, Sean Baxter, Michael Garland, and John D. Owens. 2014. Work-efficient parallel GPU methods for single-source shortest paths. In Proceedings of 28th International Parallel and Distributed Processing Symposium. IEEE, 349--359."},{"key":"e_1_2_1_9_1","unstructured":"Erich Elsen and Vishal Vaidyanathan. 2014. VertexAPI2\u2013A vertex-program API for large graph computations on the GPU. Retrieved from https:\/\/github.com\/gunrock\/vertexAPI2.  Erich Elsen and Vishal Vaidyanathan. 2014. VertexAPI2\u2013A vertex-program API for large graph computations on the GPU. Retrieved from https:\/\/github.com\/gunrock\/vertexAPI2."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2621934.2621936"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201912)","volume":"12","author":"Gonzalez Joseph E.","year":"2012","unstructured":"Joseph E. Gonzalez , Yucheng Low , Haijie Gu , Danny Bickson , and Carlos Guestrin . 2012 . Powergraph: Distributed graph-parallel computation on natural graphs . In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201912) , Vol. 12 . 2. Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. Powergraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201912), Vol. 12. 2."},{"key":"e_1_2_1_12_1","first-page":"1","article-title":"The parallel BGL: A generic library for distributed graph computations","volume":"2","author":"Gregor Douglas","year":"2005","unstructured":"Douglas Gregor and Andrew Lumsdaine . 2005 . The parallel BGL: A generic library for distributed graph computations . Parallel Object-Orient. Sci. Comput. 2 (2005), 1 -- 18 . Douglas Gregor and Andrew Lumsdaine. 2005. The parallel BGL: A generic library for distributed graph computations. Parallel Object-Orient. Sci. Comput. 2 (2005), 1--18.","journal-title":"Parallel Object-Orient. Sci. Comput."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783759"},{"key":"e_1_2_1_14_1","volume-title":"Proceedings of the International Conference on High-performance Computing. Springer, 197--208","author":"Harish Pawan","unstructured":"Pawan Harish and P. J. Narayanan . 2007. Accelerating large graph algorithms on the GPU using CUDA . In Proceedings of the International Conference on High-performance Computing. Springer, 197--208 . Pawan Harish and P. J. Narayanan. 2007. Accelerating large graph algorithms on the GPU using CUDA. In Proceedings of the International Conference on High-performance Computing. Springer, 197--208."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2189750.2151013"},{"key":"e_1_2_1_16_1","volume-title":"Tayo Oguntebi, and Kunle Olukotun.","author":"Hong Sungpack","year":"2011","unstructured":"Sungpack Hong , Sang Kyun Kim , Tayo Oguntebi, and Kunle Olukotun. 2011 . Accelerating CUDA graph algorithms at maximum warp. In ACM SIGPLAN Notices, Vol. 46 . ACM , 267--276. Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, and Kunle Olukotun. 2011. Accelerating CUDA graph algorithms at maximum warp. In ACM SIGPLAN Notices, Vol. 46. ACM, 267--276."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.14"},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the International Conference on Parallel Architecture and Compilation (PACT\u201915)","author":"Khorasani Farzad","unstructured":"Farzad Khorasani , Rajiv Gupta , and Laxmi N. Bhuyan . 2015. Scalable simd-efficient graph processing on gpus . In Proceedings of the International Conference on Parallel Architecture and Compilation (PACT\u201915) . IEEE, 39--50. Farzad Khorasani, Rajiv Gupta, and Laxmi N. Bhuyan. 2015. Scalable simd-efficient graph processing on gpus. In Proceedings of the International Conference on Parallel Architecture and Compilation (PACT\u201915). IEEE, 39--50."},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing. ACM, 239--252","author":"Khorasani Farzad","unstructured":"Farzad Khorasani , Keval Vora , Rajiv Gupta , and Laxmi N. Bhuyan . 2014. CuSha: Vertex-centric graph processing on GPUs . In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing. ACM, 239--252 . Farzad Khorasani, Keval Vora, Rajiv Gupta, and Laxmi N. Bhuyan. 2014. CuSha: Vertex-centric graph processing on GPUs. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing. ACM, 239--252."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.24"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1273440.1250683"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the IEEE 20th International Symposium on High Performance Computer Architecture (HPCA\u201914)","author":"Nagesh","unstructured":"Nagesh B. Lakshminarayana and Hyesoon Kim. 2014. Spare register aware prefetching for graph algorithms on GPUs . In Proceedings of the IEEE 20th International Symposium on High Performance Computer Architecture (HPCA\u201914) . IEEE, 614--625. Nagesh B. Lakshminarayana and Hyesoon Kim. 2014. Spare register aware prefetching for graph algorithms on GPUs. In Proceedings of the IEEE 20th International Symposium on High Performance Computer Architecture (HPCA\u201914). IEEE, 614--625."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2012.2202699"},{"key":"e_1_2_1_24_1","unstructured":"Jure Leskovec and Andrej Krevl. 2015. SNAP datasets: Stanford Large network dataset collection. Retrieved from https:\/\/snap.stanford.edu\/index.html.  Jure Leskovec and Andrej Krevl. 2015. SNAP datasets: Stanford Large network dataset collection. Retrieved from https:\/\/snap.stanford.edu\/index.html."},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of theInternational Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201915)","author":"Liu Hang","unstructured":"Hang Liu and H. Howie Huang . 2015. Enterprise: Breadth-first graph traversal on GPUs . In Proceedings of theInternational Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201915) . IEEE, 1--12. Hang Liu and H. Howie Huang. 2015. Enterprise: Breadth-first graph traversal on GPUs. In Proceedings of theInternational Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201915). IEEE, 1--12."},{"key":"e_1_2_1_26_1","volume-title":"Graphlab: A new framework for parallel machine learning.","author":"Low Yucheng","year":"2014","unstructured":"Yucheng Low , Joseph E. Gonzalez , Aapo Kyrola , Danny Bickson , Carlos E. Guestrin , and Joseph Hellerstein . 2014 . Graphlab: A new framework for parallel machine learning. Retrieved from https:\/\/arXiv:1408.2041. Yucheng Low, Joseph E. Gonzalez, Aapo Kyrola, Danny Bickson, Carlos E. Guestrin, and Joseph Hellerstein. 2014. Graphlab: A new framework for parallel machine learning. Retrieved from https:\/\/arXiv:1408.2041."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1837274.1837289"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807184"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2008.4629999"},{"key":"e_1_2_1_30_1","volume-title":"Bader","author":"McLaughlin Adam","year":"2014","unstructured":"Adam McLaughlin and David A . Bader . 2014 . Scalable and high performance betweenness centrality on the GPU. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press , 572--583. Adam McLaughlin and David A. Bader. 2014. Scalable and high performance betweenness centrality on the GPU. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 572--583."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370036.2145831"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145832"},{"key":"e_1_2_1_33_1","volume-title":"Ang","author":"Murphy Richard C.","year":"2010","unstructured":"Richard C. Murphy , Kyle B. Wheeler , Brian W. Barrett , and James A . Ang . 2010 . Introducing the graph 500. Cray User\u2019s Group (CUG) 19 (2010), 45--74. Richard C. Murphy, Kyle B. Wheeler, Brian W. Barrett, and James A. Ang. 2010. Introducing the graph 500. Cray User\u2019s Group (CUG) 19 (2010), 45--74."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.54"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2013.28"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2517349.2522739"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173180"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2014.15"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2847263.2847337"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2678373.2665701"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.24"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSCS.2009.5412615"},{"key":"e_1_2_1_43_1","volume-title":"Proceedings of the 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72--83","author":"Rogers Timothy G.","unstructured":"Timothy G. Rogers , Mike O\u2019Connor , and Tor M. Aamodt . 2012. Cache-conscious wavefront scheduling . In Proceedings of the 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72--83 . Timothy G. Rogers, Mike O\u2019Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72--83."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/1736020.1736055"},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW\u201910)","author":"Soman Jyothish","unstructured":"Jyothish Soman , Kothapalli Kishore , and P. J. Narayanan . 2010. A fast GPU algorithm for graph connectivity . In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW\u201910) . IEEE, 1--8. Jyothish Soman, Kothapalli Kishore, and P. J. Narayanan. 2010. A fast GPU algorithm for graph connectivity. In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW\u201910). IEEE, 1--8."},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of the Conference on High Performance Graphics. ACM, 167--171","author":"Vineet Vibhav","unstructured":"Vibhav Vineet , Pawan Harish , Suryakant Patidar , and P. J. Narayanan . 2009. Fast minimum spanning tree for large graphs on the GPU . In Proceedings of the Conference on High Performance Graphics. ACM, 167--171 . Vibhav Vineet, Pawan Harish, Suryakant Patidar, and P. J. Narayanan. 2009. Fast minimum spanning tree for large graphs on the GPU. In Proceedings of the Conference on High Performance Graphics. ACM, 167--171."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3293883.3295733"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3108140"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358318"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173197"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304029"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2013.111"},{"key":"e_1_2_1_53_1","volume-title":"Proceedings of International Parallel and Distributed Processing Symposium Workshop. IEEE, 129--136","author":"Zhou Shijie","unstructured":"Shijie Zhou , Charalampos Chelmis , and Viktor K. Prasanna . 2015. Accelerating large-scale single-source shortest path on FPGA . In Proceedings of International Parallel and Distributed Processing Symposium Workshop. IEEE, 129--136 . Shijie Zhou, Charalampos Chelmis, and Viktor K. Prasanna. 2015. Accelerating large-scale single-source shortest path on FPGA. In Proceedings of International Parallel and Distributed Processing Symposium Workshop. IEEE, 129--136."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3450440","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3450440","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:18:45Z","timestamp":1750191525000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3450440"}},"subtitle":["Accelerating Graph Processing on GPUs"],"short-title":[],"issued":{"date-parts":[[2021,5,10]]},"references-count":53,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,9,30]]}},"alternative-id":["10.1145\/3450440"],"URL":"https:\/\/doi.org\/10.1145\/3450440","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,5,10]]},"assertion":[{"value":"2020-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-05-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}