{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:10:10Z","timestamp":1750194610287,"version":"3.41.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2021,10,22]],"date-time":"2021-10-22T00:00:00Z","timestamp":1634860800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"US National Science Foundation","doi-asserted-by":"crossref","award":["CCF-1815467"],"award-info":[{"award-number":["CCF-1815467"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Emerg. Technol. Comput. Syst."],"published-print":{"date-parts":[[2022,1,31]]},"abstract":"<jats:p>\n            Recent advances in GPU-based manycore accelerators provide the opportunity to efficiently process large-scale graphs on chip. However, real world graphs have a diverse range of topology and connectivity patterns (e.g., degree distributions) that make the design of input-agnostic hardware architectures a challenge.\n            <jats:bold>Network-on-Chip (NoC)-<\/jats:bold>\n            based architectures provide a way to overcome this challenge as the architectural topology can be used to approximately model the expected traffic patterns that emerge from graph application workloads. In this paper, we first study the mix of long- and short-range traffic patterns generated on-chip using graph workloads, and subsequently use the findings to adapt the design of an optimal NoC-based architecture. 
In particular, by leveraging emerging\n            <jats:bold>three-dimensional (3D)<\/jats:bold>\n            integration technology, we propose the design of a\n            <jats:bold>small-world NoC (SWNoC)-<\/jats:bold>\n            enabled manycore GPU architecture, where the placement of the links connecting the\n            <jats:bold>streaming multiprocessors (SM)<\/jats:bold>\n            and the\n            <jats:bold>memory controllers (MC)<\/jats:bold>\n            follows a power-law distribution. The proposed 3D manycore GPU architecture outperforms its traditional planar (2D) counterparts in both performance and energy consumption. Moreover, by adopting a joint performance-thermal optimization strategy, we address the thermal concerns in a 3D design without noticeably compromising the achievable performance. The 3D integration technology is also leveraged to incorporate\n            <jats:bold>Near Data Processing (NDP)<\/jats:bold>\n            to complement the performance benefits introduced by the SWNoC architecture. As graph applications are inherently memory intensive, off-chip data movement gives rise to latency and energy overheads in the presence of external DRAM. In conventional GPU architectures, as the main memory layer is not integrated with the logic, off-chip data movement negatively impacts overall performance and energy consumption. We demonstrate that NDP significantly reduces the overheads associated with such frequent and irregular memory accesses in graph-based applications. 
The proposed SWNoC-enabled NDP framework that integrates 3D memory (like Micron's HMC) with a massive number of GPU cores achieves 29.5% performance improvement and 30.03% less energy consumption on average compared to a conventional planar Mesh-based design with external DRAM.\n          <\/jats:p>","DOI":"10.1145\/3482880","type":"journal-article","created":{"date-parts":[[2021,10,23]],"date-time":"2021-10-23T00:20:36Z","timestamp":1634948436000},"page":"1-19","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["High-Performance and Energy-Efficient 3D Manycore GPU Architecture for Accelerating Graph Analytics"],"prefix":"10.1145","volume":"18","author":[{"given":"Dwaipayan","family":"Choudhury","sequence":"first","affiliation":[{"name":"Washington State University, Pullman, WA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Aravind Sukumaran","family":"Rajam","sequence":"additional","affiliation":[{"name":"Washington State University, Pullman, WA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ananth","family":"Kalyanaraman","sequence":"additional","affiliation":[{"name":"Washington State University, Pullman, WA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Partha Pratim","family":"Pande","sequence":"additional","affiliation":[{"name":"Washington State University, Pullman, WA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,10,22]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_2_1_1_1","DOI":"10.1038\/scientificamerican0503-60"},{"doi-asserted-by":"crossref","unstructured":"K. Duraisamy H. Lu P. Pande and A. Kalyanaraman. 2016. High-performance and energy-efficient network-on-chip architectures for graph analytics. ACM Transactions on Embedded Computing Systems 66. DOI:https:\/\/doi.org\/10.1145\/2961027  K. Duraisamy H. Lu P. Pande and A. Kalyanaraman. 2016. 
High-performance and energy-efficient network-on-chip architectures for graph analytics. ACM Transactions on Embedded Computing Systems 66. DOI:https:\/\/doi.org\/10.1145\/2961027","key":"e_1_2_1_2_1","DOI":"10.1145\/2961027"},{"doi-asserted-by":"publisher","key":"e_1_2_1_3_1","DOI":"10.1109\/MM.2014.55"},{"volume-title":"ACM\/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","author":"Ahn J.","key":"e_1_2_1_4_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_5_1","DOI":"10.1109\/IISWC.2013.6704684"},{"volume-title":"Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing. 239\u2013252","author":"Khorasani F.","key":"e_1_2_1_6_1"},{"doi-asserted-by":"crossref","unstructured":"Z. Fu M. Personick and B. Thompson. 2014. MapGraph: A high level API for fast development of high performance graph analytics on GPUs. DOI:https:\/\/doi.org\/10.1145\/2621934.2621936  Z. Fu M. Personick and B. Thompson. 2014. MapGraph: A high level API for fast development of high performance graph analytics on GPUs. DOI:https:\/\/doi.org\/10.1145\/2621934.2621936","key":"e_1_2_1_7_1","DOI":"10.1145\/2621934.2621936"},{"doi-asserted-by":"publisher","key":"e_1_2_1_8_1","DOI":"10.1145\/2851141.2851145"},{"volume-title":"IEEE International Conference on Computer Design","year":"2009","author":"Maashri A. A.","key":"e_1_2_1_9_1"},{"volume-title":"ACM\/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA'16)","year":"2016","author":"Ozdal M. M.","key":"e_1_2_1_10_1"},{"volume-title":"49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO'16)","year":"2016","author":"Ham T. J.","key":"e_1_2_1_11_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_12_1","DOI":"10.1109\/CCGRID.2017.114"},{"unstructured":"http:\/\/www.sommer.jp\/graphs\/.  
","key":"e_1_2_1_13_1"},{"volume-title":"IEEE International Symposium on High Performance Computer Architecture (HPCA'17)","year":"2017","author":"Nai L.","key":"e_1_2_1_14_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_15_1","DOI":"10.1109\/HPCA.2018.00053"},{"volume-title":"26th International Conference on Parallel Architectures and Compilation Techniques (PACT'16)","year":"2017","author":"Fujiki D.","key":"e_1_2_1_16_1"},{"volume-title":"IEEE International Symposium on High Performance Computer Architecture (HPCA'16)","year":"2018","author":"Song L.","key":"e_1_2_1_17_1"},{"volume-title":"GraphSAR. In Proceedings of the 24th Asia and South Pacific Design Automation Conference. DOI:https:\/\/doi.org\/10","author":"Dai G.","key":"e_1_2_1_18_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_19_1","DOI":"10.1109\/IPDPS47924.2020.00077"},{"doi-asserted-by":"crossref","unstructured":"E. Azarkhish D. Rossi I. Loi and L. Benini. 2016. Design and evaluation of a processing-in-memory architecture for the smart memory cube. In Architecture of Computing Systems (ARCS'16). 19\u201331. DOI:https:\/\/doi.org\/10.1007\/978-3-319-30695-7_2","key":"e_1_2_1_20_1","DOI":"10.1007\/978-3-319-30695-7_2"},{"doi-asserted-by":"publisher","key":"e_1_2_1_21_1","DOI":"10.1145\/2508148.2485928"},{"volume-title":"Proceedings of the Second International Symposium on Memory Systems. DOI:https:\/\/doi.org\/10","author":"Zhu Y.","key":"e_1_2_1_22_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_23_1","DOI":"10.1145\/3130218.3130219"},{"volume-title":"10th International Symposium on Quality Electronic Design","year":"2009","author":"Alam S. 
M.","key":"e_1_2_1_24_1"},{"volume-title":"2008 37th International Conference on Parallel Processing","year":"2008","author":"Zhou X.","key":"e_1_2_1_25_1"},{"key":"e_1_2_1_26_1","first-page":"509","article-title":"Emergence of scaling in random networks","volume":"286","author":"Barab\u00e1si A.","year":"1999","journal-title":"Disordered Systems and Neural Networks Science"},{"doi-asserted-by":"publisher","key":"e_1_2_1_27_1","DOI":"10.1038\/s41467-019-08746-5"},{"doi-asserted-by":"publisher","key":"e_1_2_1_28_1","DOI":"10.1038\/s41467-019-09038-8"},{"doi-asserted-by":"publisher","key":"e_1_2_1_29_1","DOI":"10.1109\/TVLSI.2006.878263"},{"doi-asserted-by":"crossref","unstructured":"D. J. Watts and S. H. Strogatz. 1998. Collective dynamics of 'small-world' networks. Nature 393 6684 (1998) 440\u2013442. DOI:https:\/\/doi.org\/10.1038\/30918","key":"e_1_2_1_30_1","DOI":"10.1038\/30918"},{"volume-title":"De Los Rios","year":"2005","author":"Petermann T.","key":"e_1_2_1_31_1"},{"doi-asserted-by":"crossref","unstructured":"S. Das J. R. Doppa P. P. Pande and K. Chakrabarty. 2016. Design-space exploration and optimization of an energy-efficient and reliable 3D small-world network-on-chip. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD'16). DOI:https:\/\/doi.org\/10.1109\/TCAD.2016.2604288","key":"e_1_2_1_32_1","DOI":"10.1109\/TCAD.2016.2604288"},{"doi-asserted-by":"crossref","unstructured":"A. Arka B. Joardar R. Kim D. Kim J. Doppa and P. Pande. 2020. 
HeM3D: Heterogeneous manycore architecture based on monolithic 3D vertical integration. ACM Transactions on Design Automation of Electronic Systems. DOI:https:\/\/doi.org\/10.1145\/3424239","key":"e_1_2_1_33_1","DOI":"10.1145\/3424239"},{"doi-asserted-by":"publisher","key":"e_1_2_1_34_1","DOI":"10.1109\/TC.2018.2889053"},{"volume-title":"IEEE\/ACM International Conference on Computer-Aided Design, Digest of Technical Papers (ICCAD'04)","year":"2004","author":"Cong J.","key":"e_1_2_1_35_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_36_1","DOI":"10.1109\/TEVC.2007.900837"},{"doi-asserted-by":"publisher","key":"e_1_2_1_37_1","DOI":"10.1109\/ISPASS.2009.4919648"},{"volume-title":"IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'13)","year":"2013","key":"e_1_2_1_38_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_39_1","DOI":"10.1145\/2508148.2485964"},{"volume-title":"Proceedings of ICCAD","author":"Sridhar A.","key":"e_1_2_1_40_1"},{"unstructured":"http:\/\/snap.stanford.edu\/data\/egonets-Facebook.html.","key":"e_1_2_1_41_1"},{"unstructured":"http:\/\/networkrepository.com\/road-usroads.php.","key":"e_1_2_1_42_1"},{"unstructured":"http:\/\/snap.stanford.edu\/data\/github-social.html.","key":"e_1_2_1_43_1"},{"unstructured":"http:\/\/snap.stanford.edu\/data\/gemsec-Deezer.html.","key":"e_1_2_1_44_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_45_1","DOI":"10.1016\/j.vlsi.2017.12.002"},{"doi-asserted-by":"crossref","unstructured":"R. Hadidi et al. 2018. 
Performance implications of NoCs on 3D-stacked memories: Insights from the hybrid memory cube. In IEEE ISPASS Belfast. 99\u2013108","key":"e_1_2_1_46_1","DOI":"10.1109\/ISPASS.2018.00018"},{"doi-asserted-by":"publisher","key":"e_1_2_1_47_1","DOI":"10.1145\/3223046"},{"volume-title":"Proceedings of the 26th ACM\/IEEE Design Automation Conference (DAC\u201989)","author":"Bui T.","key":"e_1_2_1_48_1"}],"container-title":["ACM Journal on Emerging Technologies in Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3482880","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3482880","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:45Z","timestamp":1750193325000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3482880"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,22]]},"references-count":48,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,1,31]]}},"alternative-id":["10.1145\/3482880"],"URL":"https:\/\/doi.org\/10.1145\/3482880","relation":{},"ISSN":["1550-4832","1550-4840"],"issn-type":[{"type":"print","value":"1550-4832"},{"type":"electronic","value":"1550-4840"}],"subject":[],"published":{"date-parts":[[2021,10,22]]},"assertion":[{"value":"2020-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2021-10-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}