{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,13]],"date-time":"2026-02-13T10:50:19Z","timestamp":1770979819489,"version":"3.50.1"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,11,18]],"date-time":"2024-11-18T00:00:00Z","timestamp":1731888000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Key Research and Development Program of China","award":["2021YFA1003602"],"award-info":[{"award-number":["2021YFA1003602"]}]},{"name":"Shanghai Pujiang Program","award":["22PJD003"],"award-info":[{"award-number":["22PJD003"]}]},{"name":"CFFF platform of Fudan University"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2024,12,31]]},"abstract":"<jats:p>\n            Field-programmable gate arrays (FPGAs) are an ideal candidate for accelerating graph neural networks (GNNs). However, the FPGA redeployment process is time-consuming when updating or switching between diverse GNN models across different applications. Existing GNN processors eliminate the need for FPGA redeployment when switching between different GNN models. However, adapting matrix multiplication types by switching processing units decreases hardware utilization. In addition, the bandwidth of DDR limits further improvements in hardware performance. This article proposes a highly flexible FPGA-based overlay processor for GNN accelerations. Graph-OPU provides excellent flexibility and programmability for users, as the executable code of GNN models is automatically compiled and reloaded without requiring FPGA redeployment. First, we customize the compiler and instruction sets for the inference process of different GNN models. Second, we customize the datapath and optimize the data format in the microarchitecture to fully leverage the advantages of high bandwidth memory (HBM). Third, we design a unified matrix multiplication to handle both sparse-dense matrix multiplication (SpMM) and general matrix multiplication (GEMM), enhancing Graph-OPU performance. During Graph-OPU execution, the computational units are shared between SpMM and GEMM instead of being switched, which improves the hardware utilization. Finally, we implement a hardware prototype on the Xilinx Alveo U50 and test the mainstream GNN models using various datasets. Experimental results show that Graph-OPU achieves up to 1,654\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            and 63\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            speedup, as well as up to 5,305\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            and 422\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            energy efficiency boosts, compared to implementations on CPU and GPU, respectively. Graph-OPU outperforms state-of-the-art (SOTA) end-to-end overlay accelerators for GNN, reducing latency by an average of 1.36\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            and improving energy efficiency by 1.41\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            on average. Moreover, Graph-OPU exhibits an average 1.45\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            speed improvement in end-to-end latency over the SOTA GNN processor. Graph-OPU represents an in-depth study of an FPGA-based overlay processor for GNNs, offering high flexibility, speedup, and energy efficiency.\n          <\/jats:p>","DOI":"10.1145\/3691636","type":"journal-article","created":{"date-parts":[[2024,9,2]],"date-time":"2024-09-02T14:31:09Z","timestamp":1725287469000},"page":"1-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Graph-OPU: A Highly Flexible FPGA-Based Overlay Processor for Graph Neural Networks"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0539-8885","authenticated-orcid":false,"given":"Enhao","family":"Tang","sequence":"first","affiliation":[{"name":"School of Microelectronics, Fudan University, Shanghai Shi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9003-8966","authenticated-orcid":false,"given":"Shun","family":"Li","sequence":"additional","affiliation":[{"name":"School of Microelectronics, Fudan University, Shanghai Shi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6837-5675","authenticated-orcid":false,"given":"Ruiqi","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Microelectronics, Fudan University, Shanghai Shi, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-8421-1242","authenticated-orcid":false,"given":"Hao","family":"Zhou","sequence":"additional","affiliation":[{"name":"School of Microelectronics, Fudan University, Shanghai Shi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0958-6397","authenticated-orcid":false,"given":"Yuhanxiao","family":"Ma","sequence":"additional","affiliation":[{"name":"New York University, New York, NY, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4496-4752","authenticated-orcid":false,"given":"Haoyang","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Microelectronics, Fudan University, Shanghai Shi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4286-9292","authenticated-orcid":false,"given":"Jun","family":"Yu","sequence":"additional","affiliation":[{"name":"School of Microelectronics, Fudan University, Shanghai Shi, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7288-1789","authenticated-orcid":false,"given":"Kun","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Microelectronics, Fudan University, Shanghai Shi, China"}]}],"member":"320","published-online":{"date-parts":[[2024,11,18]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477141"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-021-00418-8"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM57271.2023.00049"},{"key":"e_1_3_1_5_2","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR2014 \u201914)","author":"Bruna Joan","year":"2014","unstructured":"Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. 2014. Spectral networks and locally connected networks on graphs. In Proceedings of the International Conference on Learning Representations (ICLR2014 \u201914), CBLS, April 2014."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL60245.2023.00039"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00092"},{"key":"e_1_3_1_8_2","first-page":"203","volume-title":"Proceedings of the 5th Conference on Robot Learning","volume":"164","author":"Deo Nachiket","year":"2022","unstructured":"Nachiket Deo, Eric Wolff, and Oscar Beijbom. 2022. Multimodal trajectory prediction conditioned on lane-graph traversals. In Proceedings of the 5th Conference on Robot Learning, PMLR 164 (2022), 203\u2013212. Retrieved from https:\/\/proceedings.mlr.press\/v164\/deo22a.html"},{"key":"e_1_3_1_9_2","volume-title":"Proceedings of the ICLR Workshop on Representation Learning on Graphs and Manifolds","author":"Fey Matthias","year":"2019","unstructured":"Matthias Fey and Jan E. Lenssen. 2019. Fast graph representation Learning with PyTorch geometric. In Proceedings of the ICLR Workshop on Representation Learning on Graphs and Manifolds."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO50266.2020.00079"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480113"},{"key":"e_1_3_1_12_2","unstructured":"GroqInc. 2020. The Challenge of Batch Size 1: Groq Adds Responsiveness to Inference Performance. Whitepaper. Retrieved from www.groq.com"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289185"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.5555\/3294771.3294869"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401063"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW52791.2021.00030"},{"key":"e_1_3_1_17_2","first-page":"22118","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"33","author":"Hu Weihua","year":"2020","unstructured":"Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 22118\u201322133. Retrieved from https:\/\/papers.neurips.cc\/paper_files\/paper\/2020\/hash\/fb60d411a5c5b72b2e7d3527cfc84fd0-Abstract.html"},{"key":"e_1_3_1_18_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Kipf Thomas N.","year":"2017","unstructured":"Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=SJU4ayYgl"},{"key":"e_1_3_1_19_2","first-page":"1","volume-title":"Proceedings of the IEEE\/ACM International Conference On Computer Aided Design (ICCAD \u201920)","author":"Liang Shengwen","year":"2020","unstructured":"Shengwen Liang, Cheng Liu, Ying Wang, Huawei Li, and Xiaowei Li. 2020. DeepBurning-GL: An automated framework for generating graph neural network accelerators. In Proceedings of the IEEE\/ACM International Conference On Computer Aided Design (ICCAD \u201920), 1\u20139."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC49654.2021.9622801"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevD.101.056019"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA56546.2023.10071015"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2008.2005605"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2004.27"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/2304576.2304624"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3550075"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2022.06.010"},{"key":"e_1_3_1_28_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Veli\u010dkovi\u0107 Petar","year":"2018","unstructured":"Petar Veli\u010dkovi\u0107, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li\u00f2, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=rJXMpikCZ"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3442381.3449786"},{"key":"e_1_3_1_30_2","unstructured":"Shaopeng Wei Yu Zhao Xingyan Chen Qing Li Fuzhen Zhuang Ji Liu Fuji Ren and Gang Kou. 2023. Graph learning and its advancements on large language models: A holistic survey. arXiv:2212.08966 [cs.AI]. Retrieved from https:\/\/arxiv.org\/abs\/2212.08966"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/1362622.1362674"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL57034.2022.00073"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2020.2978386"},{"key":"e_1_3_1_34_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Xu Keyulu","year":"2019","unstructured":"Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How powerful are graph neural networks?. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=ryGs6iA5Km"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM48280.2020.00014"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2020.2970395"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00012"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2019.2939726"},{"key":"e_1_3_1_39_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Zeng Hanqing","year":"2020","unstructured":"Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. 2020. GraphSAINT: Graph sampling based inductive learning method. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=BJe8pkHFwS"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM51124.2021.00012"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP49362.2020.00019"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2023.3287883"}],"container-title":["ACM Transactions on Reconfigurable Technology and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3691636","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3691636","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:09:40Z","timestamp":1750295380000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3691636"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,18]]},"references-count":41,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,12,31]]}},"alternative-id":["10.1145\/3691636"],"URL":"https:\/\/doi.org\/10.1145\/3691636","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,18]]},"assertion":[{"value":"2023-09-20","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-19","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}