{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T15:13:54Z","timestamp":1759331634031,"version":"build-2065373602"},"reference-count":58,"publisher":"Association for Computing Machinery (ACM)","issue":"5s","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2025,11,30]]},"abstract":"<jats:p>Hardware reliability has emerged as a paramount concern for machine learning accelerators, as transient errors and permanent failures occurring during inference can severely compromise accuracy, performance, and service availability. Although fault resilience in traditional machine learning, such as Deep Neural Networks (DNNs), has been extensively studied, graph convolutional networks (GCNs) present unique reliability challenges due to their irregular computation patterns and dynamic data dependencies. Traditional fault mitigation approaches, including hardware redundancy, recomputation, and Hamming code protection, suffer from prohibitive latency and power overheads when applied to GCN accelerators. This article presents FORT-GCN, a holistic hardware architecture co-optimized for GCN-specific fault resilience. Our solution integrates three key innovations, namely permanent fault tolerance through a novel robust processing element design with runtime reconfiguration and defect-adaptive interconnects, transient error resilience via lightweight selective error correction unit design, and a fault-aware adaptive controller design that dynamically adjusts fault protection strategies based on operational faults and graph characteristics. 
Experimental evaluation demonstrates 35.4% improvement in fault robustness compared to conventional error-correction and redundancy-based approaches, with minimal timing, area, and power overheads.<\/jats:p>","DOI":"10.1145\/3758094","type":"journal-article","created":{"date-parts":[[2025,8,2]],"date-time":"2025-08-02T11:18:12Z","timestamp":1754133492000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["FORT-GCN: A\n            <u>F<\/u>\n            ault-T\n            <u>o<\/u>\n            le\n            <u>r<\/u>\n            ant and Adap\n            <u>t<\/u>\n            ive Accelerator Design for Efficient Graph Convolutional Network Inference"],"prefix":"10.1145","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7189-9293","authenticated-orcid":false,"given":"Ke","family":"Wang","sequence":"first","affiliation":[{"name":"Electrical and Computer Engineering, The University of North Carolina at Charlotte","place":["Charlotte, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-5776-6239","authenticated-orcid":false,"given":"Yingnan","family":"Zhao","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, The George Washington University","place":["Washington, United States"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4262-6688","authenticated-orcid":false,"given":"Ahmed","family":"Louri","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, The George Washington University","place":["Washington, United States"]}]}],"member":"320","published-online":{"date-parts":[[2025,9,26]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.3390\/ijgi10070485"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.5555\/1950815.1950906"},{"key":"e_1_3_1_4_2","first-page":"1","volume-title":"Proceedings of the 2016 10th IEEE\/ACM International Symposium on Networks-on-Chip 
(NOCS)","author":"Chen Xiaowen","year":"2016","unstructured":"Xiaowen Chen, Zhonghai Lu, Yuanwu Lei, Yaohua Wang, and Shenggang Chen. 2016. Multi-bit transient fault control for NoC links using 2D fault coding method. In Proceedings of the 2016 10th IEEE\/ACM International Symposium on Networks-on-Chip (NOCS). IEEE, 1\u20138."},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2019.12.130"},{"key":"e_1_3_1_6_2","first-page":"593","volume-title":"Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE)","author":"Deng Jiachao","year":"2015","unstructured":"Jiachao Deng, Yuntan Fang, Zidong Du, Ying Wang, Huawei Li, Olivier Temam, Paolo Ienne, David Novo, Xiaowei Li, Yunji Chen, et\u00a0al. 2015. Retraining-based timing error mitigation for hardware neural networks. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 593\u2013596."},{"key":"e_1_3_1_7_2","first-page":"320","volume-title":"Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA\u201914)","author":"DiTomaso Dominic","year":"2014","unstructured":"Dominic DiTomaso, Avinash Kodi, and Ahmed Louri. 2014. QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA\u201914). 320\u2013331."},{"key":"e_1_3_1_8_2","unstructured":"Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with PyTorch geometric. 
Retrieved from https:\/\/arxiv.org\/abs\/1903.02428"},{"key":"e_1_3_1_9_2","doi-asserted-by":"crossref","first-page":"922","DOI":"10.1109\/MICRO50266.2020.00079","volume-title":"Proceedings of the 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO)","author":"Geng Tong","year":"2020","unstructured":"Tong Geng, Ang Li, Runbin Shi, Chunshu Wu, Tianqi Wang, Yanfei Li, Pouya Haghi, Antonino Tumeo, Shuai Che, Steve Reinhardt, et\u00a0al. 2020. AWB-GCN: A graph convolutional network accelerator with runtime workload rebalancing. In Proceedings of the 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). IEEE, 922\u2013936."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480113"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579371.3589105"},{"key":"e_1_3_1_12_2","unstructured":"Mark Horowitz. 2014. Energy table for 45nm process. In Stanford VLSI Wiki."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2006.876103"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3285215"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSI.2020.3010743"},{"key":"e_1_3_1_16_2","unstructured":"Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. Retrieved from https:\/\/arxiv.org\/abs\/1609.02907"},{"issue":"10","key":"e_1_3_1_17_2","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1109\/C-M.1976.218410","article-title":"Special feature: Semiconductor memory reliability with error detecting and correcting codes","volume":"9","author":"Levine Len","year":"1976","unstructured":"Len Levine and Ware Meyers. 1976. Special feature: Semiconductor memory reliability with error detecting and correcting codes. 
Computer 9, 10 (1976), 43\u201350.","journal-title":"Computer"},{"key":"e_1_3_1_18_2","first-page":"1","volume-title":"Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics","author":"Li Bingjun","year":"2021","unstructured":"Bingjun Li, Tianyu Wang, and Sheida Nabavi. 2021. Cancer molecular subtype classification by graph convolutional networks on multi-omics data. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. 1\u20139."},{"key":"e_1_3_1_19_2","first-page":"775","volume-title":"Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)","author":"Li Jiajun","year":"2021","unstructured":"Jiajun Li, Ahmed Louri, Avinash Karanth, and Razvan Bunescu. 2021. GCNAX: A flexible and energy-efficient accelerator for graph convolutional neural networks. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 775\u2013788."},{"key":"e_1_3_1_20_2","first-page":"2834","article-title":"SGCNAX: A scalable graph convolutional neural network accelerator with workload balancing","author":"Li Jiajun","year":"2022","unstructured":"Jiajun Li, Hao Zheng, Ke Wang, and Ahmed Louri. 2022. SGCNAX: A scalable graph convolutional neural network accelerator with workload balancing. IEEE Transactions on Parallel & Distributed Systems 33, 11 (2021), 2834\u20132845.","journal-title":"IEEE Transactions on Parallel & Distributed Systems"},{"key":"e_1_3_1_21_2","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1109\/ITC-Asia.2019.00039","volume-title":"Proceedings of the 2019 IEEE International Test Conference in Asia (ITC-Asia)","author":"Li Li","year":"2019","unstructured":"Li Li, Dawen Xu, Kouzi Xing, Cheng Liu, Ying Wang, Huawei Li, and Xiaowei Li. 2019. Squeezing the last MHz for CNN acceleration on FPGAs. 
In Proceedings of the 2019 IEEE International Test Conference in Asia (ITC-Asia). IEEE, 151\u2013156."},{"key":"e_1_3_1_22_2","first-page":"1","volume-title":"Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN)","author":"Li Yangyang","year":"2021","unstructured":"Yangyang Li, Yipeng Ji, Shaoning Li, Shulong He, Yinhao Cao, Yifeng Liu, Hong Liu, Xiong Li, Jun Shi, and Yangchao Yang. 2021. Relevance-aware anomalous users detection in social network via graph neural network. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 1\u20138."},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2020.3040772"},{"key":"e_1_3_1_24_2","first-page":"1","volume-title":"Proceedings of the 2019 IEEE\/ACM International Conference on Computer-Aided Design (ICCAD)","author":"Liu Tao","year":"2019","unstructured":"Tao Liu and Wujie Wen. 2019. Making the fault-tolerance of emerging neural network accelerators scalable. In Proceedings of the 2019 IEEE\/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1\u20135."},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3340531.3411894"},{"issue":"2","key":"e_1_3_1_26_2","doi-asserted-by":"crossref","first-page":"21","DOI":"10.1109\/MDT.1985.294856","article-title":"Built-in self-test techniques","volume":"2","author":"McCluskey Edward J.","year":"1985","unstructured":"Edward J. McCluskey. 1985. Built-in self-test techniques. IEEE Design & Test of Computers 2, 2 (1985), 21\u201328.","journal-title":"IEEE Design & Test of Computers"},{"key":"e_1_3_1_27_2","article-title":"CACTI 6.0: A tool to understand large caches","author":"Muralimanohar Naveen","year":"2009","unstructured":"Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. 2009. CACTI 6.0: A tool to understand large caches. 
HP Laboratories 27 (2009), 28.","journal-title":"HP Laboratories"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/24.994913"},{"key":"e_1_3_1_29_2","doi-asserted-by":"crossref","first-page":"590","DOI":"10.1109\/TEST.2002.1041810","volume-title":"Proceedings of the International Test Conference","author":"Parvathala Praveen","year":"2002","unstructured":"Praveen Parvathala, Kaila Maneparambil, and William Lindsay. 2002. FRITS-a microprocessor functional BIST method. In Proceedings of the International Test Conference. IEEE, 590\u2013598."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3592798"},{"issue":"3","key":"e_1_3_1_31_2","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1145\/3007787.3001165","article-title":"Minerva: Enabling low-power, highly-accurate deep neural network accelerators","volume":"44","author":"Reagen Brandon","year":"2016","unstructured":"Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, Jos\u00e9 Miguel Hern\u00e1ndez-Lobato, Gu-Yeon Wei, and David Brooks. 2016. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. ACM SIGARCH Computer Architecture News 44, 3 (2016), 267\u2013278.","journal-title":"ACM SIGARCH Computer Architecture News"},{"key":"e_1_3_1_32_2","first-page":"322","volume-title":"Proceedings of the 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","author":"Salami Behzad","year":"2018","unstructured":"Behzad Salami, Osman S. Unsal, and Adrian Cristal Kestelman. 2018. On the resilience of RTL NN accelerators: Fault characterization and mitigation. In Proceedings of the 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 
IEEE, 322\u2013329."},{"volume-title":"Proceedings of the International Symposium on Applied Reconfigurable Computing (ARC\u201917)","year":"2017","key":"e_1_3_1_33_2","unstructured":"Andr\u00e9 Flores dos Santos, Lucas Antunes Tambara, Fabio Benevenuti, Jorge Tonfat, and Fernanda Lima Kastensmidt. 2017. Applying TMR in hardware accelerators generated by high-level synthesis design flow for mitigating multiple bit upsets in SRAM-based FPGAs. In Proceedings of the International Symposium on Applied Reconfigurable Computing (ARC\u201917). Springer."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSM.2007.913186"},{"key":"e_1_3_1_35_2","doi-asserted-by":"crossref","first-page":"1099","DOI":"10.1109\/HPCA56546.2023.10071015","volume-title":"Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)","author":"Sarkar Rishov","year":"2023","unstructured":"Rishov Sarkar, Stefan Abi-Karam, Yuqi He, Lakshmi Sathidevi, and Cong Hao. 2023. FlowGNN: A dataflow architecture for real-time workload-agnostic graph neural network inference. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 1099\u20131112."},{"key":"e_1_3_1_36_2","doi-asserted-by":"crossref","first-page":"208","DOI":"10.1109\/FTCS.1999.781052","volume-title":"Digest of Papers. Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing (Cat. No. 99CB36352)","author":"Steininger Andreas","year":"1999","unstructured":"Andreas Steininger and Christoph Scherrer. 1999. On the necessity of on-line-BIST in safety-critical applications-a case-study. In Digest of Papers. Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing (Cat. No. 99CB36352). 
IEEE, 208\u2013215."},{"key":"e_1_3_1_37_2","first-page":"110","volume-title":"Proceedings of the 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC)","author":"Takanami Itsuo","year":"2017","unstructured":"Itsuo Takanami and Masaru Fukushi. 2017. A built-in circuit for self-repairing mesh-connected processor arrays with spares on diagonal. In Proceedings of the 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC). IEEE, 110\u2013117."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/PRDC.2012.11"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2017.2742698"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/2024724.2024929"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2020.3003576"},{"key":"e_1_3_1_42_2","first-page":"1","volume-title":"Proceedings of the International Conference on Internet Multimedia Computing and Service (ICIMCS\u201915)","author":"Wang Ruochen","year":"2015","unstructured":"Ruochen Wang and Zhe Xu. 2015. A pedestrian and vehicle rapid identification model based on convolutional neural network. In Proceedings of the International Conference on Internet Multimedia Computing and Service (ICIMCS\u201915). 1\u20134."},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.5555\/2971808.2972080"},{"key":"e_1_3_1_44_2","unstructured":"Le Wu Peijie Sun Richang Hong Yanjie Fu Xiting Wang and Meng Wang. 2019. SocialGCN: An efficient graph convolutional network based model for social recommendation. 
https:\/\/arxiv.org\/abs\/1811.02815"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2997722"},{"key":"e_1_3_1_46_2","first-page":"99","volume-title":"Proceedings of the 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"2160","author":"Xu Dawen","year":"2019","unstructured":"Dawen Xu, Kouzi Xing, Cheng Liu, Ying Wang, Yulin Dai, Long Cheng, Huawei Li, and Lei Zhang. 2019. Resilient neural network training for accelerators with computing errors. In Proceedings of the 2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), Vol. 2160. IEEE, 99\u2013102."},{"key":"e_1_3_1_47_2","first-page":"15","volume-title":"Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","author":"Yan Mingyu","year":"2020","unstructured":"Mingyu Yan, Lei Deng, Xing Hu, Ling Liang, Yujing Feng, Xiaochun Ye, Zhimin Zhang, Dongrui Fan, and Yuan Xie. 2020. HyGCN: A GCN accelerator with hybrid architecture. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 15\u201329."},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3219819.3219890"},{"key":"e_1_3_1_49_2","first-page":"460","volume-title":"Proceedings of the 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)","author":"You Haoran","year":"2022","unstructured":"Haoran You, Tong Geng, Yongan Zhang, Ang Li, and Yingyan Lin. 2022. GCoD: Graph convolutional network acceleration via dedicated algorithm and accelerator co-design. In Proceedings of the 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 
IEEE, 460\u2013474."},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/MDAT.2019.2915656"},{"key":"e_1_3_1_51_2","unstructured":"Susan Zhang Stephen Roller Naman Goyal Mikel Artetxe Moya Chen Shuohui Chen Christopher Dewan Mona Diab Xian Li Xi Victoria Lin Todor Mihaylov Myle Ott Sam Shleifer Kurt Shuster Daniel Simig Punit Singh Koura Anjali Sridhar Tianlu Wang and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models. arXiv:2205.01068. Retrieved from https:\/\/arxiv.org\/abs\/2205.01068"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3485447.3512159"},{"key":"e_1_3_1_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/TITS.2019.2935152"},{"key":"e_1_3_1_54_2","first-page":"545","volume-title":"Proceedings of the 2022 IEEE 40th International Conference on Computer Design (ICCD)","author":"Zhao Yingnan","year":"2022","unstructured":"Yingnan Zhao, Ke Wang, and Ahmed Louri. 2022. FSA: An efficient fault-tolerant systolic array-based DNN accelerator architecture. In Proceedings of the 2022 IEEE 40th International Conference on Computer Design (ICCD). IEEE, 545\u2013552."},{"key":"e_1_3_1_55_2","doi-asserted-by":"crossref","unstructured":"Yingnan Zhao Ke Wang and Ahmed Louri. 2024. OPT-GCN: A unified and scalable chiplet-based accelerator for high-performance and energy-efficient GCN computation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43 12 (2024) 4827\u20134840.","DOI":"10.1109\/TCAD.2024.3401543"},{"key":"e_1_3_1_56_2","article-title":"HS-GCN: A high-performance, sustainable, and scalable chiplet-based accelerator for graph convolutional network inference","author":"Zhao Yingnan","year":"2025","unstructured":"Yingnan Zhao, Ke Wang, and Ahmed Louri. 2025. HS-GCN: A high-performance, sustainable, and scalable chiplet-based accelerator for graph convolutional network inference. IEEE Transactions on Sustainable Computing (2025). 
https:\/\/ieeexplore.ieee.org\/abstract\/document\/11018459?casa_token=i9zqhor5di4AAAAA:ORUe_gyQJ-KXEfYcIR8dBA_8D711N4nCsONa1XGHA7o1r_QK2rJxbG3PFdwblh0xQWrEn1lQYA","journal-title":"IEEE Transactions on Sustainable Computing"},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3442381.3449835"},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2019.2935042"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330851"}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3758094","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,30]],"date-time":"2025-09-30T13:45:12Z","timestamp":1759239912000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3758094"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,26]]},"references-count":58,"journal-issue":{"issue":"5s","published-print":{"date-parts":[[2025,11,30]]}},"alternative-id":["10.1145\/3758094"],"URL":"https:\/\/doi.org\/10.1145\/3758094","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"type":"print","value":"1539-9087"},{"type":"electronic","value":"1558-3465"}],"subject":[],"published":{"date-parts":[[2025,9,26]]},"assertion":[{"value":"2025-07-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-07-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-26","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}