{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,2]],"date-time":"2026-07-02T23:37:42Z","timestamp":1783035462266,"version":"3.54.6"},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2021,2,18]],"date-time":"2021-02-18T00:00:00Z","timestamp":1613606400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["1717532"],"award-info":[{"award-number":["1717532"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2021,2,18]]},"abstract":"<jats:p>As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Application resilience is evaluated via extensive fault injection campaigns based on sampling of an extensive fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. 
In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times, and 97.0 times on average) while keeping estimation errors to less than 1%.<\/jats:p>","DOI":"10.1145\/3447375","type":"journal-article","created":{"date-parts":[[2021,2,22]],"date-time":"2021-02-22T22:23:33Z","timestamp":1614032613000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["SUGAR"],"prefix":"10.1145","volume":"5","author":[{"given":"Lishan","family":"Yang","sequence":"first","affiliation":[{"name":"William & Mary, Williamsburg, VA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bin","family":"Nie","sequence":"additional","affiliation":[{"name":"William & Mary, Williamsburg, VA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Adwait","family":"Jog","sequence":"additional","affiliation":[{"name":"William & Mary, Williamsburg, VA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Evgenia","family":"Smirni","sequence":"additional","affiliation":[{"name":"William & Mary, Williamsburg, VA, 
USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,2,22]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"[n. d.]. CUDA-GDB. http:\/\/docs.nvidia.com\/cuda\/cuda-gdb\/#axzz4PHxjHEUB"},{"key":"e_1_2_1_2_1","unstructured":"[n. d.]. GP100 Pascal Whitepaper. https:\/\/images.nvidia.com\/content\/pdf\/tesla\/whitepaper\/pascal-architecturewhitepaper. pdf"},{"key":"e_1_2_1_3_1","unstructured":"[n. d.]. NVBitFI. https:\/\/github.com\/NVlabs\/nvbitfi."},{"key":"e_1_2_1_4_1","unstructured":"[n. d.]. NVIDIA Fermi Architecture Whitepaper. http:\/\/www.nvidia.com\/content\/pdf\/fermi_white_papers\/nvidia_ fermi_compute_architecture_whitepaper.pdf"},{"key":"e_1_2_1_5_1","unstructured":"[n. d.]. NVIDIA Kepler GK110 Architecture Whitepaper."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919648"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2018.00066"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-14325-5_47"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356177"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463209.2488859"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2009.4798244"},{"key":"e_1_2_1_13_1","volume-title":"Medical image processing on the GPU--Past, present and future. Medical image analysis 17, 8","author":"Eklund Anders","year":"2013","unstructured":"Anders Eklund, Paul Dufort, Daniel Forsberg, and Stephen M LaConte. 2013. Medical image processing on the GPU--Past, present and future. 
Medical image analysis 17, 8 (2013), 1073--1094."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2014.6844486"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2018.00015"},{"key":"e_1_2_1_16_1","unstructured":"Qian Gong, Phil DeMar, and Wenji Wu. 2017. Deep Packet\/Flow Analysis using GPUs. Technical Report."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/InPar.2012.6339595"},{"key":"e_1_2_1_18_1","volume-title":"Relyzer: Exploiting application-level fault equivalence to analyze application resiliency to transient faults. In ACM SIGPLAN Notices","author":"Sastry Hari Siva Kumar","year":"2012","unstructured":"Siva Kumar Sastry Hari, Sarita V Adve, Helia Naeimi, and Pradeep Ramachandran. 2012. Relyzer: Exploiting application-level fault equivalence to analyze application resiliency to transient faults. In ACM SIGPLAN Notices, Vol. 47. ACM, 123--134."},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the Workshop on Silicon Errors in Logic-System Effects.","author":"Sastry Hari Siva Kumar","year":"2015","unstructured":"Siva Kumar Sastry Hari, Timothy Tsai, Mark Stephenson, Stephen W Keckler, and Joel Emer. 2015. SASSIFI: Evaluating resilience of GPU applications. In Proceedings of the Workshop on Silicon Errors in Logic-System Effects."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.59"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2019.00025"},{"key":"e_1_2_1_22_1","volume-title":"Asit K Mishra, Mahmut T Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R Das.","author":"Jog Adwait","year":"2013","unstructured":"Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K Mishra, Mahmut T Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R Das. 2013. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance. In ACM SIGPLAN Notices, Vol. 48. 
ACM, 395--406."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080225"},{"key":"e_1_2_1_24_1","unstructured":"David B Kirk and W Hwu Wen-Mei. 2016. Programming massively parallel processors: a hands-on approach. Morgan Kaufmann."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2018.00038"},{"key":"e_1_2_1_26_1","volume-title":"SC16: International Conference for. IEEE, 240--251","author":"Li Guanpeng","year":"2016","unstructured":"Guanpeng Li, Karthik Pattabiraman, Chen-Yang Cher, and Pradip Bose. 2016. Understanding error propagation in GPGPU applications. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for. IEEE, 240--251."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2018.00016"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","unstructured":"Abdulrahman Mahmoud Neeraj Aggarwal Alex Nobbe Jose Vicarte Sarita Adve Christopher Fletcher Iuri Frosio and Siva Hari. 2020. PyTorchFI: A Runtime Perturbation Tool for DNNs. 25--31. https:\/\/doi.org\/10.1109\/DSNW50199.2020.00014","DOI":"10.1109\/DSNW50199.2020.00014"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304050"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2012.6237024"},{"key":"e_1_2_1_31_1","volume-title":"Characterizing Accuracy-Aware Resilience of GPGPU Applications. In 20th IEEE\/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020","author":"Nie Bin","year":"2020","unstructured":"Bin Nie, Adwait Jog, and Evgenia Smirni. 2020. Characterizing Accuracy-Aware Resilience of GPGPU Applications. In 20th IEEE\/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGRID 2020, Melbourne, Australia, May 11--14, 2020. 
IEEE, 111--120."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2016.7446091"},{"key":"e_1_2_1_33_1","volume-title":"MASCOTS","author":"Nie Bin","year":"2017","unstructured":"Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. [n. d.]. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. In MASCOTS 2017. 22--31."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2018.00022"},{"key":"e_1_2_1_35_1","volume-title":"Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications. In 2018 51st Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). IEEE, 749--761","author":"Nie Bin","year":"2018","unstructured":"Bin Nie, Lishan Yang, Adwait Jog, and Evgenia Smirni. 2018. Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications. In 2018 51st Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). IEEE, 749--761."},{"key":"e_1_2_1_36_1","unstructured":"NVIDIA. [n. d.]. Computational Finance. http:\/\/www.nvidia.com\/object\/computational_finance.html"},{"key":"e_1_2_1_37_1","unstructured":"NVIDIA. [n. d.]. Researchers Deploy GPUs to Build World's Largest Artificial Neural Network. https:\/\/nvidianews. nvidia.com\/news\/researchers-deploy-gpus-to-build-world-s-largest-artificial-neural-network"},{"key":"e_1_2_1_38_1","unstructured":"NVIDIA. 2011. CUDA C\/C++ SDK Code Samples. http:\/\/developer.nvidia.com\/cuda-cc-sdk-code-samples"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/IEDM.2008.4796702"},{"key":"e_1_2_1_40_1","volume-title":"GPU computing in medical physics: A review. Medical physics 38, 5","author":"Pratx Guillem","year":"2011","unstructured":"Guillem Pratx and Lei Xing. 2011. GPU computing in medical physics: A review. 
Medical physics 38, 5 (2011), 2685--2697."},{"key":"e_1_2_1_41_1","volume-title":"Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 308--311","author":"Previlon Fritz G","year":"2019","unstructured":"Fritz G Previlon, Charu Kalra, Devesh Tiwari, and David R Kaeli. 2019. PCFI: Program Counter Guided Fault Injection for Accelerating GPU Reliability Assessment. In 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 308--311."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2017.30"},{"key":"e_1_2_1_43_1","unstructured":"Hamid Sarbazi-Azad. 2016. Advances in GPU Research and Practice. Morgan Kaufmann."},{"key":"e_1_2_1_44_1","volume-title":"Wall street accelerates options analysis with GPU technology. Wall Street Technology 11","author":"Schmerken I","year":"2009","unstructured":"I Schmerken. 2009. Wall street accelerates options analysis with GPU technology. Wall Street Technology 11 (2009)."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/1366224.1366225"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2009.4798243"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2008.05.013"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2016.7482077"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.5555\/3195638.3195689"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358307"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00062"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00114"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2020.2980541"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/MDAT.2016.2630270"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.36"}],"container-title":["Proceedings of the ACM on Measurement and 
Analysis of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3447375","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3447375","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3447375","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:46:56Z","timestamp":1750193216000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3447375"}},"subtitle":["Speeding Up GPGPU Application Resilience Estimation with Input Sizing"],"short-title":[],"issued":{"date-parts":[[2021,2,18]]},"references-count":55,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,2,18]]}},"alternative-id":["10.1145\/3447375"],"URL":"https:\/\/doi.org\/10.1145\/3447375","relation":{},"ISSN":["2476-1249"],"issn-type":[{"value":"2476-1249","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,2,18]]},"assertion":[{"value":"2021-02-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}