{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T17:59:05Z","timestamp":1771955945304,"version":"3.50.1"},"reference-count":51,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2021,2,9]],"date-time":"2021-02-09T00:00:00Z","timestamp":1612828800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001459","name":"Singapore Ministry of Education","doi-asserted-by":"crossref","award":["T1-251RES1818"],"award-info":[{"award-number":["T1-251RES1818"]}],"id":[{"id":"10.13039\/501100001459","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,6,30]]},"abstract":"<jats:p>This article presents GRAM (GPU-based Runtime Adaption for Mixed-precision), a framework for the effective use of mixed-precision arithmetic in CUDA programs. Our method provides a fine-grained tradeoff between output error and performance. It can create many variants that satisfy different accuracy requirements by assigning different groups of threads to different precision levels <jats:italic>adaptively at runtime<\/jats:italic>. To widen the range of applications that can benefit from its approximation, GRAM comes with an optional half-precision approximate math library. 
Using GRAM, we can trade off precision for performance improvements of up to 540%, depending on the application and accuracy requirement.<\/jats:p>","DOI":"10.1145\/3441830","type":"journal-article","created":{"date-parts":[[2021,2,10]],"date-time":"2021-02-10T14:29:54Z","timestamp":1612967394000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["GRAM"],"prefix":"10.1145","volume":"18","author":[{"given":"Nhut-Minh","family":"Ho","sequence":"first","affiliation":[{"name":"National University of Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Himeshi De","family":"silva","sequence":"additional","affiliation":[{"name":"National University of Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Weng-Fai","family":"Wong","sequence":"additional","affiliation":[{"name":"National University of Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,2,9]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/SIPS.2009.5336225"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3388785"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3093333.3009846"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3316482.3326341"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/775832.775960"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/T-C.1970.223019"},{"key":"e_1_2_1_8_1","unstructured":"
Stef Graillat Fabienne J\u00e9z\u00e9quel Romain Picot Fran\u00e7ois F\u00e9votte and Bruno Lathuili\u00e8re. 2016. Auto-tuning for floating-point precision with discrete stochastic arithmetic. https:\/\/hal.archives-ouvertes.fr\/hal-01331917."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3213846.3213862"},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the International Conference on Machine Learning. 1737--1746","author":"Gupta Suyog","year":"2015","unstructured":"Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In Proceedings of the International Conference on Machine Learning. 1737--1746."},{"key":"e_1_2_1_11_1","volume-title":"Higham","author":"Haidar Azzam","year":"2018","unstructured":"Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham. 2018. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE Press, 47."},{"key":"e_1_2_1_12_1","unstructured":"Mark Harris. 2016. Mixed-Precision Programming with CUDA 8. 
Retrieved from https:\/\/devblogs.nvidia.com\/mixed-precision-programming-cuda-8\/."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASPDAC.2017.7858297"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.23919\/DATE51398.2021.9473933"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.23919\/DATE.2019.8714785"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC.2017.8091072"},{"key":"e_1_2_1_17_1","unstructured":"Google Inc. 2020. Google Cloud Platform. Retrieved from https:\/\/cloud.google.com\/nvidia\/."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/3168451.3168462"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783722"},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","unstructured":"Ian Karlin Jeff Keasler and J. R. Neely. 2013. Lulesh 2.0 Updates and Changes. Technical Report. Lawrence Livermore National Lab (LLNL) Livermore CA.","DOI":"10.2172\/1090032"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3330345.3330360"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-20656-7_12"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2006.873887"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2925426.2926255"},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the International Conference on Machine Learning. 2849--2858","author":"Lin Darryl","year":"2016","unstructured":"
Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. 2016. Fixed point quantization of deep convolutional networks. In Proceedings of the International Conference on Machine Learning. 2849--2858."},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the International Symposium on Code Generation and Optimization. ACM, 278--287","author":"Maier Daniel","year":"2018","unstructured":"Daniel Maier, Biagio Cosenza, and Ben Juurlink. 2018. Local memory-aware kernel perforation. In Proceedings of the International Symposium on Code Generation and Optimization. ACM, 278--287."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00051"},{"key":"e_1_2_1_28_1","volume-title":"et\u00a0al","author":"Micikevicius Paulius","year":"2017","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et\u00a0al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017)."},{"key":"e_1_2_1_29_1","unstructured":"Paul Mineiro. 2017. fastapprox. Retrieved from https:\/\/github.com\/romeric\/fastapprox."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2893356"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3316781.3317915"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3400302.3415637"},{"key":"e_1_2_1_33_1","unstructured":"Nvidia. 
2020. CUDA half2 math functions. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cuda-math-api\/."},{"key":"e_1_2_1_34_1","unstructured":"Nvidia. 2020. CUDA instruction throughput. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#arithmetic-instructions."},{"key":"e_1_2_1_35_1","unstructured":"Nvidia. 2020. CUDA Samples Reference Manual. Retrieved from https:\/\/docs.nvidia.com\/pdf\/CUDA_Samples.pdf."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357250"},{"key":"e_1_2_1_37_1","unstructured":"Matthew Robertson. 2012. A Brief History of invsqrt. Department of Computer Science 8 Applied Statistics."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2884781.2884850"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503296"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540711"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1993498.1993518"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/HiPCW.2018.8634417"},{"key":"e_1_2_1_43_1","unstructured":"Ian Stephenson. 2020. A Fast Power Function. Retrieved from http:\/\/www.dctsystems.co.uk\/Software\/power.html."},{"key":"e_1_2_1_44_1","volume-title":"Geng Daniel Liu, and Wen-mei W. Hwu","author":"Stratton John A.","year":"2012","unstructured":"John A. 
Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Cent. Reliab. High-perf. Comput. 127 (2012)."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.23919\/DATE.2018.8342167"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2744769.2751163"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2016.7446063"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/MDAT.2015.2505723"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/MDAT.2016.2630270"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830810"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/MDAT.2015.2504899"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3441830","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3441830","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:30Z","timestamp":1750195470000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3441830"}},"subtitle":["A Framework for Dynamically Mixing Precisions in GPU 
Applications"],"short-title":[],"issued":{"date-parts":[[2021,2,9]]},"references-count":51,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2021,6,30]]}},"alternative-id":["10.1145\/3441830"],"URL":"https:\/\/doi.org\/10.1145\/3441830","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,2,9]]},"assertion":[{"value":"2020-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-02-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}