{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:21:49Z","timestamp":1750220509151,"version":"3.41.0"},"reference-count":80,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2021,1,20]],"date-time":"2021-01-20T00:00:00Z","timestamp":1611100800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100006602","name":"AFRL","doi-asserted-by":"crossref","award":["FA9550-18-1-0166"],"award-info":[{"award-number":["FA9550-18-1-0166"]}],"id":[{"id":"10.13039\/100006602","id-type":"DOI","asserted-by":"crossref"}]},{"name":"NSF","award":["CCF-1628384, CCF-1813434, and CCF-2010830"],"award-info":[{"award-number":["CCF-1628384, CCF-1813434, and CCF-2010830"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,3,31]]},"abstract":"<jats:p>Sequential consistency (SC) is the most intuitive memory consistency model and the easiest for programmers and hardware designers to reason about. However, the strict memory ordering restrictions imposed by SC make it less attractive from a performance standpoint. Additionally, prior high-performance SC implementations required complex hardware structures to support speculation and recovery.<\/jats:p>\n          <jats:p>\n            In this article, we introduce the lockstep SC consistency model (LSC), a new memory model based on SC but carefully defined to accommodate the data parallel lockstep execution paradigm of GPUs. We also describe an efficient LSC implementation for an APU system-on-chip (SoC) and show that our implementation performs close to the baseline relaxed model. 
Evaluation of our implementation shows that the geometric mean performance cost for lockstep SC is just 0.76% for GPU execution and 6.11% for the entire APU SoC compared to a baseline with a weaker memory consistency model. Adoption of LSC in future APU and SoC designs will reduce the burden on programmers trying to write correct parallel programs, while also simplifying the implementation and verification of systems with heterogeneous processing elements and complex memory hierarchies.\n            <jats:sup>1<\/jats:sup>\n          <\/jats:p>","DOI":"10.1145\/3428153","type":"journal-article","created":{"date-parts":[[2021,1,20]],"date-time":"2021-01-20T17:26:38Z","timestamp":1611163598000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Systems-on-Chip with Strong Ordering"],"prefix":"10.1145","volume":"18","author":[{"given":"Sooraj","family":"Puthoor","sequence":"first","affiliation":[{"name":"University of Wisconsin\u2014Madison, AMD Research, Southwest Pkwy, Austin"}]},{"given":"Mikko H.","family":"Lipasti","sequence":"additional","affiliation":[{"name":"University of Wisconsin\u2014Madison, Madison, WI"}]}],"member":"320","published-online":{"date-parts":[[2021,1,20]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2017. Inside Volta: The World\u2019s Most Advanced Data Center GPU. Retrieved from https:\/\/devblogs.nvidia.com\/inside-volta\/."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.546611"},{"volume-title":"Proceedings of the 1990 International Conference on Parallel Processing. 47--50","author":"Adve Sarita V.","key":"e_1_2_1_4_1","unstructured":"Sarita V. Adve and Mark D. Hill. 1990. Implementing sequential consistency in cache-based systems. 
In Proceedings of the 1990 International Conference on Parallel Processing. 47--50."},{"volume-title":"Proceedings of the 17th Annual International Symposium on Computer Architecture. 2--14","author":"Adve S. V.","key":"e_1_2_1_5_1","unstructured":"S. V. Adve and M. D. Hill. 1990. Weak ordering-a new definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture. 2--14."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 17th Annual International Symposium on Computer Architecture. 2--14","author":"Adve S. V.","year":"1990","unstructured":"S. V. Adve and M. D. Hill. 1990. Weak ordering-a new definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture. 2--14. DOI:https:\/\/doi.org\/10.1109\/ISCA.1990.134502"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2775054.2694391"},{"key":"e_1_2_1_8_1","unstructured":"Jade Alglave, Daniel Kroening, Vincent Nimal, and Michael Tautschnig. 2012. Software verification for weak memory via program transformation. arxiv:1207.7264. 
Retrieved from http:\/\/arxiv.org\/abs\/1207.7264."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-22110-1_6"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/2627752"},{"volume-title":"Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO-49)","author":"Alsop Johnathan","key":"e_1_2_1_11_1","unstructured":"Johnathan Alsop, Marc S. Orr, Bradford M. Beckmann, and David A. Wood. 2016. Lazy release consistency for GPUs. In Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO-49). IEEE Press, Piscataway, NJ, Article 26, 13 pages."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00031"},{"key":"e_1_2_1_13_1","unstructured":"AMD. [n.d.]. Compute Apps. Retrieved from https:\/\/github.com\/AMDComputeLibraries\/ComputeApps."},{"key":"e_1_2_1_14_1","unstructured":"AMD. [n.d.]. HC. Retrieved from https:\/\/rocm.github.io\/languages.html."},{"key":"e_1_2_1_15_1","unstructured":"AMD. [n.d.]. HCC Example Apps. Retrieved from https:\/\/github.com\/ROCm-Developer-Tools\/HCC-Example-Application."},{"key":"e_1_2_1_16_1","unstructured":"AMD. 2012. AMD Graphics Cores NEXT (GCN) Architecture. Retrieved from https:\/\/goo.gl\/GPvy8R."},{"key":"e_1_2_1_17_1","unstructured":"AMD. 2016. 
AMD GCN3 ISA Architecture Manual. Retrieved from https:\/\/gpuopen.com\/compute-product\/amd-gcn3-isa-architecture-manual."},{"key":"e_1_2_1_18_1","unstructured":"AMD. 2016. Dissecting the Polaris Architecture. Retrieved from https:\/\/goo.gl\/hNrZZo."},{"key":"e_1_2_1_19_1","unstructured":"AMD. 2019. User Guide for AMDGPU Backend. Retrieved from https:\/\/llvm.org\/docs\/AMDGPUUsage.html."},{"key":"e_1_2_1_20_1","unstructured":"ARM. [n.d.]. ARM Architecture Reference Manual ARMv8 for ARMv8-A architecture profile. Retrieved from https:\/\/developer.arm.com\/docs\/ddi0487\/latest\/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1706299.1706303"},{"volume-title":"Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA\u201909)","author":"Blundell Colin","key":"e_1_2_1_22_1","unstructured":"Colin Blundell, Milo M. K. Martin, and Thomas F. Wenisch. 2009. InvisiFence: Performance-transparent memory ordering in conventional multiprocessors. 
In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA\u201909). ACM, New York, NY, 233--244. DOI:https:\/\/doi.org\/10.1145\/1555754.1555785"},{"volume-title":"Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201908)","author":"Boehm Hans-J.","key":"e_1_2_1_23_1","unstructured":"Hans-J. Boehm and Sarita V. Adve. 2008. Foundations of the C++ concurrency memory model. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201908). Association for Computing Machinery, New York, NY, 68--78. DOI:https:\/\/doi.org\/10.1145\/1375581.1375591"},{"volume-title":"Proceedings of the 2014 IEEE Hot Chips 26 Symposium (HCS\u201914)","author":"Bouvier D.","key":"e_1_2_1_24_1","unstructured":"D. Bouvier and B. Sander. 2014. Applying AMD\u2019s Kaveri APU for heterogeneous computing. In Proceedings of the 2014 IEEE Hot Chips 26 Symposium (HCS\u201914)."},{"volume-title":"Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201907)","author":"Burckhardt Sebastian","key":"e_1_2_1_25_1","unstructured":"Sebastian Burckhardt, Rajeev Alur, and Milo M. K. Martin. 2007. CheckFence: Checking consistency of concurrent data types on relaxed memory models. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201907). ACM, New York, NY, 12--21. 
DOI:https:\/\/doi.org\/10.1145\/1250734.1250737"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-70545-1_12"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2001420.2001436"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2004.81"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250662.1250697"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"volume-title":"Proceedings of the 2016 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"ElTantawy A.","key":"e_1_2_1_31_1","unstructured":"A. ElTantawy and T. M. Aamodt. 2016. MIMD synchronization on SIMT architectures. In Proceedings of the 2016 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). 1--14."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2837614.2837615"},{"key":"e_1_2_1_33_1","unstructured":"HSA Foundation. 2016. HSA Platform System Architecture Specification 1.1. 
Retrieved from http:\/\/www.hsafoundation.com\/?ddownload=5114."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.30"},{"volume-title":"Proceedings of the 2011 44th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201911)","author":"Fung W. W. L.","key":"e_1_2_1_35_1","unstructured":"W. W. L. Fung, I. Singh, A. Brownsword, and T. M. Aamodt. 2011. Hardware transactional memory for GPU architectures. In Proceedings of the 2011 44th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201911). 296--307."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2701618"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the 1991 International Conference on Parallel Processing. 355--364","author":"Gharachorloo Kourosh","year":"1991","unstructured":"Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. 1991. Two techniques to enhance the performance of memory consistency models. In Proceedings of the 1991 International Conference on Parallel Processing. 355--364."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/325164.325102"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2002.1106016"},{"volume-title":"Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA\u201999)","author":"Gniady Chris","key":"e_1_2_1_40_1","unstructured":"Chris Gniady, Babak Falsafi, and T. N. Vijaykumar. 1999. Is SC + ILP = RC? 
In Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA\u201999). IEEE Computer Society, Washington, DC, 162--171. DOI:https:\/\/doi.org\/10.1145\/300979.300993"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835950"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00058"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835930"},{"volume-title":"Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA\u201913)","author":"Hechtman Blake A.","key":"e_1_2_1_44_1","unstructured":"Blake A. Hechtman and Daniel J. Sorin. 2013. Exploring memory consistency for massively-threaded throughput-oriented processors. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA\u201913). ACM, New York, NY, 201--212. DOI:https:\/\/doi.org\/10.1145\/2485922.2485940"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.707614"},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201914)","author":"Hower Derek R.","year":"2014","unstructured":"Derek R. Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, Mark D. Hill, Steven K. 
Reinhardt, and David A. Wood. 2014. Heterogeneous-race-free memory models. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS\u201914). ACM, New York, NY, 427--440. DOI:https:\/\/doi.org\/10.1145\/2541940.2541981"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2016.24"},{"key":"e_1_2_1_48_1","first-page":"690","article-title":"How to make a multiprocessor computer that correctly executes multiprocess programs","volume":"28","author":"Lamport L.","year":"1979","unstructured":"L. Lamport. 1979. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput. C28, 9 (Sept. 1979), 690--691. DOI:https:\/\/doi.org\/10.1109\/TC.1979.1675439","journal-title":"IEEE Trans. Comput."},{"volume-title":"Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201918)","author":"LeBeane Michael","key":"e_1_2_1_49_1","unstructured":"Michael LeBeane, Khaled Hamidouche, Brad Benton, Mauricio Breternitz, Steven K. Reinhardt, and Lizy K. John. 2018. ComP-net: Command processor networking for efficient intra-kernel communications on GPUs. 
In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201918). ACM, New York, NY, Article 29, 13 pages. DOI:https:\/\/doi.org\/10.1145\/3243176.3243179"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485964"},{"volume-title":"Proceedings of the 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201910)","author":"Lin C.","key":"e_1_2_1_51_1","unstructured":"C. Lin, V. Nagarajan, and R. Gupta. 2010. Efficient sequential consistency using conditional fences. In Proceedings of the 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201910). 295--306."},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/2150976.2151006"},{"key":"e_1_2_1_53_1","unstructured":"Yuan Lin and Vinod Grover. 2018. Using CUDA Warp-Level Primitives. 
Retrieved from https:\/\/developer.nvidia.com\/blog\/using-cuda-warp-level-primitives\/."},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304043"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304043"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-31424-7_36"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/1105734.1105747"},{"key":"e_1_2_1_58_1","unstructured":"Microsoft. [n.d.]. C++ AMP: Language and Programming Model. Retrieved from http:\/\/download.microsoft.com\/download\/2\/2\/9\/22972859-15c2-4d96-97ae-93344241d56c\/cppampopenspecificationv12.pdf."},{"key":"e_1_2_1_59_1","unstructured":"NVIDIA. 2020. CUDA C++ Programming Guide. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html."},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/2678373.2665701"},{"volume-title":"Proceedings of the 46th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO-46)","author":"Power Jason","key":"e_1_2_1_61_1","unstructured":"Jason Power, Arkaprava Basu, Junli Gu, Sooraj Puthoor, Bradford M. Beckmann, Mark D. Hill, Steven K. Reinhardt, and David A. Wood. 2013. 
Heterogeneous system coherence for integrated CPU-GPU systems. In Proceedings of the 46th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, 457--467. DOI:https:\/\/doi.org\/10.1145\/2540708.2540747"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/3158107"},{"volume-title":"Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201918)","author":"Puthoor Sooraj","key":"e_1_2_1_63_1","unstructured":"Sooraj Puthoor and Mikko H. Lipasti. 2018. Compiler assisted coalescing. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201918). Association for Computing Machinery, New York, NY, Article 11, 11 pages. DOI:https:\/\/doi.org\/10.1145\/3243176.3243203"},{"volume-title":"Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA\u201997)","author":"Ranganathan Parthasarathy","key":"e_1_2_1_64_1","unstructured":"Parthasarathy Ranganathan, Vijay S. Pai, and Sarita V. Adve. 1997. Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models. 
In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA\u201997). ACM, New York, NY, 199--210. DOI:https:\/\/doi.org\/10.1145\/258492.258512"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.40"},{"key":"e_1_2_1_66_1","unstructured":"Ben Sander. 2016. AMD GCN Assembly: Cross-Lane Operations. Retrieved from https:\/\/gpuopen.com\/learn\/amd-gcn-assembly-cross-lane-operations\/."},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/1993498.1993520"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/1785414.1785443"},{"volume-title":"Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48)","author":"Sinclair Matthew D.","key":"e_1_2_1_69_1","unstructured":"Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU synchronization without scopes: Saying no to complex consistency models. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, 647--659. DOI:https:\/\/doi.org\/10.1145\/2830772.2830821"},{"volume-title":"Proceedings of the 2015 48th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915)","author":"Sinclair M. D.","key":"e_1_2_1_70_1","unstructured":"M. D. Sinclair, J. Alsop, and S. V. Adve. 2015. Efficient GPU synchronization without scopes: Saying no to complex consistency models. 
In Proceedings of the 2015 48th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915). 647--659. DOI:https:\/\/doi.org\/10.1145\/2830772.2830821"},{"volume-title":"Proceedings of the 2015 48th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915)","author":"Singh A.","key":"e_1_2_1_71_1","unstructured":"A. Singh, S. Aga, and S. Narayanasamy. 2015. Efficiently enforcing strong memory ordering in GPUs. In Proceedings of the 2015 48th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201915). 699--712. DOI:https:\/\/doi.org\/10.1145\/2830772.2830778"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.5555\/2337159.2337220"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522351"},{"key":"e_1_2_1_74_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522351"},{"key":"e_1_2_1_75_1","volume-title":"A Primer on Memory Consistency and Cache Coherence","author":"Sorin Daniel J.","year":"2011","unstructured":"Daniel J. Sorin, Mark D. Hill, and David A. Wood. 2011. A Primer on Memory Consistency and Cache Coherence (1st ed.). Morgan & Claypool Publishers."},{"key":"e_1_2_1_76_1","unstructured":"SPARC International Inc. 1994. 
The SPARC Architecture Manual (Version 9). Prentice-Hall, Upper Saddle River, NJ."},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00042"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1145\/1273440.1250696"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669166"},{"key":"e_1_2_1_80_1","unstructured":"Sizhuo Zhang, Arvind, and Muralidaran Vijayaraghavan. 2016. Taming weak memory models. arxiv:1606.05416. Retrieved from http:\/\/arxiv.org\/abs\/1606.05416."},{"key":"e_1_2_1_81_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00021"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3428153","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3428153","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3428153","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:24:23Z","timestamp":1750195463000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3428153"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,1,20]]},"references-count":80,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,3,31]]}},"alternative-id":["10.1145\/3428153"],"URL":"https:\/\/doi.org\/10.1145\/3428153","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566
"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2021,1,20]]},"assertion":[{"value":"2020-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-01-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}