{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,21]],"date-time":"2025-09-21T16:58:14Z","timestamp":1758473894030,"version":"3.41.0"},"reference-count":96,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2022,8,22]],"date-time":"2022-08-22T00:00:00Z","timestamp":1661126400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Science Foundation","award":["CCF 16-19245"],"award-info":[{"award-number":["CCF 16-19245"]}]},{"DOI":"10.13039\/100000185","name":"DARPA","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Domain-Specific System on Chip (DSSoC) program, a Google Faculty Research"},{"name":"Applications Driving Architectures (ADA) Research Center"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2022,9,30]]},"abstract":"<jats:p>Hardware specialization is becoming a key enabler of energy-efficient performance. Future systems will be increasingly heterogeneous, integrating multiple specialized and programmable accelerators, each with different memory demands. Traditionally, communication between accelerators has been inefficient, typically orchestrated through explicit DMA transfers between different address spaces. More recently, industry has proposed unified coherent memory which enables implicit data movement and more data reuse, but often these interfaces limit the coherence flexibility available to heterogeneous systems.<\/jats:p><jats:p>This paper demonstrates the benefits of fine-grained coherence specialization for heterogeneous systems. 
We propose an architecture that enables low-complexity independent specialization of each individual coherence request in heterogeneous workloads by building upon a simple and flexible baseline coherence interface, Spandex. We then describe how to optimize individual memory requests to improve cache reuse and performance-critical memory latency in emerging heterogeneous workloads. Collectively, our techniques enable significant gains, reducing execution time by up to 61% or network traffic by up to 99% while adding minimal complexity to the Spandex protocol.<\/jats:p>","DOI":"10.1145\/3530819","type":"journal-article","created":{"date-parts":[[2022,4,25]],"date-time":"2022-04-25T16:30:15Z","timestamp":1650904215000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["A Case for Fine-grain Coherence Specialization in Heterogeneous Systems"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5272-2396","authenticated-orcid":false,"given":"Johnathan","family":"Alsop","sequence":"first","affiliation":[{"name":"AMD Research, Bellevue, WA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1303-2461","authenticated-orcid":false,"given":"Weon Taek","family":"Na","sequence":"additional","affiliation":[{"name":"MIT, Cambridge, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0189-7895","authenticated-orcid":false,"given":"Matthew D.","family":"Sinclair","sequence":"additional","affiliation":[{"name":"University of Wisconsin - Madison, Madison, WI, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5411-356X","authenticated-orcid":false,"given":"Samuel","family":"Grayson","sequence":"additional","affiliation":[{"name":"University of Illinois at Urbana-Champaign, Urbana, IL, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3403-5119","authenticated-orcid":false,"given":"Sarita","family":"Adve","sequence":"additional","affiliation":[{"name":"University of Illinois 
at Urbana-Champaign, Urbana, IL, USA"}]}],"member":"320","published-online":{"date-parts":[[2022,8,22]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Accel-Sim GPGPU-Sim configurations. https:\/\/github.com\/accel-sim\/gpgpu-sim_distribution\/tree\/dev\/configs\/tested-cfgs."},{"key":"e_1_3_2_3_2","unstructured":"2019. Compute Express Link: Breakthrough CPU-to-Device Interconnect. http:\/\/www.computeexpresslink.org. (2019)."},{"key":"e_1_3_2_4_2","doi-asserted-by":"crossref","first-page":"204","DOI":"10.1109\/HPCA.1997.569661","volume-title":"High-Performance Computer Architecture, 1997., Third International Symposium on","author":"Abdel-Shafi Hazim","year":"1997","unstructured":"Hazim Abdel-Shafi, Jonathan Hall, Sarita V. Adve, and Vikram S. Adve. 1997. An evaluation of fine-grain producer-initiated communication in cache-coherent multiprocessors. In High-Performance Computer Architecture, 1997., Third International Symposium on. IEEE, 204\u2013215."},{"key":"e_1_3_2_5_2","article-title":"Analyzing the performance of mutation operators to solve the travelling salesman problem","volume":"1203","author":"Abdoun Otman","year":"2012","unstructured":"Otman Abdoun, Jaafar Abouchabaka, and Chakir Tajani. 2012. Analyzing the performance of mutation operators to solve the travelling salesman problem. CoRR abs\/1203.3099 (2012). arXiv:1203.3099 http:\/\/arxiv.org\/abs\/1203.3099.","journal-title":"CoRR"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.5555\/762761.762762"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.5555\/193889"},{"key":"e_1_3_2_8_2","first-page":"33","volume-title":"ISPASS","author":"Agarwal Niket","year":"2009","unstructured":"Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, and Niraj K. Jha. 2009. GARNET: A detailed on-chip network model inside a full-system simulator. In ISPASS. 33\u201342."},{"key":"e_1_3_2_9_2","volume-title":"Proc. 
High-Performance Graphics","author":"Aila Timo","year":"2009","unstructured":"Timo Aila and Samuli Laine. 2009. Understanding the efficiency of ray traversal on GPUs. In Proc. High-Performance Graphics (2009)."},{"key":"e_1_3_2_10_2","first-page":"172","volume-title":"ISCA","author":"Alsop Johnathan","year":"2018","unstructured":"Johnathan Alsop, Matthew D. Sinclair, and Sarita V. Adve. 2018. Spandex: A flexible interface for efficient heterogeneous coherence. In ISCA. IEEE, 172\u2013182."},{"key":"e_1_3_2_11_2","series-title":"International Conference on Machine Learning","first-page":"173","author":"Amodei Dario","year":"2016","unstructured":"Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu. 2016. Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning(ICML\u201916). 173\u2013182."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.1996.501195"},{"key":"e_1_3_2_13_2","unstructured":"ARM. 2018. AMBA 5 CHI Architecture Specification. http:\/\/infocenter.arm.com\/help\/index.jsp?topic=\/com.arm.doc.ihi0050c\/index.html. (2018)."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378498"},{"key":"e_1_3_2_15_2","first-page":"163","volume-title":"ISPASS","author":"Bakhoda Ali","year":"2009","unstructured":"Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS. 
163\u2013174."},{"key":"e_1_3_2_16_2","volume-title":"Tutorial at the 48th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Beckmann Bradford M.","year":"2015","unstructured":"Bradford M. Beckmann and Anthony Gutierrez. 2015. The AMD gem5 APU simulator: Modeling heterogeneous systems in gem5. In Tutorial at the 48th Annual IEEE\/ACM International Symposium on Microarchitecture."},{"key":"e_1_3_2_17_2","first-page":"213","volume-title":"Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques","author":"Beckmann Nathan","year":"2013","unstructured":"Nathan Beckmann and Daniel Sanchez. 2013. Jigsaw: Scalable software-defined caches. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. IEEE, 213\u2013224."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2004.09.004"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3370748.3406564"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/300979.301004"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.5555\/1116644.1116671"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/RADAR.2008.4720947"},{"key":"e_1_3_2_23_2","article-title":"Cache Coherent Interconnect for Accelerators (CCIX)","year":"2018","unstructured":"CCIX. 2018. Cache Coherent Interconnect for Accelerators (CCIX). http:\/\/www.ccixconsortium.com. (2018).","journal-title":"http:\/\/www.ccixconsortium.com"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSAA.2015.7344872"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.11"},{"key":"e_1_3_2_26_2","first-page":"328","volume-title":"High Performance Computer Architecture, (HPCA\u201907). IEEE 13th International Symposium on","author":"Cheng Liqun","year":"2007","unstructured":"Liqun Cheng, John B. Carter, and Donglai Dai. 2007. 
An adaptive cache coherence protocol optimized for producer-consumer sharing. In High Performance Computer Architecture, (HPCA\u201907). IEEE 13th International Symposium on. IEEE, 328\u2013339."},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.21"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/1281192.1281214"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/2959100.2959190"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.1993.698549"},{"issue":"3","key":"e_1_3_2_31_2","first-page":"26","article-title":"The effects of granularity and adaptivity on private\/shared classification for coherence","volume":"12","author":"Davari Mahdad","year":"2015","unstructured":"Mahdad Davari, Alberto Ros, Erik Hagersten, and Stefanos Kaxiras. 2015. The effects of granularity and adaptivity on private\/shared classification for coherence. ACM Transactions on Architecture and Code Optimization (TACO\u201915) 12, 3 (2015), 26.","journal-title":"ACM Transactions on Architecture and Code Optimization (TACO\u201915)"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-61474-5_86"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-51285-3_30"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/1555754.1555779"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2013.6557152"},{"key":"e_1_3_2_37_2","volume-title":"HPCA","author":"Hechtman Blake A.","year":"2014","unstructured":"Blake A. Hechtman, Shuai Che, Derek R. Hower, Yingying Tian, Bradford M. Beckmann, Mark D. Hill, Steven K. Reinhardt, and David A. Wood. 2014. QuickRelease: A throughput-oriented approach to release consistency on GPUs. 
In HPCA."},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-11515-8_3"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454138"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.11"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.1999.744359"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2010.82"},{"key":"e_1_3_2_43_2","volume-title":"2020 ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA\u201920)","author":"Khairy Mahmoud","year":"2020","unstructured":"Mahmoud Khairy, Zhesheng Shen, Tor M. Aamodt, and Timothy G. Rogers. 2020. Accel-Sim: An extensible simulation framework for validated GPU modeling. In 2020 ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA\u201920). IEEE."},{"key":"e_1_3_2_44_2","volume-title":"Exploiting Software Information for an Efficient Memory Hierarchy","author":"Komuravelli Rakesh","year":"2015","unstructured":"Rakesh Komuravelli. 2015. Exploiting Software Information for an Efficient Memory Hierarchy. Ph.D. Dissertation. University of Illinois at Urbana-Champaign."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/2663345"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080239"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/71.553274"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.42"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485967"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2000.854385"},{"key":"e_1_3_2_51_2","first-page":"48","volume-title":"ISCA","author":"Lebeck Alvin R.","year":"1995","unstructured":"Alvin R. Lebeck and David A. Wood. 1995. Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors. In ISCA. 
48\u201359."},{"key":"e_1_3_2_52_2","unstructured":"Yann LeCun. 2015. LeNet-5 convolutional neural networks. 20 (2015) http:\/\/yann.lecun.com\/exdb\/lenet."},{"key":"e_1_3_2_53_2","first-page":"53","volume-title":"International Conference on Artificial Neural Networks","volume":"60","author":"LeCun Yann","year":"1995","unstructured":"Yann LeCun, L. D. Jackel, Leon Bottou, A. Brunot, Corinna Cortes, J. S. Denker, Harris Drucker, I. Guyon, U. A. Muller, Eduard Sackinger, Patrice Simard, and Vladimir Vapnik. 1995. Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, Vol. 60. Perth, Australia, 53\u201360."},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.12"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2015.2424962"},{"key":"e_1_3_2_56_2","first-page":"553","volume-title":"2017 IEEE International Symposium on High Performance Computer Architecture (HPCA\u201917)","author":"Lu Wenyan","year":"2017","unstructured":"Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA\u201917). IEEE, 553\u2013564."},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/2.982916"},{"key":"e_1_3_2_58_2","article-title":"Agile SoC development with open ESP","author":"Mantovani Paolo","year":"2020","unstructured":"Paolo Mantovani, Davide Giri, Giuseppe Di Guglielmo, Luca Piccolboni, Joseph Zuckerman, Emilio G. Cota, Michele Petracca, Christian Pilato, and Luca P. Carloni. 2020. Agile SoC development with open ESP. arXiv preprint arXiv:2009.01178 (2020).","journal-title":"arXiv preprint arXiv:2009.01178"},{"key":"e_1_3_2_59_2","first-page":"182","volume-title":"ACM SIGARCH Computer Architecture News","author":"Martin Milo M. 
K.","year":"2003","unstructured":"Milo M. K. Martin, Mark D. Hill, and David A. Wood. 2003. Token coherence: Decoupling performance and correctness. In ACM SIGARCH Computer Architecture News, Vol. 31. ACM, 182\u2013193."},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/859618.859642"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/1105734.1105747"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/COMST.2018.2844341"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1145\/279358.279386"},{"issue":"1","key":"e_1_3_2_64_2","first-page":"105","article-title":"Carrizo: A high performance, energy efficient 28 nm APU","volume":"51","author":"Munger Benjamin","year":"2016","unstructured":"Benjamin Munger, David Akeson, Srikanth Arekapudi, Tom Burd, Harry R. Fair, Jim Farrell, Dave Johnson, Guhan Krishnan, Hugh McIntyre, Edward McLellan, Samuel Naffziger, Russell Schreiber, Sriram Sundaram, Jonathan White, and Kathryn Wilcox. 2016. Carrizo: A high performance, energy efficient 28 nm APU. JSSC 51, 1 (2016), 105\u2013116.","journal-title":"JSSC"},{"key":"e_1_3_2_65_2","doi-asserted-by":"crossref","first-page":"5C2\u20131","DOI":"10.1109\/ICNSURV.2016.7486356","volume-title":"2016 Integrated Communications Navigation and Surveillance (ICNS\u201916)","author":"Nanduri Anvardh","year":"2016","unstructured":"Anvardh Nanduri and Lance Sherry. 2016. Anomaly detection in aircraft data using Recurrent Neural Networks (RNN). In 2016 Integrated Communications Navigation and Surveillance (ICNS\u201916). IEEE, 5C2\u20131."},{"key":"e_1_3_2_66_2","article-title":"Deep learning recommendation model for personalization and recommendation systems","author":"Naumov Maxim","year":"2019","unstructured":"Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. 
Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091 (2019).","journal-title":"arXiv preprint arXiv:1906.00091"},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-58184-7_115"},{"key":"e_1_3_2_68_2","unstructured":"OpenCAPI. 2017. Welcome to OpenCAPI Consortium. http:\/\/www.opencapi.org. (2017)."},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1145\/2464996.2465443"},{"key":"e_1_3_2_70_2","first-page":"2891","volume-title":"Advances in Neural Information Processing Systems","author":"Park Seunghyun","year":"2017","unstructured":"Seunghyun Park, Seonwoo Min, Hyun-Soo Choi, and Sungroh Yoon. 2017. Deep recurrent neural network-based identification of precursor microRNAs. In Advances in Neural Information Processing Systems. 2891\u20132900."},{"key":"e_1_3_2_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.1994.81"},{"key":"e_1_3_2_72_2","article-title":"Network traffic anomaly detection using recurrent neural networks","author":"Radford Benjamin J.","year":"2018","unstructured":"Benjamin J. Radford, Leonardo M. Apolonio, Antonio J. Trias, and Jim A. Simpson. 2018. Network traffic anomaly detection using recurrent neural networks. arXiv preprint arXiv:1803.10769 (2018).","journal-title":"arXiv preprint arXiv:1803.10769"},{"key":"e_1_3_2_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/224170.224399"},{"key":"e_1_3_2_74_2","first-page":"1","volume-title":"2008 IEEE International Symposium on Parallel and Distributed Processing","author":"Ros Alberto","year":"2008","unstructured":"Alberto Ros, Manuel E. Acacio, and Jos\u00e9 M. Garc\u00eda. 2008. DiCo-CMP: Efficient cache coherency in tiled CMP architectures. 
In 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 1\u201311."},{"key":"e_1_3_2_75_2","first-page":"124","volume-title":"ISMASCTS","author":"Rothman Jeffrey B.","year":"2000","unstructured":"Jeffrey B. Rothman and Alan Jay Smith. 2000. Sector cache design and performance. In ISMASCTS. 124\u2013133."},{"key":"e_1_3_2_76_2","first-page":"299303","volume-title":"Proceedings of the 2011 International Conference on Circuits, System and Simulation, Singapore","volume":"2829","author":"Sapon Muhammad Akmal","year":"2011","unstructured":"Muhammad Akmal Sapon, Khadijah Ismail, and Suehazlyn Zainudin. 2011. Prediction of diabetes by using artificial neural network. In Proceedings of the 2011 International Conference on Circuits, System and Simulation, Singapore, Vol. 2829. 299303."},{"key":"e_1_3_2_77_2","volume-title":"Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques","author":"Sethia Ankit","year":"2013","unstructured":"Ankit Sethia, Ganesh Dasika, Mehrzad Samadi, and Scott Mahlke. 2013. APOGEE: Adaptive prefetching on GPUs for energy efficiency. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. IEEE Press."},{"key":"e_1_3_2_78_2","volume-title":"MICRO","author":"Sinclair Matthew D.","year":"2015","unstructured":"Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2015. Efficient GPU synchronization without scopes: Saying no to complex consistency models. 
In MICRO."},{"key":"e_1_3_2_79_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522351"},{"key":"e_1_3_2_80_2","doi-asserted-by":"publisher","DOI":"10.1145\/1555815.1555766"},{"key":"e_1_3_2_81_2","doi-asserted-by":"publisher","DOI":"10.5555\/2028905"},{"issue":"6","key":"e_1_3_2_82_2","first-page":"228","article-title":"Whippletree: Task-based scheduling of dynamic workloads on the GPU","volume":"33","author":"Steinberger Markus","year":"2014","unstructured":"Markus Steinberger, Michael Kenzel, Pedro Boechat, Bernhard Kerbl, Mark Dokter, and Dieter Schmalstieg. 2014. Whippletree: Task-based scheduling of dynamic workloads on the GPU. ACM Transactions on Graphics (TOG\u201914) 33, 6 (2014), 228.","journal-title":"ACM Transactions on Graphics (TOG\u201914)"},{"key":"e_1_3_2_83_2","doi-asserted-by":"publisher","DOI":"10.1145\/173682.165147"},{"key":"e_1_3_2_84_2","volume-title":"IEEE International Symposium on Workload Characterization (IISWC\u201916)","author":"Sun Yifan","year":"2016","unstructured":"Yifan Sun, Xiang Gong, Amir Kavyan Ziabari, Leiming Yu, Xiangyu Li, Saoni Mukherjee, Carter McCardwell, Alejandro Villegas, and David Kaeli. 2016. Hetero-mark, A benchmark suite for CPU-GPU collaborative computing. In IEEE International Symposium on Workload Characterization (IISWC\u201916)."},{"key":"e_1_3_2_85_2","article-title":"Welcome to The Gen-Z Consortium!","author":"Consortium The Gen-Z","year":"2017","unstructured":"The Gen-Z Consortium. 2017. Welcome to The Gen-Z Consortium! http:\/\/genzconsortium.org. (2017).","journal-title":"http:\/\/genzconsortium.org"},{"key":"e_1_3_2_86_2","doi-asserted-by":"publisher","DOI":"10.1145\/2716282.2716283"},{"issue":"6","key":"e_1_3_2_87_2","article-title":"False sharing and spatial locality in multiprocessor caches","volume":"43","author":"Torrellas Josep","year":"1994","unstructured":"Josep Torrellas, H. S. Lam, and John L. Hennessy. 1994. False sharing and spatial locality in multiprocessor caches. 
TOCS 43, 6 (1994).","journal-title":"TOCS"},{"key":"e_1_3_2_88_2","volume-title":"Supercomputing","author":"Trancoso Pedro","year":"1996","unstructured":"Pedro Trancoso and Josep Torrellas. 1996. The impact of speeding up critical sections with data prefetching and forwarding. In Supercomputing. ACM\/IEEE."},{"key":"e_1_3_2_89_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00074"},{"key":"e_1_3_2_90_2","volume-title":"2018 ACM\/IEEE 45th Annual International Symposium on Computer Architecture (ISCA\u201918)","author":"Vijaykumar Nandita","year":"2018","unstructured":"Nandita Vijaykumar, Abhilasha Jain, Diptesh Majumdar, Kevin Hsieh, Gennady Pekhimenko, Eiman Ebrahimi, Nastaran Hajinazar, Phillip B. Gibbons, and Onur Mutlu. 2018. A case for richer cross-layer abstractions: Bridging the semantic gap with expressive memory. In 2018 ACM\/IEEE 45th Annual International Symposium on Computer Architecture (ISCA\u201918). IEEE."},{"key":"e_1_3_2_91_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2002.1106018"},{"key":"e_1_3_2_92_2","doi-asserted-by":"publisher","DOI":"10.5555\/1233791.1233796"},{"key":"e_1_3_2_93_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.512"},{"key":"e_1_3_2_94_2","doi-asserted-by":"publisher","DOI":"10.1145\/1278177.1278183"},{"key":"e_1_3_2_95_2","first-page":"951","volume-title":"2018 USENIX Annual Technical Conference (USENIX ATC 18)","author":"Zhang Minjia","year":"2018","unstructured":"Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He. 2018. DeepCPU: Serving RNN-based deep learning models 10x faster. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 
951\u2013965."},{"key":"e_1_3_2_96_2","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485969"},{"key":"e_1_3_2_97_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123978"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3530819","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3530819","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3530819","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:25Z","timestamp":1750183765000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3530819"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,22]]},"references-count":96,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2022,9,30]]}},"alternative-id":["10.1145\/3530819"],"URL":"https:\/\/doi.org\/10.1145\/3530819","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2022,8,22]]},"assertion":[{"value":"2021-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-08-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}