{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T09:27:25Z","timestamp":1763458045013,"version":"3.41.0"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2021,7,17]],"date-time":"2021-07-17T00:00:00Z","timestamp":1626480000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2021,12,31]]},"abstract":"<jats:p>Many emerging non-volatile memories are compatible with CMOS logic, potentially enabling their integration into a CPU\u2019s die. This article investigates such monolithically integrated CPU\u2013main memory chips. We exploit non-volatile memories employing 3D crosspoint subarrays, such as resistive RAM (ReRAM), and integrate them over the CPU\u2019s last-level cache (LLC). The regular structure of cache arrays enables co-design of the LLC and ReRAM main memory for area efficiency. We also develop a streamlined LLC\/main memory interface that employs a single shared internal interconnect for both the cache and main memory arrays, and uses a unified controller to service both LLC and main memory requests.<\/jats:p>\n          <jats:p>We apply our monolithic design ideas to a many-core CPU by integrating 3D ReRAM over each core\u2019s LLC slice. We find that co-design of the LLC and ReRAM saves 27% of the total LLC\u2013main memory area at the expense of slight increases in delay and energy. The streamlined LLC\/main memory interface saves an additional 12% in area. Our simulation results show monolithic integration of CPU and main memory improves performance by 5.3\u00d7 and 1.7\u00d7 over HBM2 DRAM for several graph and streaming kernels, respectively. 
It also reduces the memory system\u2019s energy by 6.0\u00d7 and 1.7\u00d7, respectively. Moreover, we show that the area savings of co-design permits the CPU to have 23% more cores and main memory, and that streamlining the LLC\/main memory interface incurs a small 4% performance penalty.<\/jats:p>","DOI":"10.1145\/3462632","type":"journal-article","created":{"date-parts":[[2021,7,17]],"date-time":"2021-07-17T10:05:22Z","timestamp":1626516322000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache"],"prefix":"10.1145","volume":"18","author":[{"given":"Candace","family":"Walden","sequence":"first","affiliation":[{"name":"University of Maryland, College Park"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Devesh","family":"Singh","sequence":"additional","affiliation":[{"name":"University of Maryland, College Park"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Meenatchi","family":"Jagasivamani","sequence":"additional","affiliation":[{"name":"University of Maryland, College Park"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shang","family":"Li","sequence":"additional","affiliation":[{"name":"University of Maryland, College Park"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Luyi","family":"Kang","sequence":"additional","affiliation":[{"name":"University of Maryland, College Park"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mehdi","family":"Asnaashari","sequence":"additional","affiliation":[{"name":"Crossbar Inc."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sylvain","family":"Dubois","sequence":"additional","affiliation":[{"name":"Crossbar Inc."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bruce","family":"Jacob","sequence":"additional","affiliation":[{"name":"University of Maryland, 
College Park"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Donald","family":"Yeung","sequence":"additional","affiliation":[{"name":"University of Maryland, College Park"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,7,17]]},"reference":[{"volume-title":"Proceedings of the International Symposium on Architectural Support for Programming Languages and Operating Systems.","author":"Agarwal Neha","key":"e_1_2_1_1_1","unstructured":"Neha Agarwal and Thomas F. Wenisch . 2017. Thermostat: Application-transparent page management for two-tiered main memory . In Proceedings of the International Symposium on Architectural Support for Programming Languages and Operating Systems. Neha Agarwal and Thomas F. Wenisch. 2017. Thermostat: Application-transparent page management for two-tiered main memory. In Proceedings of the International Symposium on Architectural Support for Programming Languages and Operating Systems."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2015.11"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2015.376"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2018.2882603"},{"key":"e_1_2_1_5_1","volume-title":"Retrieved","author":"Bader David A.","year":"2006","unstructured":"David A. Bader , John Feo , John Gilbert , Jeremy Kepner , David Koester , Eugene Loh , Kamesh Madduri , Bill Mann , and Theresa Meuse . 2006 . HPCS Scalable Synthetic Compact Applications #2 Graph Analysis . Retrieved May 28, 2021 from http:\/\/www.graphanalysis.org\/benchmark\/HPCS-SSCA2_Graph-Theory_v2.1.pdf. David A. Bader, John Feo, John Gilbert, Jeremy Kepner, David Koester, Eugene Loh, Kamesh Madduri, Bill Mann, and Theresa Meuse. 2006. HPCS Scalable Synthetic Compact Applications #2 Graph Analysis. 
Retrieved May 28, 2021 from http:\/\/www.graphanalysis.org\/benchmark\/HPCS-SSCA2_Graph-Theory_v2.1.pdf."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201911)","author":"Carlson T. E.","year":"2011","unstructured":"T. E. Carlson , W. Heirman , and L. Eeckhout . 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation . In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201911) . 1\u201312. DOI:https:\/\/doi.org\/10.1145\/2063384.2063454 T. E. Carlson, W. Heirman, and L. Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201911). 1\u201312. DOI:https:\/\/doi.org\/10.1145\/2063384.2063454"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2629677"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611972740.43"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCT.1971.1083337"},{"key":"e_1_2_1_11_1","unstructured":"Crossbar. 2017. ReRAM Memory Crossbar. https:\/\/www.crossbar-inc.com\/assets\/resources\/white-papers\/Crossbar-ReRAM-Technology.pdf.  Crossbar. 2017. ReRAM Memory Crossbar. https:\/\/www.crossbar-inc.com\/assets\/resources\/white-papers\/Crossbar-ReRAM-Technology.pdf."},{"key":"e_1_2_1_12_1","unstructured":"Crossbar. 2020. Personal communication.  Crossbar. 2020. Personal communication."},{"key":"e_1_2_1_13_1","volume-title":"Retrieved","author":"Cutress Ian","year":"2015","unstructured":"Ian Cutress . 2015 . SuperComputing 15: Intel\u2019s Knights Landing\/Xeon Phi Silicon on Display . 
Retrieved May 28, 2021 from https:\/\/www.anandtech.com\/show\/9802\/supercomputing-15-intels-knights-landing-xeon-phi-silicon-on-display. Ian Cutress. 2015. SuperComputing 15: Intel\u2019s Knights Landing\/Xeon Phi Silicon on Display. Retrieved May 28, 2021 from https:\/\/www.anandtech.com\/show\/9802\/supercomputing-15-intels-knights-landing-xeon-phi-silicon-on-display."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629911.1630086"},{"key":"e_1_2_1_15_1","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1109\/TCAD.2012.2185930","article-title":"NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory","volume":"31","author":"Dong Xiangyu","year":"2012","unstructured":"Xiangyu Dong , Cong Xu , Yuan Xie , and Norman P. Jouppi . 2012 . NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory . IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31 , 7 (July 2012), 994\u20131007. Xiangyu Dong, Cong Xu, Yuan Xie, and Norman P. Jouppi. 2012. NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31, 7 (July 2012), 994\u20131007.","journal-title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2901318.2901344"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3054021"},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the International Conference on Parallel Processing. 312\u2013321","author":"Gupta Anoop","year":"1990","unstructured":"Anoop Gupta , Wolf Dietrich Weber , and Todd Mowry . 1990 . Reducing memory and traffic requirements for scalable directory-based cache coherence schemes . In Proceedings of the International Conference on Parallel Processing. 312\u2013321 . 
Anoop Gupta, Wolf Dietrich Weber, and Todd Mowry. 1990. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proceedings of the International Conference on Parallel Processing. 312\u2013321."},{"key":"e_1_2_1_19_1","volume-title":"The Inquirer","author":"Demerjian Charlie","year":"2004","unstructured":"Charlie Demerjian . 2004 . Sun\u2019s Niagara falls neatly into multithreaded place . The Inquirer , 02 November 2004. Charlie Demerjian. 2004. Sun\u2019s Niagara falls neatly into multithreaded place. The Inquirer, 02 November 2004."},{"volume-title":"Retrieved","year":"2012","key":"e_1_2_1_20_1","unstructured":"Intel. 2012 . Intel Software Development Emulator . Retrieved May 28, 2021 from http:\/\/software.intel.com\/en-us\/articles\/intel-software-development-emulator. Intel. 2012. Intel Software Development Emulator. Retrieved May 28, 2021 from http:\/\/software.intel.com\/en-us\/articles\/intel-software-development-emulator."},{"volume-title":"Retrieved","year":"2017","key":"e_1_2_1_21_1","unstructured":"Intel. 2017 . AVX 512 Instruction Extensions . Retrieved May 28, 2021 from http:\/\/software.intel.com\/en-us\/blogs\/2013\/avx-512-instructions. Intel. 2017. AVX 512 Instruction Extensions. Retrieved May 28, 2021 from http:\/\/software.intel.com\/en-us\/blogs\/2013\/avx-512-instructions."},{"volume-title":"Retrieved","year":"2017","key":"e_1_2_1_22_1","unstructured":"Intel. 2017 . Intel Optane Technology . Retrieved May 28, 2021 from http:\/\/www.intel.com\/content\/www\/us\/en\/architecture-and-technology\/intel-optane-technology.html. Intel. 2017. Intel Optane Technology. 
Retrieved May 28, 2021 from http:\/\/www.intel.com\/content\/www\/us\/en\/architecture-and-technology\/intel-optane-technology.html."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/COOLCHIPS49199.2020.9097632"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2019.2944335"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357526.3357561"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1021\/nl8037689"},{"volume-title":"Proceedings of the IEEE International Electron Devices Meeting.","author":"Jo Sung Hyun","key":"e_1_2_1_27_1","unstructured":"Sung Hyun Jo , T. Kumar , S. Narayanan , W. D. Lu , and H. Nazarian . 2014. 3D-stackable crossbar resistive memory based on field assisted superlinear threshold (FAST) selector . In Proceedings of the IEEE International Electron Devices Meeting. Sung Hyun Jo, T. Kumar, S. Narayanan, W. D. Lu, and H. Nazarian. 2014. 3D-stackable crossbar resistive memory based on field assisted superlinear threshold (FAST) selector. In Proceedings of the IEEE International Electron Devices Meeting."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/54.922799"},{"volume-title":"Proceedings of the International Symposium on Computer Architecture. 140\u2013151","author":"Kelm John H.","key":"e_1_2_1_29_1","unstructured":"John H. Kelm , Daniel R. Johnson , Matthew R. Johnson , Neal C. Crago , William Tuohy , Aqeel Mahesri , Steven S. Lumetta , Matthew I. Frank , and Sanjay J. Patel . 2009. Rigel: An architecture and scalable programming interface for a 1000-core accelerator . In Proceedings of the International Symposium on Computer Architecture. 140\u2013151 . John H. Kelm, Daniel R. Johnson, Matthew R. Johnson, Neal C. Crago, William Tuohy, Aqeel Mahesri, Steven S. Lumetta, Matthew I. Frank, and Sanjay J. Patel. 2009. Rigel: An architecture and scalable programming interface for a 1000-core accelerator. 
In Proceedings of the International Symposium on Computer Architecture. 140\u2013151."},{"key":"e_1_2_1_30_1","volume-title":"Dongsoo Lee, Seung Ryul Lee, Man Chang, Ji Hyun Hur, Young-Bae Kim, et\u00a0al.","author":"Lee Myoung-Jae","year":"2011","unstructured":"Myoung-Jae Lee , Chang Bum Lee , Dongsoo Lee, Seung Ryul Lee, Man Chang, Ji Hyun Hur, Young-Bae Kim, et\u00a0al. 2011 . A fast, high-endurance and scalable non-volatile memory device made from asymmetric Ta2O5-x\/TaO2-x bilayer structures. Nature Materials 10 (Aug. 2011), 625\u2013630. Myoung-Jae Lee, Chang Bum Lee, Dongsoo Lee, Seung Ryul Lee, Man Chang, Ji Hyun Hur, Young-Bae Kim, et\u00a0al. 2011. A fast, high-endurance and scalable non-volatile memory device made from asymmetric Ta2O5-x\/TaO2-x bilayer structures. Nature Materials 10 (Aug. 2011), 625\u2013630."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2013.98"},{"key":"e_1_2_1_32_1","volume-title":"Yuhwan Ro, Nam Sung Kim, and Jung Ho Ahn.","author":"Lee Sukhan","year":"2018","unstructured":"Sukhan Lee , HyunYoon Cho , Young Hoon Son , Yuhwan Ro, Nam Sung Kim, and Jung Ho Ahn. 2018 . Leveraging power-performance relationship of energy-efficient modern DRAM devices. IEEE Access 6 (June 2018), 31387\u201331398. Sukhan Lee, HyunYoon Cho, Young Hoon Son, Yuhwan Ro, Nam Sung Kim, and Jung Ho Ahn. 2018. Leveraging power-performance relationship of energy-efficient modern DRAM devices. IEEE Access 6 (June 2018), 31387\u201331398."},{"volume-title":"Proceedings of the International Symposium on Microarchitecture.","author":"Li Sheng","key":"e_1_2_1_33_1","unstructured":"Sheng Li , Jung Ho Ahn , Richard D. Strong , Jay B. Brockman , Dean M. Tullsen , and Norman P. Jouppi . 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures . In Proceedings of the International Symposium on Microarchitecture. Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. 
Tullsen, and Norman P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the International Symposium on Microarchitecture."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2020.2973991"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1065010.1065034"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2010.5416635"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.30"},{"volume-title":"Proceedings of the 50th International Symposium on Microarchitecture.","author":"O\u2019Connor Mike","key":"e_1_2_1_38_1","unstructured":"Mike O\u2019Connor , Niladrish Chatterjee , Donghyuk Lee , John Wilson , Aditya Agrawal , Stephen W. Keckler , and William J. Dally . 2017. Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems . In Proceedings of the 50th International Symposium on Microarchitecture. Mike O\u2019Connor, Niladrish Chatterjee, Donghyuk Lee, John Wilson, Aditya Agrawal, Stephen W. Keckler, and William J. Dally. 2017. Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems. In Proceedings of the 50th International Symposium on Microarchitecture."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669117"},{"volume-title":"Proceedings of the International Symposium on Computer Architecture.","author":"Qureshi Moinuddin K.","key":"e_1_2_1_40_1","unstructured":"Moinuddin K. Qureshi , Vijayalakshmi, and Jude A. Rivers . 2009. Scalable high performance main memory system using phase-change memory technology . In Proceedings of the International Symposium on Computer Architecture. Moinuddin K. Qureshi, Vijayalakshmi, and Jude A. Rivers. 2009. Scalable high performance main memory system using phase-change memory technology. 
In Proceedings of the International Symposium on Computer Architecture."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1995896.1995911"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2011.18"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485963"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/1360612.1360617"},{"key":"e_1_2_1_45_1","volume-title":"Jung Ho Ahn, and Norman Jouppi","author":"Thoziyoor Shyamkumar","year":"2008","unstructured":"Shyamkumar Thoziyoor , Naveen Muralimanohar , Jung Ho Ahn, and Norman Jouppi . 2008 . CACTI 5.1. Technical Report. HP Laboratories . Shyamkumar Thoziyoor, Naveen Muralimanohar, Jung Ho Ahn, and Norman Jouppi. 2008. CACTI 5.1. Technical Report. HP Laboratories."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3240302.3240310"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056056"},{"volume-title":"Proceedings of the 2011 38th Annual International Symposium on Computer Architecture (ISCA\u201911)","author":"Yoon D. H.","key":"e_1_2_1_48_1","unstructured":"D. H. Yoon , M. K. Jeong , and M. Erez . 2011. Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput . In Proceedings of the 2011 38th Annual International Symposium on Computer Architecture (ISCA\u201911) . 295\u2013306. D. H. Yoon, M. K. Jeong, and M. Erez. 2011. Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput. In Proceedings of the 2011 38th Annual International Symposium on Computer Architecture (ISCA\u201911). 295\u2013306."},{"volume-title":"Proceedings of the 43rd International Symposium on Computer Architecture.","author":"Zhang Lunkay","key":"e_1_2_1_49_1","unstructured":"Lunkay Zhang , Brian Neely , Diana Franklin , Dmitri Strukov , Yuan Xie , and Frederic T. Chong . 2016. 
Mellow writes: Extending lifetime in resistive memories through selective slow write backs . In Proceedings of the 43rd International Symposium on Computer Architecture. Lunkay Zhang, Brian Neely, Diana Franklin, Dmitri Strukov, Yuan Xie, and Frederic T. Chong. 2016. Mellow writes: Extending lifetime in resistive memories through selective slow write backs. In Proceedings of the 43rd International Symposium on Computer Architecture."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2009.30"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3462632","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3462632","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:19:02Z","timestamp":1750191542000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3462632"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,17]]},"references-count":50,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2021,12,31]]}},"alternative-id":["10.1145\/3462632"],"URL":"https:\/\/doi.org\/10.1145\/3462632","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2021,7,17]]},"assertion":[{"value":"2020-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2021-07-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}